Remove a misguided WARN that attempts to detect the scenario where using
a special A/D tracking flag will set reserved bits on a non-MMIO spte.
The WARN triggers false positives when using EPT with 32-bit KVM because
of the !64-bit clause, which is just flat out wrong. The whole A/D
tracking goo is specific to EPT, and one of the big selling points of EPT
is that EPT is decoupled from the host's native paging mode.
Drop the WARN instead of trying to salvage the check. Keeping a check
specific to A/D tracking bits would essentially regurgitate the same code
that led to KVM needed the tracking bits in the first place.
A better approach would be to add a generic WARN on reserved bits being
set, which would naturally cover the A/D tracking bits, work for all
flavors of paging, and be self-documenting to some extent.
Fixes: 8a406c8953 ("KVM: x86/mmu: Rename and document A/D scheme for TDP SPTEs")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
To remove code duplication, use the binary stats descriptors in the
implementation of the debugfs interface for statistics. This unifies
the definition of statistics for the binary and debugfs interfaces.
Signed-off-by: Jing Zhang <jingzhangos@google.com>
Message-Id: <20210618222709.1858088-8-jingzhangos@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a VCPU ioctl to get a statistics file descriptor by which a read
functionality is provided for userspace to read out VCPU stats header,
descriptors and data.
Define VCPU statistics descriptors and header for all architectures.
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com> #arm64
Signed-off-by: Jing Zhang <jingzhangos@google.com>
Message-Id: <20210618222709.1858088-5-jingzhangos@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a VM ioctl to get a statistics file descriptor by which a read
functionality is provided for userspace to read out VM stats header,
descriptors and data.
Define VM statistics descriptors and header for all architectures.
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com> #arm64
Signed-off-by: Jing Zhang <jingzhangos@google.com>
Message-Id: <20210618222709.1858088-4-jingzhangos@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This commit defines the API for userspace and prepare the common
functionalities to support per VM/VCPU binary stats data readings.
The KVM stats now is only accessible by debugfs, which has some
shortcomings this change series are supposed to fix:
1. The current debugfs stats solution in KVM could be disabled
when kernel Lockdown mode is enabled, which is a potential
rick for production.
2. The current debugfs stats solution in KVM is organized as "one
stats per file", it is good for debugging, but not efficient
for production.
3. The stats read/clear in current debugfs solution in KVM are
protected by the global kvm_lock.
Besides that, there are some other benefits with this change:
1. All KVM VM/VCPU stats can be read out in a bulk by one copy
to userspace.
2. A schema is used to describe KVM statistics. From userspace's
perspective, the KVM statistics are self-describing.
3. With the fd-based solution, a separate telemetry would be able
to read KVM stats in a less privileged environment.
4. After the initial setup by reading in stats descriptors, a
telemetry only needs to read the stats data itself, no more
parsing or setup is needed.
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com> #arm64
Signed-off-by: Jing Zhang <jingzhangos@google.com>
Message-Id: <20210618222709.1858088-3-jingzhangos@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Generic KVM stats are those collected in architecture independent code
or those supported by all architectures; put all generic statistics in
a separate structure. This ensures that they are defined the same way
in the statistics API which is being added, removing duplication among
different architectures in the declaration of the descriptors.
No functional change intended.
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Jing Zhang <jingzhangos@google.com>
Message-Id: <20210618222709.1858088-2-jingzhangos@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Treat a NULL shadow page in the "is a TDP MMU" check as valid, non-TDP
root. KVM uses a "direct" PAE paging MMU when TDP is disabled and the
guest is running with paging disabled. In that case, root_hpa points at
the pae_root page (of which only 32 bytes are used), not a standard
shadow page, and the WARN fires (a lot).
Fixes: 0b873fd7fb ("KVM: x86/mmu: Remove redundant is_tdp_mmu_enabled check")
Cc: David Matlack <dmatlack@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622072454.3449146-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Mark #ACs that won't be reinjected to the guest as wanted by L0 so that
KVM handles split-lock #AC from L2 instead of forwarding the exception to
L1. Split-lock #AC isn't yet virtualized, i.e. L1 will treat it like a
regular #AC and do the wrong thing, e.g. reinject it into L2.
Fixes: e6f8b6c12f ("KVM: VMX: Extend VMXs #AC interceptor to handle split lock #AC in guest")
Cc: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622172244.3561540-1-seanjc@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In the case where kvm_memslots_have_rmaps(kvm) is false the boolean
variable flush is not set and is uninitialized. If is_tdp_mmu_enabled(kvm)
is true then the call to kvm_tdp_mmu_zap_collapsible_sptes passes the
uninitialized value of flush into the call. Fix this by initializing
flush to false.
Addresses-Coverity: ("Uninitialized scalar variable")
Fixes: e2209710cc ("KVM: x86/mmu: Skip rmap operations if rmaps not allocated")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622150912.23429-1-colin.king@canonical.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Failed VM-entry is often due to a faulty core. To help identify bad
cores, print the id of the last logical processor that attempted
VM-entry whenever dumping a VMCS or VMCB.
Signed-off-by: Jim Mattson <jmattson@google.com>
Message-Id: <20210621221648.1833148-1-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The PKRU value of a task is stored in task->thread.pkru when the task is
scheduled out. PKRU is restored on schedule in from there. So keeping the
XSAVE buffer up to date is a pointless exercise.
Remove the xstate fiddling and cleanup all related functions.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121456.897372712@linutronix.de
write_pkru() was originally used just to write to the PKRU register. It
was mercifully short and sweet and was not out of place in pgtable.h with
some other pkey-related code.
But, later work included a requirement to also modify the task XSAVE
buffer when updating the register. This really is more related to the
XSAVE architecture than to paging.
Move the read/write_pkru() to asm/pkru.h. pgtable.h won't miss them.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121455.102647114@linutronix.de
This is not a copy functionality. It restores the register state from the
supplied kernel buffer.
No functional changes.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121454.716058365@linutronix.de
A copy is guaranteed to leave the source intact, which is not the case when
FNSAVE is used as that reinitilizes the registers.
Save does not make such guarantees and it matches what this is about,
i.e. to save the state for a later restore.
Rename it to save_fpregs_to_fpstate().
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121454.508853062@linutronix.de
PKRU is being removed from the kernel XSAVE/FPU buffers. This removal
will probably include warnings for code that look up PKRU in those
buffers.
KVM currently looks up the location of PKRU but doesn't even use the
pointer that it gets back. Rework the code to avoid calling
get_xsave_addr() except in cases where its result is actually used.
This makes the code more clear and also avoids the inevitable PKRU
warnings.
This is probably a good cleanup and could go upstream idependently
of any PKRU rework.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121453.541037562@linutronix.de
Calculate the max VMCS index for vmcs12 by walking the array to find the
actual max index. Hardcoding the index is prone to bitrot, and the
calculation is only done on KVM bringup (albeit on every CPU, but there
aren't _that_ many null entries in the array).
Fixes: 3c0f99366e ("KVM: nVMX: Add a TSC multiplier field in VMCS12")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210618214658.2700765-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
As part of smaller maxphyaddr emulation, kvm needs to intercept
present page faults to see if it needs to add the RSVD flag (bit 3) to
the error code. However, there is no need to intercept page faults
that already have the RSVD flag set. When setting up the page fault
intercept, add the RSVD flag into the #PF error code mask field (but
not the #PF error code match field) to skip the intercept when the
RSVD flag is already set.
Signed-off-by: Jim Mattson <jmattson@google.com>
Message-Id: <20210618235941.1041604-1-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The root_hpa checks below the top-level check in kvm_mmu_page_fault are
theoretically redundant since there is no longer a way for the root_hpa
to be reset during a page fault. The details of why are described in
commit ddce620821 ("KVM: x86/mmu: Move root_hpa validity checks to top
of page fault handler")
__direct_map, kvm_tdp_mmu_map, and get_mmio_spte are all only reachable
through kvm_mmu_page_fault, therefore their root_hpa checks are
redundant.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210617231948.2591431-5-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This change simplifies the call sites slightly and also abstracts away
the implementation detail of looking at root_hpa as the mechanism for
determining if the mmu is the TDP MMU.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210617231948.2591431-4-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This check is redundant because the root shadow page will only be a TDP
MMU page if is_tdp_mmu_enabled() returns true, and is_tdp_mmu_enabled()
never changes for the lifetime of a VM.
It's possible that this check was added for performance reasons but it
is unlikely that it is useful in practice since to_shadow_page() is
cheap. That being said, this patch also caches the return value of
is_tdp_mmu_root() in direct_page_fault() since there's no reason to
duplicate the call so many times, so performance is not a concern.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210617231948.2591431-3-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The check for is_tdp_mmu_root in kvm_tdp_mmu_map is redundant because
kvm_tdp_mmu_map's only caller (direct_page_fault) already checks
is_tdp_mmu_root.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210617231948.2591431-2-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If is_tdp_mmu_root is not inlined, the elimination of TDP MMU calls as dead
code might not work out. To avoid this, explicitly declare the stubbed
is_tdp_mmu_root on 32-bit hosts.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
WARN if NX is reported as supported but not enabled in EFER. All flavors
of the kernel, including non-PAE 32-bit kernels, set EFER.NX=1 if NX is
supported, even if NX usage is disable via kernel command line. KVM relies
on NX being enabled if it's supported, e.g. KVM will generate illegal NPT
entries if nx_huge_pages is enabled and NX is supported but not enabled.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <20210615164535.2146172-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Refuse to load KVM if NX support is not available. Shadow paging has
assumed NX support since commit 9167ab7993 ("KVM: vmx, svm: always run
with EFER.NXE=1 when shadow paging is active"), and NPT has assumed NX
support since commit b8e8c8303f ("kvm: mmu: ITLB_MULTIHIT mitigation").
While the NX huge pages mitigation should not be enabled by default for
AMD CPUs, it can be turned on by userspace at will.
Unlike Intel CPUs, AMD does not provide a way for firmware to disable NX
support, and Linux always sets EFER.NX=1 if it is supported. Given that
it's extremely unlikely that a CPU supports NPT but not NX, making NX a
formal requirement is far simpler than adding requirements to the
mitigation flow.
Fixes: 9167ab7993 ("KVM: vmx, svm: always run with EFER.NXE=1 when shadow paging is active")
Fixes: b8e8c8303f ("kvm: mmu: ITLB_MULTIHIT mitigation")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <20210615164535.2146172-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Refuse to load KVM if NX support is not available and EPT is not enabled.
Shadow paging has assumed NX support since commit 9167ab7993 ("KVM:
vmx, svm: always run with EFER.NXE=1 when shadow paging is active"), so
for all intents and purposes this has been a de facto requirement for
over a year.
Do not require NX support if EPT is enabled purely because Intel CPUs let
firmware disable NX support via MSR_IA32_MISC_ENABLES. If not for that,
VMX (and KVM as a whole) could require NX support with minimal risk to
breaking userspace.
Fixes: 9167ab7993 ("KVM: vmx, svm: always run with EFER.NXE=1 when shadow paging is active")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <20210615164535.2146172-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This commit in sched/urgent moved the cfs_rq_is_decayed() function:
a7b359fc6a: ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
and this fresh commit in sched/core modified it in the old location:
9e077b52d8: ("sched/pelt: Check that *_avg are null when *_sum are")
Merge the two variants.
Conflicts:
kernel/sched/fair.c
Signed-off-by: Ingo Molnar <mingo@kernel.org>
reported by syzkaller ("KVM: x86: Immediately reset the MMU context when the SMM
flag is cleared").
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmDLldwUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroPTOgf/XpAehLdWlx2877ulcBD0Kjt0tLvH
OFHRD1Ir0d1Ay3DX8qmxLquXHB4CoDGZBwi1d7AI165kUP/XLmV0bY6TZ74inI/P
CaD8Bsbm8+iBl5jrovEPc+223bK+3OFOvo2pk6M/MlsO/ExRikaPDtHOnkfousbl
nLX8v2qd7ihWyJOdLJMU9pV8E2iczQoCuH9yWBHdCrxRxWtPzkEekPWb0sujByiF
4tD7sqiEA3ugbF1Wm5keQV63NLplfxx+Zun0FV+tbpjjxQWAGl81dP+zmqok0sM/
qQCyZevt6jLLkL+Fn6hI6PP9OTeYreX2fgwhWXs71d2js33yNg5Veqx5Bw==
=Gs/y
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"Miscellaneous bugfixes.
The main interesting one is a NULL pointer dereference reported by
syzkaller ("KVM: x86: Immediately reset the MMU context when the SMM
flag is cleared")"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: selftests: Fix kvm_check_cap() assertion
KVM: x86/mmu: Calculate and check "full" mmu_role for nested MMU
KVM: X86: Fix x86_emulator slab cache leak
KVM: SVM: Call SEV Guest Decommission if ASID binding fails
KVM: x86: Immediately reset the MMU context when the SMM flag is cleared
KVM: x86: Fix fall-through warnings for Clang
KVM: SVM: fix doc warnings
KVM: selftests: Fix compiling errors when initializing the static structure
kvm: LAPIC: Restore guard to prevent illegal APIC register access
TDP MMU iterator's level is identical to page table's actual level. For
instance, for the last level page table (whose entry points to one 4K
page), iter->level is 1 (PG_LEVEL_4K), and in case of 5 level paging,
the iter->level is mmu->shadow_root_level, which is 5. However, struct
kvm_mmu_page's level currently is not set correctly when it is allocated
in kvm_tdp_mmu_map(). When iterator hits non-present SPTE and needs to
allocate a new child page table, currently iter->level, which is the
level of the page table where the non-present SPTE belongs to, is used.
This results in struct kvm_mmu_page's level always having its parent's
level (excpet root table's level, which is initialized explicitly using
mmu->shadow_root_level).
This is kinda wrong, and not consistent with existing non TDP MMU code.
Fortuantely sp->role.level is only used in handle_removed_tdp_mmu_page()
and kvm_tdp_mmu_zap_sp(), and they are already aware of this and behave
correctly. However to make it consistent with legacy MMU code (and fix
the issue that both root page table and its child page table have
shadow_root_level), use iter->level - 1 in kvm_tdp_mmu_map(), and change
handle_removed_tdp_mmu_page() and kvm_tdp_mmu_zap_sp() accordingly.
Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Message-Id: <bcb6569b6e96cb78aaa7b50640e6e6b53291a74e.1623717884.git.kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently pf_fixed is not increased when prefault is true. This is not
correct, since prefault here really means "async page fault completed".
In that case, the original page fault from the guest was morphed into as
async page fault and pf_fixed was not increased. So when prefault
indicates async page fault is completed, pf_fixed should be increased.
Additionally, currently pf_fixed is also increased even when page fault
is spurious, while legacy MMU increases pf_fixed when page fault returns
RET_PF_EMULATE or RET_PF_FIXED.
To fix above two issues, change to increase pf_fixed when return value
is not RET_PF_SPURIOUS (RET_PF_RETRY has already been ruled out by
reaching here).
More information:
https://lore.kernel.org/kvm/cover.1620200410.git.kai.huang@intel.com/T/#mbb5f8083e58a2cd262231512b9211cbe70fc3bd5
Fixes: bb18842e21 ("kvm: x86/mmu: Add TDP MMU PF handler")
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Message-Id: <2ea8b7f5d4f03c99b32bc56fc982e1e4e3d3fc6b.1623717884.git.kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently tdp_mmu_map_handle_target_level() returns 0, which is
RET_PF_RETRY, when page fault is actually fixed. This makes
kvm_tdp_mmu_map() also return RET_PF_RETRY in this case, instead of
RET_PF_FIXED. Fix by initializing ret to RET_PF_FIXED.
Note that kvm_mmu_page_fault() resumes guest on both RET_PF_RETRY and
RET_PF_FIXED, which means in practice returning the two won't make
difference, so this fix alone won't be necessary for stable tree.
Fixes: bb18842e21 ("kvm: x86/mmu: Add TDP MMU PF handler")
Reviewed-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Message-Id: <f9e8956223a586cd28c090879a8ff40f5eb6d609.1623717884.git.kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM_GET_LAPIC stores the current value of TMCCT and KVM_SET_LAPIC's memcpy
stores it in vcpu->arch.apic->regs, KVM_SET_LAPIC could store zero in
vcpu->arch.apic->regs after it uses it, and then the stored value would
always be zero. In addition, the TMCCT is always computed on-demand and
never directly readable.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1623223000-18116-1-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This hypercall is used by the SEV guest to notify a change in the page
encryption status to the hypervisor. The hypercall should be invoked
only when the encryption attribute is changed from encrypted -> decrypted
and vice versa. By default all guest pages are considered encrypted.
The hypercall exits to userspace to manage the guest shared regions and
integrate with the userspace VMM's migration code.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Steve Rutherford <srutherford@google.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <90778988e1ee01926ff9cac447aacb745f954c8c.1623174621.git.ashish.kalra@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Snapshot kvm->stats.nx_lpage_splits into a local unsigned long to avoid
64-bit division on 32-bit kernels. Casting to an unsigned long is safe
because the maximum number of shadow pages, n_max_mmu_pages, is also an
unsigned long, i.e. KVM will start recycling shadow pages before the
number of splits can exceed a 32-bit value.
ERROR: modpost: "__udivdi3" [arch/x86/kvm/kvm.ko] undefined!
Fixes: 7ee093d4f3f5 ("KVM: switch per-VM stats to u64")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210615162905.2132937-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When APICv is active, interrupt injection doesn't raise KVM_REQ_EVENT
request (see __apic_accept_irq()) as the required work is done by hardware.
In case KVM_REQ_APICV_UPDATE collides with such injection, the interrupt
may never get delivered.
Currently, the described situation is hardly possible: all
kvm_request_apicv_update() calls normally happen upon VM creation when
no interrupts are pending. We are, however, going to move unconditional
kvm_request_apicv_update() call from kvm_hv_activate_synic() to
synic_update_vector() and without this fix 'hyperv_connections' test from
kvm-unit-tests gets stuck on IPI delivery attempt right after configuring
a SynIC route which triggers APICv disablement.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210609150911.1471882-4-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop the explicit check on EPTP switching being enabled. The EPTP
switching check is handled in the generic VMFUNC function check, while
the underlying VMFUNC enablement check is done by hardware and redone
by generic VMFUNC emulation.
The vmcs12 EPT check is handled by KVM at VM-Enter in the form of a
consistency check, keep it but add a WARN.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-16-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
WARN and inject #UD when emulating VMFUNC for L2 if the function is
out-of-bounds or if VMFUNC is not enabled in vmcs12. Neither condition
should occur in practice, as the CPU is supposed to prioritize the #UD
over VM-Exit for out-of-bounds input and KVM is supposed to enable
VMFUNC in vmcs02 if and only if it's enabled in vmcs12, but neither of
those dependencies is obvious.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-15-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove the @reset_roots param from kvm_init_mmu(), the one user,
kvm_mmu_reset_context() has already unloaded the MMU and thus freed and
invalidated all roots. This also happens to be why the reset_roots=true
paths doesn't leak roots; they're already invalid.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-14-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Defer the MMU sync on PCID invalidation so that multiple sync requests in
a single VM-Exit are batched. This is a very minor optimization as
checking for unsync'd children is quite cheap.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-13-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use __kvm_mmu_new_pgd() via kvm_init_shadow_ept_mmu() to emulate
VMFUNC[EPTP_SWITCH] instead of nuking all MMUs. EPTP_SWITCH is the EPT
equivalent of MOV to CR3, i.e. is a perfect fit for the common PGD flow,
the only hiccup being that A/D enabling is buried in the EPTP. But, that
is easily handled by bouncing through kvm_init_shadow_ept_mmu().
Explicitly request a guest TLB flush if VPID is disabled. Per Intel's
SDM, if VPID is disabled, "an EPTP-switching VMFUNC invalidates combined
mappings associated with VPID 0000H (for all PCIDs and for all EP4TA
values, where EP4TA is the value of bits 51:12 of EPTP)".
Note, this technically is a very bizarre bug fix of sorts if L2 is using
PAE paging, as avoiding the full MMU reload also avoids incorrectly
reloading the PDPTEs, which the SDM explicitly states are not touched:
If PAE paging is in use, an EPTP-switching VMFUNC does not load the
four page-directory-pointer-table entries (PDPTEs) from the
guest-physical address in CR3. The logical processor continues to use
the four guest-physical addresses already present in the PDPTEs. The
guest-physical address in CR3 is not translated through the new EPT
paging structures (until some operation that would load the PDPTEs).
In addition to optimizing L2's MMU shenanigans, avoiding the full reload
also optimizes L1's MMU as KVM_REQ_MMU_RELOAD wipes out all roots in both
root_mmu and guest_mmu.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-12-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use KVM_REQ_TLB_FLUSH_GUEST instead of KVM_REQ_MMU_RELOAD when emulating
INVPCID of all contexts. In the current code, this is a glorified nop as
TLB_FLUSH_GUEST becomes kvm_mmu_unload(), same as MMU_RELOAD, when TDP
is disabled, which is the only time INVPCID is only intercepted+emulated.
In the future, reusing TLB_FLUSH_GUEST will simplify optimizing paths
that emulate a guest TLB flush, e.g. by synchronizing as needed instead
of completely unloading all MMUs.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-11-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When emulating INVVPID for L1, free only L2+ roots, using the guest_mode
tag in the MMU role to identify L2+ roots. From L1's perspective, its
own TLB entries use VPID=0, and INVVPID is not requied to invalidate such
entries. Per Intel's SDM, INVVPID _may_ invalidate entries with VPID=0,
but it is not required to do so.
Cc: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop the dedicated nested_vmx_transition_mmu_sync() now that the MMU sync
is handled via KVM_REQ_TLB_FLUSH_GUEST, and fold that flush into the
all-encompassing nested_vmx_transition_tlb_flush().
Opportunistically add a comment explaning why nested EPT never needs to
sync the MMU on VM-Enter.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop skip_mmu_sync and skip_tlb_flush from __kvm_mmu_new_pgd() now that
all call sites unconditionally skip both the sync and flush.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Introduce nested_svm_transition_tlb_flush() and use it force an MMU sync
and TLB flush on nSVM VM-Enter and VM-Exit instead of sneaking the logic
into the __kvm_mmu_new_pgd() call sites. Add a partial todo list to
document issues that need to be addressed before the unconditional sync
and flush can be modified to look more like nVMX's logic.
In addition to making nSVM's forced flushing more overt (guess who keeps
losing track of it), the new helper brings further convergence between
nSVM and nVMX, and also sets the stage for dropping the "skip" params
from __kvm_mmu_new_pgd().
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Stop leveraging the MMU sync and TLB flush requested by the fast PGD
switch helper now that kvm_set_cr3() manually handles the necessary sync,
frees, and TLB flush. This will allow dropping the params from the fast
PGD helpers since nested SVM is now the odd blob out.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Flush and sync all PGDs for the current/target PCID on MOV CR3 with a
TLB flush, i.e. without PCID_NOFLUSH set. Paraphrasing Intel's SDM
regarding the behavior of MOV to CR3:
- If CR4.PCIDE = 0, invalidates all TLB entries associated with PCID
000H and all entries in all paging-structure caches associated with
PCID 000H.
- If CR4.PCIDE = 1 and NOFLUSH=0, invalidates all TLB entries
associated with the PCID specified in bits 11:0, and all entries in
all paging-structure caches associated with that PCID. It is not
required to invalidate entries in the TLBs and paging-structure
caches that are associated with other PCIDs.
- If CR4.PCIDE=1 and NOFLUSH=1, is not required to invalidate any TLB
entries or entries in paging-structure caches.
Extract and reuse the logic for INVPCID(single) which is effectively the
same flow and works even if CR4.PCIDE=0, as the current PCID will be '0'
in that case, thus honoring the requirement of flushing PCID=0.
Continue passing skip_tlb_flush to kvm_mmu_new_pgd() even though it
_should_ be redundant; the clean up will be done in a future patch. The
overhead of an unnecessary nop sync is minimal (especially compared to
the actual sync), and the TLB flush is handled via request. Avoiding the
the negligible overhead is not worth the risk of breaking kernels that
backport the fix.
Fixes: 956bf3531f ("kvm: x86: Skip shadow page resync on CR3 switch when indicated by guest")
Cc: Junaid Shahid <junaids@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop bogus logic that incorrectly clobbers the accessed/dirty enabling
status of the nested MMU on an EPTP switch. When nested EPT is enabled,
walk_mmu points at L2's _legacy_ page tables, not L1's EPT for L2.
This is likely a benign bug, as mmu->ept_ad is never consumed (since the
MMU is not a nested EPT MMU), and stuffing mmu_role.base.ad_disabled will
never propagate into future shadow pages since the nested MMU isn't used
to map anything, just to walk L2's page tables.
Note, KVM also does a full MMU reload, i.e. the guest_mmu will be
recreated using the new EPTP, and thus any change in A/D enabling will be
properly recognized in the relevant MMU.
Fixes: 41ab937274 ("KVM: nVMX: Emulate EPTP switching for the L1 hypervisor")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use BIT_ULL() instead of an open-coded shift to check whether or not a
function is enabled in L1's VMFUNC bitmap. This is a benign bug as KVM
supports only bit 0, and will fail VM-Enter if any other bits are set,
i.e. bits 63:32 are guaranteed to be zero.
Note, "function" is bounded by hardware as VMFUNC will #UD before taking
a VM-Exit if the function is greater than 63.
Before:
if ((vmcs12->vm_function_control & (1 << function)) == 0)
0x000000000001a916 <+118>: mov $0x1,%eax
0x000000000001a91b <+123>: shl %cl,%eax
0x000000000001a91d <+125>: cltq
0x000000000001a91f <+127>: and 0x128(%rbx),%rax
After:
if (!(vmcs12->vm_function_control & BIT_ULL(function & 63)))
0x000000000001a955 <+117>: mov 0x128(%rbx),%rdx
0x000000000001a95c <+124>: bt %rax,%rdx
Fixes: 27c42a1bb8 ("KVM: nVMX: Enable VMFUNC for the L1 hypervisor")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Trigger a full TLB flush on behalf of the guest on nested VM-Enter and
VM-Exit when VPID is disabled for L2. kvm_mmu_new_pgd() syncs only the
current PGD, which can theoretically leave stale, unsync'd entries in a
previous guest PGD, which could be consumed if L2 is allowed to load CR3
with PCID_NOFLUSH=1.
Rename KVM_REQ_HV_TLB_FLUSH to KVM_REQ_TLB_FLUSH_GUEST so that it can
be utilized for its obvious purpose of emulating a guest TLB flush.
Note, there is no change the actual TLB flush executed by KVM, even
though the fast PGD switch uses KVM_REQ_TLB_FLUSH_CURRENT. When VPID is
disabled for L2, vpid02 is guaranteed to be '0', and thus
nested_get_vpid02() will return the VPID that is shared by L1 and L2.
Generate the request outside of kvm_mmu_new_pgd(), as getting the common
helper to correctly identify which requested is needed is quite painful.
E.g. using KVM_REQ_TLB_FLUSH_GUEST when nested EPT is in play is wrong as
a TLB flush from the L1 kernel's perspective does not invalidate EPT
mappings. And, by using KVM_REQ_TLB_FLUSH_GUEST, nVMX can do future
simplification by moving the logic into nested_vmx_transition_tlb_flush().
Fixes: 41fab65e7c ("KVM: nVMX: Skip MMU sync on nested VMX transition when possible")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
VMCS12 is used to keep the authoritative state during nested state
migration. In case 'need_vmcs12_to_shadow_sync' flag is set, we're
in between L2->L1 vmexit and L1 guest run when actual sync to
enlightened (or shadow) VMCS happens. Nested state, however, has
no flag for 'need_vmcs12_to_shadow_sync' so vmx_set_nested_state()->
set_current_vmptr() always sets it. Enlightened vmptrld path, however,
doesn't have the quirk so some VMCS12 changes may not get properly
reflected to eVMCS and L1 will see an incorrect state.
Note, during L2 execution or when need_vmcs12_to_shadow_sync is not
set the change is effectively a nop: in the former case all changes
will get reflected during the first L2->L1 vmexit and in the later
case VMCS12 and eVMCS are already in sync (thanks to
copy_enlightened_to_vmcs12() in vmx_get_nested_state()).
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210526132026.270394-11-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When nested state migration happens during L1's execution, it
is incorrect to modify eVMCS as it is L1 who 'owns' it at the moment.
At least genuine Hyper-V seems to not be very happy when 'clean fields'
data changes underneath it.
'Clean fields' data is used in KVM twice: by copy_enlightened_to_vmcs12()
and prepare_vmcs02_rare() so we can reset it from prepare_vmcs02() instead.
While at it, update a comment stating why exactly we need to reset
'hv_clean_fields' data from L0.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210526132026.270394-10-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
'need_vmcs12_to_shadow_sync' is used for both shadow and enlightened
VMCS sync when we exit to L1. The comment in nested_vmx_failValid()
validly states why shadow vmcs sync can be omitted but this doesn't
apply to enlightened VMCS as it 'shadows' all VMCS12 fields.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210526132026.270394-9-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
'Clean fields' data from enlightened VMCS is only valid upon vmentry: L1
hypervisor is not obliged to keep it up-to-date while it is mangling L2's
state, KVM_GET_NESTED_STATE request may come at a wrong moment when actual
eVMCS changes are unsynchronized with 'hv_clean_fields'. As upon migration
VMCS12 is used as a source of ultimate truth, we must make sure we pick all
the changes to eVMCS and thus 'clean fields' data must be ignored.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210526132026.270394-8-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Unlike VMREAD/VMWRITE/VMPTRLD, VMCLEAR is a valid instruction when
enlightened VMCS is in use. TLFS has the following brief description:
"The L1 hypervisor can execute a VMCLEAR instruction to transition an
enlightened VMCS from the active to the non-active state". Normally,
this change can be ignored as unmapping active eVMCS can be postponed
until the next VMLAUNCH instruction but in case nested state is migrated
with KVM_GET_NESTED_STATE/KVM_SET_NESTED_STATE, keeping eVMCS mapped
may result in its synchronization with VMCS12 and this is incorrect:
L1 hypervisor is free to reuse inactive eVMCS memory for something else.
Inactive eVMCS after VMCLEAR can just be unmapped.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210526132026.270394-7-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Unlike regular set_current_vmptr(), nested_vmx_handle_enlightened_vmptrld()
can not be called directly from vmx_set_nested_state() as KVM may not have
all the information yet (e.g. HV_X64_MSR_VP_ASSIST_PAGE MSR may not be
restored yet). Enlightened VMCS is mapped later while getting nested state
pages. In the meantime, vmx->nested.hv_evmcs_vmptr remains 'EVMPTR_INVALID'
and it's indistinguishable from 'evmcs is not in use' case. This leads to
certain issues, in particular, if KVM_GET_NESTED_STATE is called right
after KVM_SET_NESTED_STATE, KVM_STATE_NESTED_EVMCS flag in the resulting
state will be unset (and such state will later fail to load).
Introduce 'EVMPTR_MAP_PENDING' state to detect not-yet-mapped eVMCS after
restore. With this, the 'is_guest_mode(vcpu)' hack in vmx_has_valid_vmcs12()
is no longer needed.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210526132026.270394-6-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
copy_vmcs12_to_enlightened()/copy_enlightened_to_vmcs12() don't return any result,
make them return 'void'.
No functional change intended.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210526132026.270394-5-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In theory, L1 can try to disable enlightened VMENTRY in VP assist page and
try to issue VMLAUNCH/VMRESUME. While nested_vmx_handle_enlightened_vmptrld()
properly handles this as 'EVMPTRLD_DISABLED', previously mapped eVMCS
remains mapped and thus all evmptr_is_valid() checks will still pass and
nested_vmx_run() will proceed when it shouldn't.
Release eVMCS immediately when we detect that enlightened vmentry was
disabled by L1.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210526132026.270394-4-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
'dirty_vmcs12' is only checked in prepare_vmcs02_early()/prepare_vmcs02()
and both checks look like:
'vmx->nested.dirty_vmcs12 || evmptr_is_valid(vmx->nested.hv_evmcs_vmptr)'
so for eVMCS case the flag changes nothing. Drop the assignment to avoid
the confusion.
No functional change intended.
Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210526132026.270394-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Instead of checking 'vmx->nested.hv_evmcs' use '-1' in
'vmx->nested.hv_evmcs_vmptr' to indicate 'evmcs is not in use' state. This
matches how we check 'vmx->nested.current_vmptr'. Introduce EVMPTR_INVALID
and evmptr_is_valid() and use it instead of raw '-1' check as a preparation
to adding other 'special' values.
No functional change intended.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210526132026.270394-2-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
if new KVM_*_SREGS2 ioctls are used, the PDPTRs are
a part of the migration state and are correctly
restored by those ioctls.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210607090203.133058-9-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This is a new version of KVM_GET_SREGS / KVM_SET_SREGS.
It has the following changes:
* Has flags for future extensions
* Has vcpu's PDPTRs, allowing to save/restore them on migration.
* Lacks obsolete interrupt bitmap (done now via KVM_SET_VCPU_EVENTS)
New capability, KVM_CAP_SREGS2 is added to signal
the userspace of this ioctl.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210607090203.133058-8-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Small refactoring that will be used in the next patch.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210607090203.133058-7-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Similar to the rest of guest page accesses after a migration,
this access should be delayed to KVM_REQ_GET_NESTED_STATE_PAGES.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210607090203.133058-6-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Document the actual reason why we need to do it
on migration and move the call to svm_set_nested_state
to be closer to VMX code.
To avoid loading the PDPTRs from possibly not up to date memory map,
in nested_svm_load_cr3 after the move, move this code to
.get_nested_state_pages.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210607090203.133058-5-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Kill off pdptrs_changed() and instead go through the full kvm_set_cr3()
for PAE guest, even if the new CR3 is the same as the current CR3. For
VMX, and SVM with NPT enabled, the PDPTRs are unconditionally marked as
unavailable after VM-Exit, i.e. the optimization is dead code except for
SVM without NPT.
In the unlikely scenario that anyone cares about SVM without NPT _and_ a
PAE guest, they've got bigger problems if their guest is loading the same
CR3 so frequently that the performance of kvm_set_cr3() is notable,
especially since KVM's fast PGD switching means reloading the same CR3
does not require a full rebuild. Given that PAE and PCID are mutually
exclusive, i.e. a sync and flush are guaranteed in any case, the actual
benefits of the pdptrs_changed() optimization are marginal at best.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210607090203.133058-4-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove the "PDPTRs unchanged" check to skip PDPTR loading during nested
SVM transitions as it's not at all an optimization. Reading guest memory
to get the PDPTRs isn't magically cheaper by doing it in pdptrs_changed(),
and if the PDPTRs did change, KVM will end up doing the read twice.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210607090203.133058-3-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove the pdptrs_changed() check when loading L2's CR3. The set of
available registers is always reset when switching VMCSes (see commit
e5d03de593, "KVM: nVMX: Reset register cache (available and dirty
masks) on VMCS switch"), thus the "are PDPTRs available" check will
always fail. And even if it didn't fail, reading guest memory to check
the PDPTRs is just as expensive as reading guest memory to load 'em.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210607090203.133058-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Hypercalls which use extended processor masks are only available when
HV_X64_EX_PROCESSOR_MASKS_RECOMMENDED privilege bit is exposed (and
'RECOMMENDED' is rather a misnomer).
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210521095204.2161214-28-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Hyper-V partition must possess 'HV_X64_CLUSTER_IPI_RECOMMENDED'
privilege ('recommended' is rather a misnomer) to issue
HVCALL_SEND_IPI hypercalls.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210521095204.2161214-27-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Hyper-V partition must possess 'HV_X64_REMOTE_TLB_FLUSH_RECOMMENDED'
privilege ('recommended' is rather a misnomer) to issue
HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST/SPACE hypercalls.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210521095204.2161214-26-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Hyper-V partition must possess 'HV_DEBUGGING' privilege to issue
HVCALL_POST_DEBUG_DATA/HVCALL_RETRIEVE_DEBUG_DATA/
HVCALL_RESET_DEBUG_SESSION hypercalls.
Note, when SynDBG is disabled hv_check_hypercall_access() returns
'true' (like for any other unknown hypercall) so the result will
be HV_STATUS_INVALID_HYPERCALL_CODE and not HV_STATUS_ACCESS_DENIED.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210521095204.2161214-25-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
TLFS6.0b states that partition issuing HVCALL_NOTIFY_LONG_SPIN_WAIT must
posess 'UseHypercallForLongSpinWait' privilege but there's no
corresponding feature bit. Instead, we have "Recommended number of attempts
to retry a spinlock failure before notifying the hypervisor about the
failures. 0xFFFFFFFF indicates never notify." Use this to check access to
the hypercall. Also, check against zero as the corresponding CPUID must
be set (and '0' attempts before re-try is weird anyway).
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210521095204.2161214-22-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Introduce hv_check_hypercallr_access() to check if the particular hypercall
should be available to guest, this will be used with
KVM_CAP_HYPERV_ENFORCE_CPUID mode.
No functional change intended.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210521095204.2161214-21-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Synthetic timers can only be configured in 'direct' mode when
HV_STIMER_DIRECT_MODE_AVAILABLE bit was exposed.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210521095204.2161214-20-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Access to all MSRs is now properly checked. To avoid 'forgetting' to
properly check access to new MSRs in the future change the default
to 'false' meaning 'no access'.
No functional change intended.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210521095204.2161214-19-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Synthetic debugging MSRs (HV_X64_MSR_SYNDBG_CONTROL,
HV_X64_MSR_SYNDBG_STATUS, HV_X64_MSR_SYNDBG_SEND_BUFFER,
HV_X64_MSR_SYNDBG_RECV_BUFFER, HV_X64_MSR_SYNDBG_PENDING_BUFFER,
HV_X64_MSR_SYNDBG_OPTIONS) are only available to guest when
HV_FEATURE_DEBUG_MSRS_AVAILABLE bit is exposed.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210521095204.2161214-18-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
HV_X64_MSR_CRASH_P0 ... HV_X64_MSR_CRASH_P4, HV_X64_MSR_CRASH_CTL are only
available to guest when HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE bit is
exposed.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210521095204.2161214-17-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
HV_X64_MSR_REENLIGHTENMENT_CONTROL/HV_X64_MSR_TSC_EMULATION_CONTROL/
HV_X64_MSR_TSC_EMULATION_STATUS are only available to guest when
HV_ACCESS_REENLIGHTENMENT bit is exposed.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210521095204.2161214-16-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
HV_X64_MSR_TSC_FREQUENCY/HV_X64_MSR_APIC_FREQUENCY are only available to
guest when HV_ACCESS_FREQUENCY_MSRS bit is exposed.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210521095204.2161214-15-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
HV_X64_MSR_EOI, HV_X64_MSR_ICR, HV_X64_MSR_TPR, and
HV_X64_MSR_VP_ASSIST_PAGE are only available to guest when
HV_MSR_APIC_ACCESS_AVAILABLE bit is exposed.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210521095204.2161214-14-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Synthetic timers MSRs (HV_X64_MSR_STIMER[0-3]_CONFIG,
HV_X64_MSR_STIMER[0-3]_COUNT) are only available to guest when
HV_MSR_SYNTIMER_AVAILABLE bit is exposed.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210521095204.2161214-13-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
SynIC MSRs (HV_X64_MSR_SCONTROL, HV_X64_MSR_SVERSION, HV_X64_MSR_SIEFP,
HV_X64_MSR_SIMP, HV_X64_MSR_EOM, HV_X64_MSR_SINT0 ... HV_X64_MSR_SINT15)
are only available to guest when HV_MSR_SYNIC_AVAILABLE bit is exposed.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210521095204.2161214-12-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
HV_X64_MSR_REFERENCE_TSC is only available to guest when
HV_MSR_REFERENCE_TSC_AVAILABLE bit is exposed.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210521095204.2161214-11-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
HV_X64_MSR_RESET is only available to guest when HV_MSR_RESET_AVAILABLE bit
is exposed.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210521095204.2161214-10-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
HV_X64_MSR_VP_INDEX is only available to guest when
HV_MSR_VP_INDEX_AVAILABLE bit is exposed.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210521095204.2161214-9-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
HV_X64_MSR_TIME_REF_COUNT is only available to guest when
HV_MSR_TIME_REF_COUNT_AVAILABLE bit is exposed.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210521095204.2161214-8-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
HV_X64_MSR_VP_RUNTIME is only available to guest when
HV_MSR_VP_RUNTIME_AVAILABLE bit is exposed.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210521095204.2161214-7-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
HV_X64_MSR_GUEST_OS_ID/HV_X64_MSR_HYPERCALL are only available to guest
when HV_MSR_HYPERCALL_AVAILABLE bit is exposed.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210521095204.2161214-6-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Introduce hv_check_msr_access() to check if the particular MSR
should be accessible by guest, this will be used with
KVM_CAP_HYPERV_ENFORCE_CPUID mode.
No functional change intended.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210521095204.2161214-5-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Limiting exposed Hyper-V features requires a fast way to check if the
particular feature is exposed in guest visible CPUIDs or not. To aboid
looping through all CPUID entries on every hypercall/MSR access cache
the required leaves on CPUID update.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210521095204.2161214-4-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Modeled after KVM_CAP_ENFORCE_PV_FEATURE_CPUID, the new capability allows
for limiting Hyper-V features to those exposed to the guest in Hyper-V
CPUIDs (0x40000003, 0x40000004, ...).
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210521095204.2161214-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
From Hyper-V TLFS:
"The hypervisor exposes hypercalls (HvFlushVirtualAddressSpace,
HvFlushVirtualAddressSpaceEx, HvFlushVirtualAddressList, and
HvFlushVirtualAddressListEx) that allow operating systems to more
efficiently manage the virtual TLB. The L1 hypervisor can choose to
allow its guest to use those hypercalls and delegate the responsibility
to handle them to the L0 hypervisor. This requires the use of a
partition assist page."
Add the Direct Virtual Flush support for SVM.
Related VMX changes:
commit 6f6a657c99 ("KVM/Hyper-V/VMX: Add direct tlb flush support")
Signed-off-by: Vineeth Pillai <viremana@linux.microsoft.com>
Message-Id: <fc8d24d8eb7017266bb961e39a171b0caf298d7f.1622730232.git.viremana@linux.microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Enlightened MSR-Bitmap as per TLFS:
"The L1 hypervisor may collaborate with the L0 hypervisor to make MSR
accesses more efficient. It can enable enlightened MSR bitmaps by setting
the corresponding field in the enlightened VMCS to 1. When enabled, L0
hypervisor does not monitor the MSR bitmaps for changes. Instead, the L1
hypervisor must invalidate the corresponding clean field after making
changes to one of the MSR bitmaps."
Enable this for SVM.
Related VMX changes:
commit ceef7d10df ("KVM: x86: VMX: hyper-v: Enlightened MSR-Bitmap support")
Signed-off-by: Vineeth Pillai <viremana@linux.microsoft.com>
Message-Id: <87df0710f95d28b91cc4ea014fc4d71056eebbee.1622730232.git.viremana@linux.microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
SVM added support for certain reserved fields to be used by
software or hypervisor. Add the following reserved fields:
- VMCB offset 0x3e0 - 0x3ff
- Clean bit 31
- SVM intercept exit code 0xf0000000
Later patches will make use of this for supporting Hyper-V
nested virtualization enhancements.
Signed-off-by: Vineeth Pillai <viremana@linux.microsoft.com>
Message-Id: <a1f17a43a8e9e751a1a9cc0281649d71bdbf721b.1622730232.git.viremana@linux.microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently the remote TLB flush logic is specific to VMX.
Move it to a common place so that SVM can use it as well.
Signed-off-by: Vineeth Pillai <viremana@linux.microsoft.com>
Message-Id: <4f4e4ca19778437dae502f44363a38e99e3ef5d1.1622730232.git.viremana@linux.microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add the following per-VCPU statistic to KVM debugfs to show if a given
VCPU is in guest mode:
guest_mode
Also add this as a per-VM statistic to KVM debugfs to show the total number
of VCPUs that are in guest mode in a given VM.
Signed-off-by: Krish Sadhukhan <Krish.Sadhukhan@oracle.com>
Message-Id: <20210609180340.104248-3-krish.sadhukhan@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently, the 'nested_run' statistic counts all guest-entry attempts,
including those that fail during vmentry checks on Intel and during
consistency checks on AMD. Convert this statistic to count only those
guest-entries that make it past these state checks and make it to guest
code. This will tell us the number of guest-entries that actually executed
or tried to execute guest code.
Signed-off-by: Krish Sadhukhan <Krish.Sadhukhan@oracle.com>
Message-Id: <20210609180340.104248-2-krish.sadhukhan@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now that .post_leave_smm() is gone, drop "pre_" from the remaining
helpers. The helpers aren't invoked purely before SMI/RSM processing,
e.g. both helpers are invoked after state is snapshotted (from regs or
SMRAM), and the RSM helper is invoked after some amount of register state
has been stuffed.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609185619.992058-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop the .post_leave_smm() emulator callback, which at this point is just
a wrapper to kvm_mmu_reset_context(). The manual context reset is
unnecessary, because unlike enter_smm() which calls vendor MSR/CR helpers
directly, em_rsm() bounces through the KVM helpers, e.g. kvm_set_cr4(),
which are responsible for processing side effects. em_rsm() is already
subtly relying on this behavior as it doesn't manually do
kvm_update_cpuid_runtime(), e.g. to recognize CR4.OSXSAVE changes.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609185619.992058-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename the SMM tracepoint, which handles both entering and exiting SMM,
from kvm_enter_smm to kvm_smm_transition.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609185619.992058-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Invoke the "entering SMM" tracepoint from kvm_smm_changed() instead of
enter_smm(), effectively moving it from before reading vCPU state to
after reading state (but still before writing it to SMRAM!). The primary
motivation is to consolidate code, but calling the tracepoint from
kvm_smm_changed() also makes its invocation consistent with respect to
SMI and RSM, and with respect to KVM_SET_VCPU_EVENTS (which previously
only invoked the tracepoint when forcing the vCPU out of SMM).
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609185619.992058-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the core of SMM hflags modifications into kvm_smm_changed() and use
kvm_smm_changed() in enter_smm(). Clear HF_SMM_INSIDE_NMI_MASK for
leaving SMM but do not set it for entering SMM. If the vCPU is executing
outside of SMM, the flag should unequivocally be cleared, e.g. this
technically fixes a benign bug where the flag could be left set after
KVM_SET_VCPU_EVENTS, but the reverse is not true as NMI blocking depends
on pre-SMM state or userspace input.
Note, this adds an extra kvm_mmu_reset_context() to enter_smm(). The
extra/early reset isn't strictly necessary, and in a way can never be
necessary since the vCPU/MMU context is in a half-baked state until the
final context reset at the end of the function. But, enter_smm() is not
a hot path, and exploding on an invalid root_hpa is probably better than
having a stale SMM flag in the MMU role; it's at least no worse.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609185619.992058-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move RSM emulation's call to kvm_smm_changed() from .post_leave_smm() to
.exiting_smm(), leaving behind the MMU context reset. The primary
motivation is to allow for future cleanup, but this also fixes a bug of
sorts by queueing KVM_REQ_EVENT even if RSM causes shutdown, e.g. to let
an INIT wake the vCPU from shutdown. Of course, KVM doesn't properly
emulate a shutdown state, e.g. KVM doesn't block SMIs after shutdown, and
immediately exits to userspace, so the event request is a moot point in
practice.
Moving kvm_smm_changed() also moves the RSM tracepoint. This isn't
strictly necessary, but will allow consolidating the SMI and RSM
tracepoints in a future commit (by also moving the SMI tracepoint).
Invoking the tracepoint before loading SMRAM state also means the SMBASE
that reported in the tracepoint will point that the state that will be
used for RSM, as opposed to the SMBASE _after_ RSM completes, which is
arguably a good thing if the tracepoint is being used to debug a RSM/SMM
issue.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609185619.992058-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Replace the .set_hflags() emulator hook with a dedicated .exiting_smm(),
moving the SMM and SMM_INSIDE_NMI flag handling out of the emulator in
the process. This is a step towards consolidating much of the logic in
kvm_smm_changed(), including the SMM hflags updates.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609185619.992058-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use the recently introduced KVM_REQ_TRIPLE_FAULT to properly emulate
shutdown if RSM from SMM fails.
Note, entering shutdown after clearing the SMM flag and restoring NMI
blocking is architecturally correct with respect to AMD's APM, which KVM
also uses for SMRAM layout and RSM NMI blocking behavior. The APM says:
An RSM causes a processor shutdown if an invalid-state condition is
found in the SMRAM state-save area. Only an external reset, external
processor-initialization, or non-maskable external interrupt (NMI) can
cause the processor to leave the shutdown state.
Of note is processor-initialization (INIT) as a valid shutdown wake
event, as INIT is blocked by SMM, implying that entering shutdown also
forces the CPU out of SMM.
For recent Intel CPUs, restoring NMI blocking is technically wrong, but
so is restoring NMI blocking in the first place, and Intel's RSM
"architecture" is such a mess that just about anything is allowed and can
be justified as micro-architectural behavior.
Per the SDM:
On Pentium 4 and later processors, shutdown will inhibit INTR and A20M
but will not change any of the other inhibits. On these processors,
NMIs will be inhibited if no action is taken in the SMI handler to
uninhibit them (see Section 34.8).
where Section 34.8 says:
When the processor enters SMM while executing an NMI handler, the
processor saves the SMRAM state save map but does not save the
attribute to keep NMI interrupts disabled. Potentially, an NMI could be
latched (while in SMM or upon exit) and serviced upon exit of SMM even
though the previous NMI handler has still not completed.
I.e. RSM unconditionally unblocks NMI, but shutdown on RSM does not,
which is in direct contradiction of KVM's behavior. But, as mentioned
above, KVM follows AMD architecture and restores NMI blocking on RSM, so
that micro-architectural detail is already lost.
And for Pentium era CPUs, SMI# can break shutdown, meaning that at least
some Intel CPUs fully leave SMM when entering shutdown:
In the shutdown state, Intel processors stop executing instructions
until a RESET#, INIT# or NMI# is asserted. While Pentium family
processors recognize the SMI# signal in shutdown state, P6 family and
Intel486 processors do not.
In other words, the fact that Intel CPUs have implemented the two
extremes gives KVM carte blanche when it comes to honoring Intel's
architecture for handling shutdown during RSM.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609185619.992058-3-seanjc@google.com>
[Return X86EMUL_CONTINUE after triple fault. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now that APICv/AVIC enablement is kept in common 'enable_apicv' variable,
there's no need to call kvm_apicv_init() from vendor specific code.
No functional change intended.
Reviewed-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210609150911.1471882-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Unify VMX and SVM code by moving APICv/AVIC enablement tracking to common
'enable_apicv' variable. Note: unlike APICv, AVIC is disabled by default.
No functional change intended.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210609150911.1471882-2-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Implement PM hibernation/suspend prepare notifiers so that KVM
can reliably set PVCLOCK_GUEST_STOPPED on VCPUs and properly
suspend VMs.
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Message-Id: <20210606021045.14159-2-senozhatsky@chromium.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Don't allow posted interrupts to modify a stale posted interrupt
descriptor (including the initial value of 0).
Empirical tests on real hardware reveal that a posted interrupt
descriptor referencing an unbacked address has PCI bus error semantics
(reads as all 1's; writes are ignored). However, kvm can't distinguish
unbacked addresses from device-backed (MMIO) addresses, so it should
really ask userspace for an MMIO completion. That's overly
complicated, so just punt with KVM_INTERNAL_ERROR.
Don't return the error until the posted interrupt descriptor is
actually accessed. We don't want to break the existing kvm-unit-tests
that assume they can launch an L2 VM with a posted interrupt
descriptor that references MMIO space in L1.
Fixes: 6beb7bd52e ("kvm: nVMX: Refactor nested_get_vmcs12_pages()")
Signed-off-by: Jim Mattson <jmattson@google.com>
Message-Id: <20210604172611.281819-8-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When the kernel has no mapping for the vmcs02 virtual APIC page,
userspace MMIO completion is necessary to process nested posted
interrupts. This is not a configuration that KVM supports. Rather than
silently ignoring the problem, try to exit to userspace with
KVM_INTERNAL_ERROR.
Note that the event that triggers this error is consumed as a
side-effect of a call to kvm_check_nested_events. On some paths
(notably through kvm_vcpu_check_block), the error is dropped. In any
case, this is an incremental improvement over always ignoring the
error.
Signed-off-by: Jim Mattson <jmattson@google.com>
Message-Id: <20210604172611.281819-7-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
No functional change intended. At present, the only negative value
returned by kvm_check_nested_events is -EBUSY.
Signed-off-by: Jim Mattson <jmattson@google.com>
Message-Id: <20210604172611.281819-6-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
No functional change intended. At present, 'r' will always be -EBUSY
on a control transfer to the 'out' label.
Signed-off-by: Jim Mattson <jmattson@google.com>
Message-Id: <20210604172611.281819-5-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
No functional change intended.
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Oliver Upton <oupton@google.com>
Message-Id: <20210604172611.281819-4-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
A survey of the callsites reveals that they all ensure the vCPU is in
guest mode before calling kvm_check_nested_events. Remove this dead
code so that the only negative value this function returns (at the
moment) is -EBUSY.
Signed-off-by: Jim Mattson <jmattson@google.com>
Message-Id: <20210604172611.281819-2-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Calculate the TSC offset and multiplier on nested transitions and expose
the TSC scaling feature to L1.
Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210526184418.28881-11-ilstam@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently vmx_vcpu_load_vmcs() writes the TSC_MULTIPLIER field of the
VMCS every time the VMCS is loaded. Instead of doing this, set this
field from common code on initialization and whenever the scaling ratio
changes.
Additionally remove vmx->current_tsc_ratio. This field is redundant as
vcpu->arch.tsc_scaling_ratio already tracks the current TSC scaling
ratio. The vmx->current_tsc_ratio field is only used for avoiding
unnecessary writes but it is no longer needed after removing the code
from the VMCS load path.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Message-Id: <20210607105438.16541-1-ilstam@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The write_l1_tsc_offset() callback has a misleading name. It does not
set L1's TSC offset, it rather updates the current TSC offset which
might be different if a nested guest is executing. Additionally, both
the vmx and svm implementations use the same logic for calculating the
current TSC before writing it to hardware.
Rename the function and move the common logic to the caller. The vmx/svm
specific code now merely sets the given offset to the corresponding
hardware structure.
Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210526184418.28881-9-ilstam@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When L2 is entered we need to "merge" the TSC multiplier and TSC offset
values of 01 and 12 together.
The merging is done using the following equations:
offset_02 = ((offset_01 * mult_12) >> shift_bits) + offset_12
mult_02 = (mult_01 * mult_12) >> shift_bits
Where shift_bits is kvm_tsc_scaling_ratio_frac_bits.
Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210526184418.28881-8-ilstam@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In order to implement as much of the nested TSC scaling logic as
possible in common code, we need these vendor callbacks for retrieving
the TSC offset and the TSC multiplier that L1 has set for L2.
Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210526184418.28881-7-ilstam@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Sometimes kvm_scale_tsc() needs to use the current scaling ratio and
other times (like when reading the TSC from user space) it needs to use
L1's scaling ratio. Have the caller specify this by passing the ratio as
a parameter.
Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210526184418.28881-5-ilstam@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
All existing code uses kvm_compute_tsc_offset() passing L1 TSC values to
it. Let's document this by renaming it to kvm_compute_l1_tsc_offset().
Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210526184418.28881-4-ilstam@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Store L1's scaling ratio in the kvm_vcpu_arch struct like we already do
for L1's TSC offset. This allows for easy save/restore when we enter and
then exit the nested guest.
Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210526184418.28881-3-ilstam@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If the TDP MMU is in use, wait to allocate the rmaps until the shadow
MMU is actually used. (i.e. a nested VM is launched.) This saves memory
equal to 0.2% of guest memory in cases where the TDP MMU is used and
there are no nested guests involved.
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210518173414.450044-8-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If only the TDP MMU is being used to manage the memory mappings for a VM,
then many rmap operations can be skipped as they are guaranteed to be
no-ops. This saves some time which would be spent on the rmap operation.
It also avoids acquiring the MMU lock in write mode for many operations.
This makes it safe to run the VM without rmaps allocated, when only
using the TDP MMU and sets the stage for waiting to allocate the rmaps
until they're needed.
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210518173414.450044-7-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a field to control whether new memslots should have rmaps allocated
for them. As of this change, it's not safe to skip allocating rmaps, so
the field is always set to allocate rmaps. Future changes will make it
safe to operate without rmaps, using the TDP MMU. Then further changes
will allow the rmaps to be allocated lazily when needed for nested
oprtation.
No functional change expected.
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210518173414.450044-6-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Small refactor to facilitate allocating rmaps for all memslots at once.
No functional change expected.
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210518173414.450044-3-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Small code deduplication. No functional change expected.
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210518173414.450044-2-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently, when dirty logging is started in initially-all-set mode,
we write protect huge pages to prepare for splitting them into
4K pages, and leave normal pages untouched as the logging will
be enabled lazily as dirty bits are cleared.
However, enabling dirty logging lazily is also feasible for huge pages.
This not only reduces the time of start dirty logging, but it also
greatly reduces side-effect on guest when there is high dirty rate.
Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Message-Id: <20210429034115.35560-3-zhukeqian1@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Prepare for write protecting large page lazily during dirty log tracking,
for which we will only need to write protect gfns at large page
granularity.
No functional or performance change expected.
Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Message-Id: <20210429034115.35560-2-zhukeqian1@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now that kvm_hv_flush_tlb() has been patched to support XMM hypercall
inputs, we can start advertising this feature to guests.
Cc: Alexander Graf <graf@amazon.com>
Cc: Evgeny Iakovlev <eyakovl@amazon.de>
Signed-off-by: Siddharth Chandrasekaran <sidcha@amazon.de>
Message-Id: <e63fc1c61dd2efecbefef239f4f0a598bd552750.1622019134.git.sidcha@amazon.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Hyper-V supports the use of XMM registers to perform fast hypercalls.
This allows guests to take advantage of the improved performance of the
fast hypercall interface even though a hypercall may require more than
(the current maximum of) two input registers.
The XMM fast hypercall interface uses six additional XMM registers (XMM0
to XMM5) to allow the guest to pass an input parameter block of up to
112 bytes.
Add framework to read from XMM registers in kvm_hv_hypercall() and use
the additional hypercall inputs from XMM registers in kvm_hv_flush_tlb()
when possible.
Cc: Alexander Graf <graf@amazon.com>
Co-developed-by: Evgeny Iakovlev <eyakovl@amazon.de>
Signed-off-by: Evgeny Iakovlev <eyakovl@amazon.de>
Signed-off-by: Siddharth Chandrasekaran <sidcha@amazon.de>
Message-Id: <fc62edad33f1920fe5c74dde47d7d0b4275a9012.1622019134.git.sidcha@amazon.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
As of now there are 7 parameters (and flags) that are used in various
hyper-v hypercall handlers. There are 6 more input/output parameters
passed from XMM registers which are to be added in an upcoming patch.
To make passing arguments to the handlers more readable, capture all
these parameters into a single structure.
Cc: Alexander Graf <graf@amazon.com>
Cc: Evgeny Iakovlev <eyakovl@amazon.de>
Signed-off-by: Siddharth Chandrasekaran <sidcha@amazon.de>
Message-Id: <273f7ed510a1f6ba177e61b73a5c7bfbee4a4a87.1622019133.git.sidcha@amazon.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Hyper-v XMM fast hypercalls use XMM registers to pass input/output
parameters. To access these, hyperv.c can reuse some FPU register
accessors defined in emulator.c. Move them to a common location so both
can access them.
While at it, reorder the parameters of these accessor methods to make
them more readable.
Cc: Alexander Graf <graf@amazon.com>
Cc: Evgeny Iakovlev <eyakovl@amazon.de>
Signed-off-by: Siddharth Chandrasekaran <sidcha@amazon.de>
Message-Id: <01a85a6560714d4d3637d3d86e5eba65073318fa.1622019133.git.sidcha@amazon.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Function 'is_nx_huge_page_enabled' is called only by kvm/mmu, so make
it as inline fucntion and remove the unnecessary declaration.
Cc: Ben Gardon <bgardon@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
Message-Id: <1622102271-63107-1-git-send-email-zhangshaokun@hisilicon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Calculate and check the full mmu_role when initializing the MMU context
for the nested MMU, where "full" means the bits and pieces of the role
that aren't handled by kvm_calc_mmu_role_common(). While the nested MMU
isn't used for shadow paging, things like the number of levels in the
guest's page tables are surprisingly important when walking the guest
page tables. Failure to reinitialize the nested MMU context if L2's
paging mode changes can result in unexpected and/or missed page faults,
and likely other explosions.
E.g. if an L1 vCPU is running both a 32-bit PAE L2 and a 64-bit L2, the
"common" role calculation will yield the same role for both L2s. If the
64-bit L2 is run after the 32-bit PAE L2, L0 will fail to reinitialize
the nested MMU context, ultimately resulting in a bad walk of L2's page
tables as the MMU will still have a guest root_level of PT32E_ROOT_LEVEL.
WARNING: CPU: 4 PID: 167334 at arch/x86/kvm/vmx/vmx.c:3075 ept_save_pdptrs+0x15/0xe0 [kvm_intel]
Modules linked in: kvm_intel]
CPU: 4 PID: 167334 Comm: CPU 3/KVM Not tainted 5.13.0-rc1-d849817d5673-reqs #185
Hardware name: ASUS Q87M-E/Q87M-E, BIOS 1102 03/03/2014
RIP: 0010:ept_save_pdptrs+0x15/0xe0 [kvm_intel]
Code: <0f> 0b c3 f6 87 d8 02 00f
RSP: 0018:ffffbba702dbba00 EFLAGS: 00010202
RAX: 0000000000000011 RBX: 0000000000000002 RCX: ffffffff810a2c08
RDX: ffff91d7bc30acc0 RSI: 0000000000000011 RDI: ffff91d7bc30a600
RBP: ffff91d7bc30a600 R08: 0000000000000010 R09: 0000000000000007
R10: 0000000000000000 R11: 0000000000000000 R12: ffff91d7bc30a600
R13: ffff91d7bc30acc0 R14: ffff91d67c123460 R15: 0000000115d7e005
FS: 00007fe8e9ffb700(0000) GS:ffff91d90fb00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000029f15a001 CR4: 00000000001726e0
Call Trace:
kvm_pdptr_read+0x3a/0x40 [kvm]
paging64_walk_addr_generic+0x327/0x6a0 [kvm]
paging64_gva_to_gpa_nested+0x3f/0xb0 [kvm]
kvm_fetch_guest_virt+0x4c/0xb0 [kvm]
__do_insn_fetch_bytes+0x11a/0x1f0 [kvm]
x86_decode_insn+0x787/0x1490 [kvm]
x86_decode_emulated_instruction+0x58/0x1e0 [kvm]
x86_emulate_instruction+0x122/0x4f0 [kvm]
vmx_handle_exit+0x120/0x660 [kvm_intel]
kvm_arch_vcpu_ioctl_run+0xe25/0x1cb0 [kvm]
kvm_vcpu_ioctl+0x211/0x5a0 [kvm]
__x64_sys_ioctl+0x83/0xb0
do_syscall_64+0x40/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xae
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: stable@vger.kernel.org
Fixes: bf627a9288 ("x86/kvm/mmu: check if MMU reconfiguration is needed in init_kvm_nested_mmu()")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210610220026.1364486-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit c9b8b07cde (KVM: x86: Dynamically allocate per-vCPU emulation context)
tries to allocate per-vCPU emulation context dynamically, however, the
x86_emulator slab cache is still exiting after the kvm module is unload
as below after destroying the VM and unloading the kvm module.
grep x86_emulator /proc/slabinfo
x86_emulator 36 36 2672 12 8 : tunables 0 0 0 : slabdata 3 3 0
This patch fixes this slab cache leak by destroying the x86_emulator slab cache
when the kvm module is unloaded.
Fixes: c9b8b07cde (KVM: x86: Dynamically allocate per-vCPU emulation context)
Cc: stable@vger.kernel.org
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1623387573-5969-1-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Send SEV_CMD_DECOMMISSION command to PSP firmware if ASID binding
fails. If a failure happens after a successful LAUNCH_START command,
a decommission command should be executed. Otherwise, guest context
will be unfreed inside the AMD SP. After the firmware will not have
memory to allocate more SEV guest context, LAUNCH_START command will
begin to fail with SEV_RET_RESOURCE_LIMIT error.
The existing code calls decommission inside sev_unbind_asid, but it is
not called if a failure happens before guest activation succeeds. If
sev_bind_asid fails, decommission is never called. PSP firmware has a
limit for the number of guests. If sev_asid_binding fails many times,
PSP firmware will not have resources to create another guest context.
Cc: stable@vger.kernel.org
Fixes: 59414c9892 ("KVM: SVM: Add support for KVM_SEV_LAUNCH_START command")
Reported-by: Peter Gonda <pgonda@google.com>
Signed-off-by: Alper Gun <alpergun@google.com>
Reviewed-by: Marc Orr <marcorr@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210610174604.2554090-1-alpergun@google.com>
Immediately reset the MMU context when the vCPU's SMM flag is cleared so
that the SMM flag in the MMU role is always synchronized with the vCPU's
flag. If RSM fails (which isn't correctly emulated), KVM will bail
without calling post_leave_smm() and leave the MMU in a bad state.
The bad MMU role can lead to a NULL pointer dereference when grabbing a
shadow page's rmap for a page fault as the initial lookups for the gfn
will happen with the vCPU's SMM flag (=0), whereas the rmap lookup will
use the shadow page's SMM flag, which comes from the MMU (=1). SMM has
an entirely different set of memslots, and so the initial lookup can find
a memslot (SMM=0) and then explode on the rmap memslot lookup (SMM=1).
general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
CPU: 1 PID: 8410 Comm: syz-executor382 Not tainted 5.13.0-rc5-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:__gfn_to_rmap arch/x86/kvm/mmu/mmu.c:935 [inline]
RIP: 0010:gfn_to_rmap+0x2b0/0x4d0 arch/x86/kvm/mmu/mmu.c:947
Code: <42> 80 3c 20 00 74 08 4c 89 ff e8 f1 79 a9 00 4c 89 fb 4d 8b 37 44
RSP: 0018:ffffc90000ffef98 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff888015b9f414 RCX: ffff888019669c40
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000001
RBP: 0000000000000001 R08: ffffffff811d9cdb R09: ffffed10065a6002
R10: ffffed10065a6002 R11: 0000000000000000 R12: dffffc0000000000
R13: 0000000000000003 R14: 0000000000000001 R15: 0000000000000000
FS: 000000000124b300(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000028e31000 CR4: 00000000001526e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
rmap_add arch/x86/kvm/mmu/mmu.c:965 [inline]
mmu_set_spte+0x862/0xe60 arch/x86/kvm/mmu/mmu.c:2604
__direct_map arch/x86/kvm/mmu/mmu.c:2862 [inline]
direct_page_fault+0x1f74/0x2b70 arch/x86/kvm/mmu/mmu.c:3769
kvm_mmu_do_page_fault arch/x86/kvm/mmu.h:124 [inline]
kvm_mmu_page_fault+0x199/0x1440 arch/x86/kvm/mmu/mmu.c:5065
vmx_handle_exit+0x26/0x160 arch/x86/kvm/vmx/vmx.c:6122
vcpu_enter_guest+0x3bdd/0x9630 arch/x86/kvm/x86.c:9428
vcpu_run+0x416/0xc20 arch/x86/kvm/x86.c:9494
kvm_arch_vcpu_ioctl_run+0x4e8/0xa40 arch/x86/kvm/x86.c:9722
kvm_vcpu_ioctl+0x70f/0xbb0 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3460
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:1069 [inline]
__se_sys_ioctl+0xfb/0x170 fs/ioctl.c:1055
do_syscall_64+0x3f/0xb0 arch/x86/entry/common.c:47
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x440ce9
Cc: stable@vger.kernel.org
Reported-by: syzbot+fb0b6a7e8713aeb0319c@syzkaller.appspotmail.com
Fixes: 9ec19493fb ("KVM: x86: clear SMM flags before loading state while leaving SMM")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609185619.992058-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In preparation to enable -Wimplicit-fallthrough for Clang, fix a couple
of warnings by explicitly adding break statements instead of just letting
the code fall through to the next case.
Link: https://github.com/KSPP/linux/issues/115
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Message-Id: <20210528200756.GA39320@embeddedor>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fix kernel-doc warnings:
arch/x86/kvm/svm/avic.c:233: warning: Function parameter or member 'activate' not described in 'avic_update_access_page'
arch/x86/kvm/svm/avic.c:233: warning: Function parameter or member 'kvm' not described in 'avic_update_access_page'
arch/x86/kvm/svm/avic.c:781: warning: Function parameter or member 'e' not described in 'get_pi_vcpu_info'
arch/x86/kvm/svm/avic.c:781: warning: Function parameter or member 'kvm' not described in 'get_pi_vcpu_info'
arch/x86/kvm/svm/avic.c:781: warning: Function parameter or member 'svm' not described in 'get_pi_vcpu_info'
arch/x86/kvm/svm/avic.c:781: warning: Function parameter or member 'vcpu_info' not described in 'get_pi_vcpu_info'
arch/x86/kvm/svm/avic.c:1009: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com>
Message-Id: <20210609122217.2967131-1-chenxiaosong2@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Per the SDM, "any access that touches bytes 4 through 15 of an APIC
register may cause undefined behavior and must not be executed."
Worse, such an access in kvm_lapic_reg_read can result in a leak of
kernel stack contents. Prior to commit 01402cf810 ("kvm: LAPIC:
write down valid APIC registers"), such an access was explicitly
disallowed. Restore the guard that was removed in that commit.
Fixes: 01402cf810 ("kvm: LAPIC: write down valid APIC registers")
Signed-off-by: Jim Mattson <jmattson@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Message-Id: <20210602205224.3189316-1-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
without nested page tables.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmDAVpQUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroNkOgf9F97eFxAdod3/wbW9EbsUPR5bMTLE
+R6Hmvw+yCm/W2cycVGdCSh1BEKNuZN/XfHln2cYVfVr6ndog58A4Y0urFAhTROv
IHs8TCA5biQitoZ716l88ExOitnqJiSmMhGex969+zm1Lb9MQo1KA/zxERlqCi3s
Pfcxb6I8VbD9LEb6NaQdDgQoslJo1tzhe9gGYAYrpMOZujpj1RPeIOZIfeII0MP/
g14/JSar8cXc9QJ6zbiKn8HhpmzGJnaIsyFFL2RMIBlKvxsnpOU6VmisLTL9407o
P246Vq59BM8pdRCVUW9W9hLr2ho8lmi+ZYXASCm+qfn8cLaHyRCqSK56ZQ==
=nW43
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"Bugfixes, including a TLB flush fix that affects processors without
nested page tables"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
kvm: fix previous commit for 32-bit builds
kvm: avoid speculation-based attacks from out-of-range memslot accesses
KVM: x86: Unload MMU on guest TLB flush if TDP disabled to force MMU sync
KVM: x86: Ensure liveliness of nested VM-Enter fail tracepoint message
selftests: kvm: Add support for customized slot0 memory size
KVM: selftests: introduce P47V64 for s390x
KVM: x86: Ensure PV TLB flush tracepoint reflects KVM behavior
KVM: X86: MMU: Use the correct inherited permissions to get shadow page
KVM: LAPIC: Write 0 to TMICT should also cancel vmx-preemption timer
KVM: SVM: Fix SEV SEND_START session length & SEND_UPDATE_DATA query length after commit 238eca821c
When using shadow paging, unload the guest MMU when emulating a guest TLB
flush to ensure all roots are synchronized. From the guest's perspective,
flushing the TLB ensures any and all modifications to its PTEs will be
recognized by the CPU.
Note, unloading the MMU is overkill, but is done to mirror KVM's existing
handling of INVPCID(all) and ensure the bug is squashed. Future cleanup
can be done to more precisely synchronize roots when servicing a guest
TLB flush.
If TDP is enabled, synchronizing the MMU is unnecessary even if nested
TDP is in play, as a "legacy" TLB flush from L1 does not invalidate L1's
TDP mappings. For EPT, an explicit INVEPT is required to invalidate
guest-physical mappings; for NPT, guest mappings are always tagged with
an ASID and thus can only be invalidated via the VMCB's ASID control.
This bug has existed since the introduction of KVM_VCPU_FLUSH_TLB.
It was only recently exposed after Linux guests stopped flushing the
local CPU's TLB prior to flushing remote TLBs (see commit 4ce94eabac,
"x86/mm/tlb: Flush remote and local TLBs concurrently"), but is also
visible in Windows 10 guests.
Tested-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Fixes: f38a7b7526 ("KVM: X86: support paravirtualized help for TLB shootdowns")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
[sean: massaged comment and changelog]
Message-Id: <20210531172256.2908-1-jiangshanlai@gmail.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use the __string() machinery provided by the tracing subystem to make a
copy of the string literals consumed by the "nested VM-Enter failed"
tracepoint. A complete copy is necessary to ensure that the tracepoint
can't outlive the data/memory it consumes and deference stale memory.
Because the tracepoint itself is defined by kvm, if kvm-intel and/or
kvm-amd are built as modules, the memory holding the string literals
defined by the vendor modules will be freed when the module is unloaded,
whereas the tracepoint and its data in the ring buffer will live until
kvm is unloaded (or "indefinitely" if kvm is built-in).
This bug has existed since the tracepoint was added, but was recently
exposed by a new check in tracing to detect exactly this type of bug.
fmt: '%s%s
' current_buffer: ' vmx_dirty_log_t-140127 [003] .... kvm_nested_vmenter_failed: '
WARNING: CPU: 3 PID: 140134 at kernel/trace/trace.c:3759 trace_check_vprintf+0x3be/0x3e0
CPU: 3 PID: 140134 Comm: less Not tainted 5.13.0-rc1-ce2e73ce600a-req #184
Hardware name: ASUS Q87M-E/Q87M-E, BIOS 1102 03/03/2014
RIP: 0010:trace_check_vprintf+0x3be/0x3e0
Code: <0f> 0b 44 8b 4c 24 1c e9 a9 fe ff ff c6 44 02 ff 00 49 8b 97 b0 20
RSP: 0018:ffffa895cc37bcb0 EFLAGS: 00010282
RAX: 0000000000000000 RBX: ffffa895cc37bd08 RCX: 0000000000000027
RDX: 0000000000000027 RSI: 00000000ffffdfff RDI: ffff9766cfad74f8
RBP: ffffffffc0a041d4 R08: ffff9766cfad74f0 R09: ffffa895cc37bad8
R10: 0000000000000001 R11: 0000000000000001 R12: ffffffffc0a041d4
R13: ffffffffc0f4dba8 R14: 0000000000000000 R15: ffff976409f2c000
FS: 00007f92fa200740(0000) GS:ffff9766cfac0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000559bd11b0000 CR3: 000000019fbaa002 CR4: 00000000001726e0
Call Trace:
trace_event_printf+0x5e/0x80
trace_raw_output_kvm_nested_vmenter_failed+0x3a/0x60 [kvm]
print_trace_line+0x1dd/0x4e0
s_show+0x45/0x150
seq_read_iter+0x2d5/0x4c0
seq_read+0x106/0x150
vfs_read+0x98/0x180
ksys_read+0x5f/0xe0
do_syscall_64+0x40/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xae
Cc: Steven Rostedt <rostedt@goodmis.org>
Fixes: 380e0055bc ("KVM: nVMX: trace nested VM-Enter failures detected by H/W")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Message-Id: <20210607175748.674002-1-seanjc@google.com>
In record_steal_time(), st->preempted is read twice, and
trace_kvm_pv_tlb_flush() might output result inconsistent if
kvm_vcpu_flush_tlb_guest() see a different st->preempted later.
It is a very trivial problem and hardly has actual harm and can be
avoided by reseting and reading st->preempted in atomic way via xchg().
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20210531174628.10265-1-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When computing the access permissions of a shadow page, use the effective
permissions of the walk up to that point, i.e. the logic AND of its parents'
permissions. Two guest PxE entries that point at the same table gfn need to
be shadowed with different shadow pages if their parents' permissions are
different. KVM currently uses the effective permissions of the last
non-leaf entry for all non-leaf entries. Because all non-leaf SPTEs have
full ("uwx") permissions, and the effective permissions are recorded only
in role.access and merged into the leaves, this can lead to incorrect
reuse of a shadow page and eventually to a missing guest protection page
fault.
For example, here is a shared pagetable:
pgd[] pud[] pmd[] virtual address pointers
/->pmd1(u--)->pte1(uw-)->page1 <- ptr1 (u--)
/->pud1(uw-)--->pmd2(uw-)->pte2(uw-)->page2 <- ptr2 (uw-)
pgd-| (shared pmd[] as above)
\->pud2(u--)--->pmd1(u--)->pte1(uw-)->page1 <- ptr3 (u--)
\->pmd2(uw-)->pte2(uw-)->page2 <- ptr4 (u--)
pud1 and pud2 point to the same pmd table, so:
- ptr1 and ptr3 points to the same page.
- ptr2 and ptr4 points to the same page.
(pud1 and pud2 here are pud entries, while pmd1 and pmd2 here are pmd entries)
- First, the guest reads from ptr1 first and KVM prepares a shadow
page table with role.access=u--, from ptr1's pud1 and ptr1's pmd1.
"u--" comes from the effective permissions of pgd, pud1 and
pmd1, which are stored in pt->access. "u--" is used also to get
the pagetable for pud1, instead of "uw-".
- Then the guest writes to ptr2 and KVM reuses pud1 which is present.
The hypervisor set up a shadow page for ptr2 with pt->access is "uw-"
even though the pud1 pmd (because of the incorrect argument to
kvm_mmu_get_page in the previous step) has role.access="u--".
- Then the guest reads from ptr3. The hypervisor reuses pud1's
shadow pmd for pud2, because both use "u--" for their permissions.
Thus, the shadow pmd already includes entries for both pmd1 and pmd2.
- At last, the guest writes to ptr4. This causes no vmexit or pagefault,
because pud1's shadow page structures included an "uw-" page even though
its role.access was "u--".
Any kind of shared pagetable might have the similar problem when in
virtual machine without TDP enabled if the permissions are different
from different ancestors.
In order to fix the problem, we change pt->access to be an array, and
any access in it will not include permissions ANDed from child ptes.
The test code is: https://lore.kernel.org/kvm/20210603050537.19605-1-jiangshanlai@gmail.com/
Remember to test it with TDP disabled.
The problem had existed long before the commit 41074d07c7 ("KVM: MMU:
Fix inherited permissions for emulated guest pte updates"), and it
is hard to find which is the culprit. So there is no fixes tag here.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20210603052455.21023-1-jiangshanlai@gmail.com>
Cc: stable@vger.kernel.org
Fixes: cea0f0e7ea ("[PATCH] KVM: MMU: Shadow page table caching")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
According to the SDM 10.5.4.1:
A write of 0 to the initial-count register effectively stops the local
APIC timer, in both one-shot and periodic mode.
However, the lapic timer oneshot/periodic mode which is emulated by vmx-preemption
timer doesn't stop by writing 0 to TMICT since vmx->hv_deadline_tsc is still
programmed and the guest will receive the spurious timer interrupt later. This
patch fixes it by also cancelling the vmx-preemption timer when writing 0 to
the initial-count register.
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1623050385-100988-1-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit 238eca821c ("KVM: SVM: Allocate SEV command structures on local stack")
uses the local stack to allocate the structures used to communicate with the PSP,
which were earlier being kzalloced. This breaks SEV live migration for
computing the SEND_START session length and SEND_UPDATE_DATA query length as
session_len and trans_len and hdr_len fields are not zeroed respectively for
the above commands before issuing the SEV Firmware API call, hence the
firmware returns incorrect session length and update data header or trans length.
Also the SEV Firmware API returns SEV_RET_INVALID_LEN firmware error
for these length query API calls, and the return value and the
firmware error needs to be passed to the userspace as it is, so
need to remove the return check in the KVM code.
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Message-Id: <20210607061532.27459-1-Ashish.Kalra@amd.com>
Fixes: 238eca821c ("KVM: SVM: Allocate SEV command structures on local stack")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmC9UH8eHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGRDYH/3WgnRz5DfVhjmlD
Lg38mPmbZWhFibXghrYrpbVpTyhjGFRuNtXAt2p7/nYnM71wzI6Qkx6cRKZeB5HE
/SqeksPWUEgJaUuoXeQBrBaG7q/+9ph7Rgaf2wP7k+E00RI3E4pbMubuqFAUeikr
itKFD9aTUsgT5XbG2hH5Ddwh5hBD2C/1PVt3jpLnJkXRCn91uEh+R7SHXP/fsjAd
ZaGOVbAGm+jePCQDBXpVUn+8fJdxvQg7rxWVRRRhi5LXG+pnAezbkGl746zBwaSw
K6lmVSA+eAiVkKu6nR4HJv9Hax1juFbp9xpcCo4jzxO5NJF4jsmytjLEaYFdi4NX
G542808=
=BPDL
-----END PGP SIGNATURE-----
Merge tag 'v5.13-rc5' into x86/cleanups
Pick up dependent changes in order to base further cleanups ontop.
Signed-off-by: Borislav Petkov <bp@suse.de>
* Another state update on exit to userspace fix
* Prevent the creation of mixed 32/64 VMs
* Fix regression with irqbypass not restarting the guest on failed connect
* Fix regression with debug register decoding resulting in overlapping access
* Commit exception state on exit to usrspace
* Fix the MMU notifier return values
* Add missing 'static' qualifiers in the new host stage-2 code
x86 fixes:
* fix guest missed wakeup with assigned devices
* fix WARN reported by syzkaller
* do not use BIT() in UAPI headers
* make the kvm_amd.avic parameter bool
PPC fixes:
* make halt polling heuristics consistent with other architectures
selftests:
* various fixes
* new performance selftest memslot_perf_test
* test UFFD minor faults in demand_paging_test
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmCyF0MUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroOHSgf/Q4Hm5e12Bj2xJy6A+iShnrbbT8PW
hcIIOA7zGWXfjVYcBV7anbj7CcpzfIz0otcRBABa5mkhj+fb3YmPEb0EzCPi4Hru
zxpcpB2w7W7WtUOIKe2EmaT+4Pk6/iLcfr8UMHMqx460akE9OmIg10QNWai3My/3
RIOeakSckBI9e/1TQZbxH66dsLwCT0lLco7i7AWHdFxkzUQyoA34HX5pczOCBsO5
3nXH+/txnRVhqlcyzWLVVGVzFqmpHtBqkIInDOXfUqIoxo/gOhOgF1QdMUEKomxn
5ZFXlL5IXNtr+7yiI67iHX7CWkGZE9oJ04TgPHn6LR6wRnVvc3JInzcB5Q==
=ollO
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Paolo Bonzini:
"ARM fixes:
- Another state update on exit to userspace fix
- Prevent the creation of mixed 32/64 VMs
- Fix regression with irqbypass not restarting the guest on failed
connect
- Fix regression with debug register decoding resulting in
overlapping access
- Commit exception state on exit to usrspace
- Fix the MMU notifier return values
- Add missing 'static' qualifiers in the new host stage-2 code
x86 fixes:
- fix guest missed wakeup with assigned devices
- fix WARN reported by syzkaller
- do not use BIT() in UAPI headers
- make the kvm_amd.avic parameter bool
PPC fixes:
- make halt polling heuristics consistent with other architectures
selftests:
- various fixes
- new performance selftest memslot_perf_test
- test UFFD minor faults in demand_paging_test"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (44 commits)
selftests: kvm: fix overlapping addresses in memslot_perf_test
KVM: X86: Kill off ctxt->ud
KVM: X86: Fix warning caused by stale emulation context
KVM: X86: Use kvm_get_linear_rip() in single-step and #DB/#BP interception
KVM: x86/mmu: Fix comment mentioning skip_4k
KVM: VMX: update vcpu posted-interrupt descriptor when assigning device
KVM: rename KVM_REQ_PENDING_TIMER to KVM_REQ_UNBLOCK
KVM: x86: add start_assignment hook to kvm_x86_ops
KVM: LAPIC: Narrow the timer latency between wait_lapic_expire and world switch
selftests: kvm: do only 1 memslot_perf_test run by default
KVM: X86: Use _BITUL() macro in UAPI headers
KVM: selftests: add shared hugetlbfs backing source type
KVM: selftests: allow using UFFD minor faults for demand paging
KVM: selftests: create alias mappings when using shared memory
KVM: selftests: add shmem backing source type
KVM: selftests: refactor vm_mem_backing_src_type flags
KVM: selftests: allow different backing source types
KVM: selftests: compute correct demand paging size
KVM: selftests: simplify setup_demand_paging error handling
KVM: selftests: Print a message if /dev/kvm is missing
...
ctxt->ud is consumed only by x86_decode_insn(), we can kill it off by
passing emulation_type to x86_decode_insn() and dropping ctxt->ud
altogether. Tracking that info in ctxt for literally one call is silly.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <1622160097-37633-2-git-send-email-wanpengli@tencent.com>
Reported by syzkaller:
WARNING: CPU: 7 PID: 10526 at linux/arch/x86/kvm//x86.c:7621 x86_emulate_instruction+0x41b/0x510 [kvm]
RIP: 0010:x86_emulate_instruction+0x41b/0x510 [kvm]
Call Trace:
kvm_mmu_page_fault+0x126/0x8f0 [kvm]
vmx_handle_exit+0x11e/0x680 [kvm_intel]
vcpu_enter_guest+0xd95/0x1b40 [kvm]
kvm_arch_vcpu_ioctl_run+0x377/0x6a0 [kvm]
kvm_vcpu_ioctl+0x389/0x630 [kvm]
__x64_sys_ioctl+0x8e/0xd0
do_syscall_64+0x3c/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xae
Commit 4a1e10d5b5 ("KVM: x86: handle hardware breakpoints during emulation())
adds hardware breakpoints check before emulation the instruction and parts of
emulation context initialization, actually we don't have the EMULTYPE_NO_DECODE flag
here and the emulation context will not be reused. Commit c8848cee74 ("KVM: x86:
set ctxt->have_exception in x86_decode_insn()) triggers the warning because it
catches the stale emulation context has #UD, however, it is not during instruction
decoding which should result in EMULATION_FAILED. This patch fixes it by moving
the second part emulation context initialization into init_emulate_ctxt() and
before hardware breakpoints check. The ctxt->ud will be dropped by a follow-up
patch.
syzkaller source: https://syzkaller.appspot.com/x/repro.c?x=134683fdd00000
Reported-by: syzbot+71271244f206d17f6441@syzkaller.appspotmail.com
Fixes: 4a1e10d5b5 (KVM: x86: handle hardware breakpoints during emulation)
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <1622160097-37633-1-git-send-email-wanpengli@tencent.com>
The kvm_get_linear_rip() handles x86/long mode cases well and has
better readability, __kvm_set_rflags() also use the paired
function kvm_is_linear_rip() to check the vcpu->arch.singlestep_rip
set in kvm_arch_vcpu_ioctl_set_guest_debug(), so change the
"CS.BASE + RIP" code in kvm_arch_vcpu_ioctl_set_guest_debug() and
handle_exception_nmi() to this one.
Signed-off-by: Yuan Yao <yuan.yao@intel.com>
Message-Id: <20210526063828.1173-1-yuan.yao@linux.intel.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This comment was left over from a previous version of the patch that
introduced wrprot_gfn_range, when skip_4k was passed in instead of
min_level.
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210526163227.3113557-1-dmatlack@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For VMX, when a vcpu enters HLT emulation, pi_post_block will:
1) Add vcpu to per-cpu list of blocked vcpus.
2) Program the posted-interrupt descriptor "notification vector"
to POSTED_INTR_WAKEUP_VECTOR
With interrupt remapping, an interrupt will set the PIR bit for the
vector programmed for the device on the CPU, test-and-set the
ON bit on the posted interrupt descriptor, and if the ON bit is clear
generate an interrupt for the notification vector.
This way, the target CPU wakes upon a device interrupt and wakes up
the target vcpu.
Problem is that pi_post_block only programs the notification vector
if kvm_arch_has_assigned_device() is true. Its possible for the
following to happen:
1) vcpu V HLTs on pcpu P, kvm_arch_has_assigned_device is false,
notification vector is not programmed
2) device is assigned to VM
3) device interrupts vcpu V, sets ON bit
(notification vector not programmed, so pcpu P remains in idle)
4) vcpu 0 IPIs vcpu V (in guest), but since pi descriptor ON bit is set,
kvm_vcpu_kick is skipped
5) vcpu 0 busy spins on vcpu V's response for several seconds, until
RCU watchdog NMIs all vCPUs.
To fix this, use the start_assignment kvm_x86_ops callback to kick
vcpus out of the halt loop, so the notification vector is
properly reprogrammed to the wakeup vector.
Reported-by: Pei Zhang <pezhang@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Message-Id: <20210526172014.GA29007@fuller.cnet>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM_REQ_UNBLOCK will be used to exit a vcpu from
its inner vcpu halt emulation loop.
Rename KVM_REQ_PENDING_TIMER to KVM_REQ_UNBLOCK, switch
PowerPC to arch specific request bit.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Message-Id: <20210525134321.303768132@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a start_assignment hook to kvm_x86_ops, which is called when
kvm_arch_start_assignment is done.
The hook is required to update the wakeup vector of a sleeping vCPU
when a device is assigned to the guest.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Message-Id: <20210525134321.254128742@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Let's treat lapic_timer_advance_ns automatic tuning logic as hypervisor
overhead, move it before wait_lapic_expire instead of between wait_lapic_expire
and the world switch, the wait duration should be calculated by the
up-to-date guest_tsc after the overhead of automatic tuning logic. This
patch reduces ~30+ cycles for kvm-unit-tests/tscdeadline-latency when testing
busy waits.
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1621339235-11131-5-git-send-email-wanpengli@tencent.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
WARNING: suspicious RCU usage
5.13.0-rc1 #4 Not tainted
-----------------------------
./include/linux/kvm_host.h:710 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
1 lock held by hyperv_clock/8318:
#0: ffffb6b8cb05a7d8 (&hv->hv_lock){+.+.}-{3:3}, at: kvm_hv_invalidate_tsc_page+0x3e/0xa0 [kvm]
stack backtrace:
CPU: 3 PID: 8318 Comm: hyperv_clock Not tainted 5.13.0-rc1 #4
Call Trace:
dump_stack+0x87/0xb7
lockdep_rcu_suspicious+0xce/0xf0
kvm_write_guest_page+0x1c1/0x1d0 [kvm]
kvm_write_guest+0x50/0x90 [kvm]
kvm_hv_invalidate_tsc_page+0x79/0xa0 [kvm]
kvm_gen_update_masterclock+0x1d/0x110 [kvm]
kvm_arch_vm_ioctl+0x2a7/0xc50 [kvm]
kvm_vm_ioctl+0x123/0x11d0 [kvm]
__x64_sys_ioctl+0x3ed/0x9d0
do_syscall_64+0x3d/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
kvm_memslots() will be called by kvm_write_guest(), so we should take the srcu lock.
Fixes: e880c6ea5 (KVM: x86: hyper-v: Prevent using not-yet-updated TSC page by secondary CPUs)
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1621339235-11131-4-git-send-email-wanpengli@tencent.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit 66570e966d (kvm: x86: only provide PV features if enabled in guest's
CPUID) avoids to access pv tlb shootdown host side logic when this pv feature
is not exposed to guest, however, kvm_steal_time.preempted not only leveraged
by pv tlb shootdown logic but also mitigate the lock holder preemption issue.
From guest's point of view, vCPU is always preempted since we lose the reset
of kvm_steal_time.preempted before vmentry if pv tlb shootdown feature is not
exposed. This patch fixes it by clearing kvm_steal_time.preempted before
vmentry.
Fixes: 66570e966d (kvm: x86: only provide PV features if enabled in guest's CPUID)
Reviewed-by: Sean Christopherson <seanjc@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1621339235-11131-3-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In case of under-committed scenarios, vCPUs can be scheduled easily;
kvm_vcpu_yield_to adds extra overhead, and it is also common to see
when vcpu->ready is true but yield later failing due to p->state is
TASK_RUNNING.
Let's bail out in such scenarios by checking the length of current cpu
runqueue, which can be treated as a hint of under-committed instead of
guarantee of accuracy. 30%+ of directed-yield attempts can now avoid
the expensive lookups in kvm_sched_yield() in an under-committed scenario.
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1621339235-11131-2-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Make it consistent with kvm_intel.enable_apicv.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
CONFIG_X86_LOCAL_APIC is always on when CONFIG_KVM (on x86) since
commit e42eef4ba3 ("KVM: add X86_LOCAL_APIC dependency").
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210518144339.1987982-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
AVIC dependency on CONFIG_X86_LOCAL_APIC is dead code since
commit e42eef4ba3 ("KVM: add X86_LOCAL_APIC dependency").
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210518144339.1987982-2-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
- Fix regression with irqbypass not restarting the guest on failed connect
- Fix regression with debug register decoding resulting in overlapping access
- Commit exception state on exit to usrspace
- Fix the MMU notifier return values
- Add missing 'static' qualifiers in the new host stage-2 code
-----BEGIN PGP SIGNATURE-----
iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmCfmUoPHG1hekBrZXJu
ZWwub3JnAAoJECPQ0LrRPXpDei8QAMOWMA9wFTydsMTyRwDDZzD9i3Vg4bYlTdj1
1C1FiHHGL37t44coo1eHtnydWBuhxhhwDHWQE8owFbDHyOnPzEX+NwhmJ4gVlUW5
51aSxfPgXzKiv17WyncqZO9SfA5/RFyA/C2gRq9/fMr/7CpQJjqrvdQXaWh4kPVa
9jFMVd1sCDUPd5c9Jyxd42CmVZjg6mCorOKaEwlI7NZkulRBlFW21A5y+M57sGTF
RLIuQcggFJaG17kZN4p6v55Yoclt8O4xVbDv8SZV3vO1gjpaF1LtXdsmAKvbDZrZ
lEtdumPHyD1maFhwXQFMOyvOgEaRhlhiNaTgKUOyX2LgeW1utCiYO/KwysflZvIC
oLsfx3x+G0nSxa+MWGL9m52Hrt4yyscfbKfBg6nqJB+AqD3teH20xfsEUHTEuYkW
kEgeWcJcWkadL5+ngs6S4PwFr88NyVBdUAagNd5VXE/KFhxCcr4B9oOXk5WdOaMi
ZvLG5IQfIH6k3w+h2wR2WSoxYwltriZ3PwrPIeJ2Se33bK15xtQy1k/IIqvZP/oK
0xxRVoY+nwuru0QZGwyI7zCFFvZzEKOXJ3qzJ2NeQxoTBky/e0bvUwnU8gXLXGPM
lx2Gzw6t+xlTfcF9oIaQq7WlOsrC7Zr4uiTurZGLZKWklso9tLdzW35zmdN6D3qx
sP2LC4iv
=57tg
-----END PGP SIGNATURE-----
Merge tag 'kvmarm-fixes-5.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for 5.13, take #1
- Fix regression with irqbypass not restarting the guest on failed connect
- Fix regression with debug register decoding resulting in overlapping access
- Commit exception state on exit to usrspace
- Fix the MMU notifier return values
- Add missing 'static' qualifiers in the new host stage-2 code
- Reorganize SEV code to streamline and simplify future development
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmCg1XQACgkQEsHwGGHe
VUpRKA//dwzDD1QU16JucfhgFlv/9OTm48ukSwAb9lZjDEy4H1CtVL3xEHFd7L3G
LJp0LTW+OQf0/0aGlQp/cP6sBF6G9Bf4mydx70Id4SyCQt8eZDodB+ZOOWbeteWq
p92fJPbX8CzAglutbE+3v/MD8CCAllTiLZnJZPVj4Kux2/wF6EryDgF1+rb5q8jp
ObTT9817mHVwWVUYzbgceZtd43IocOlKZRmF1qivwScMGylQTe1wfMjunpD5pVt8
Zg4UDNknNfYduqpaG546E6e1zerGNaJK7SHnsuzHRUVU5icNqtgBk061CehP9Ksq
DvYXLUl4xF16j6xJAqIZPNrBkJGdQf4q1g5x2FiBm7rSQU5owzqh5rkVk4EBFFzn
UtzeXpqbStbsZHXycyxBNdq2HXxkFPf2NXZ+bkripPg+DifOGots1uwvAft+6iAE
GudK6qxAvr8phR1cRyy6BahGtgOStXbZYEz0ZdU6t7qFfZMz+DomD5Jimj0kAe6B
s6ras5xm8q3/Py87N/KNjKtSEpgsHv/7F+idde7ODtHhpRL5HCBqhkZOSRkMMZqI
ptX1oSTvBXwRKyi5x9YhkKHUFqfFSUTfJhiRFCWK+IEAv3Y7SipJtfkqxRbI6fEV
FfCeueKDDdViBtseaRceVLJ8Tlr6Qjy27fkPPTqJpthqPpCdoZ0=
=ENfF
-----END PGP SIGNATURE-----
Merge tag 'x86_urgent_for_v5.13_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
"The three SEV commits are not really urgent material. But we figured
since getting them in now will avoid a huge amount of conflicts
between future SEV changes touching tip, the kvm and probably other
trees, sending them to you now would be best.
The idea is that the tip, kvm etc branches for 5.14 will all base
ontop of -rc2 and thus everything will be peachy. What is more, those
changes are purely mechanical and defines movement so they should be
fine to go now (famous last words).
Summary:
- Enable -Wundef for the compressed kernel build stage
- Reorganize SEV code to streamline and simplify future development"
* tag 'x86_urgent_for_v5.13_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/boot/compressed: Enable -Wundef
x86/msr: Rename MSR_K8_SYSCFG to MSR_AMD64_SYSCFG
x86/sev: Move GHCB MSR protocol and NAE definitions in a common header
x86/sev-es: Rename sev-es.{ch} to sev.{ch}
AFAICT KVM only relies on SCHED_INFO. Nothing uses the p->delays data
that belongs to TASK_DELAY_ACCT.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Link: https://lkml.kernel.org/r/20210505111525.187225172@infradead.org
* Fix virtualization of RDPID
* Virtualization of DR6_BUS_LOCK, which on bare metal is new in
the 5.13 merge window
* More nested virtualization migration fixes (nSVM and eVMCS)
* Fix for KVM guest hibernation
* Fix for warning in SEV-ES SRCU usage
* Block KVM from loading on AMD machines with 5-level page tables,
due to the APM not mentioning how host CR4.LA57 exactly impacts
the guest.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmCZWwgUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroOE9wgAk7Io8cuvnhC9ogVqzZWrPweWqFg8
fJcPMB584JRnMqYHBVYbkTPGe8SsCHKR2MKsNdc4cEP111cyr3suWsxOdmjJn58i
7ahy6PcKx7wWeWwEt7O599l6CeoX5XB9ExvA6eiXAv7iZeOJHFa+Ny2GlWgauy6Y
DELryEomx1r4IUkZaSR+2fYjzvOWTXQixwU/jwx8NcTJz0DrzknzLE7XOciPBfn0
t0Q2rCXdL2nF1uPksZbntx8Qoa6t6GDVIyrH/ZCPQYJtAX6cjxNAh3zwCe+hMnOd
fW8ntBH1nZRiNnberA4IICAzqnUokgPWdKBrZT2ntWHBK+aqxXHznrlPJA==
=e+gD
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
- Lots of bug fixes.
- Fix virtualization of RDPID
- Virtualization of DR6_BUS_LOCK, which on bare metal is new to this
release
- More nested virtualization migration fixes (nSVM and eVMCS)
- Fix for KVM guest hibernation
- Fix for warning in SEV-ES SRCU usage
- Block KVM from loading on AMD machines with 5-level page tables, due
to the APM not mentioning how host CR4.LA57 exactly impacts the
guest.
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (48 commits)
KVM: SVM: Move GHCB unmapping to fix RCU warning
KVM: SVM: Invert user pointer casting in SEV {en,de}crypt helpers
kvm: Cap halt polling at kvm->max_halt_poll_ns
tools/kvm_stat: Fix documentation typo
KVM: x86: Prevent deadlock against tk_core.seq
KVM: x86: Cancel pvclock_gtod_work on module removal
KVM: x86: Prevent KVM SVM from loading on kernels with 5-level paging
KVM: X86: Expose bus lock debug exception to guest
KVM: X86: Add support for the emulation of DR6_BUS_LOCK bit
KVM: PPC: Book3S HV: Fix conversion to gfn-based MMU notifier callbacks
KVM: x86: Hide RDTSCP and RDPID if MSR_TSC_AUX probing failed
KVM: x86: Tie Intel and AMD behavior for MSR_TSC_AUX to guest CPU model
KVM: x86: Move uret MSR slot management to common x86
KVM: x86: Export the number of uret MSRs to vendor modules
KVM: VMX: Disable loading of TSX_CTRL MSR the more conventional way
KVM: VMX: Use common x86's uret MSR list as the one true list
KVM: VMX: Use flag to indicate "active" uret MSRs instead of sorting list
KVM: VMX: Configure list of user return MSRs at module init
KVM: x86: Add support for RDPID without RDTSCP
KVM: SVM: Probe and load MSR_TSC_AUX regardless of RDTSCP support in host
...
The SYSCFG MSR continued being updated beyond the K8 family; drop the K8
name from it.
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Joerg Roedel <jroedel@suse.de>
Link: https://lkml.kernel.org/r/20210427111636.1207-4-brijesh.singh@amd.com
The guest and the hypervisor contain separate macros to get and set
the GHCB MSR protocol and NAE event fields. Consolidate the GHCB
protocol definitions and helper macros in one place.
Leave the supported protocol version define in separate files to keep
the guest and hypervisor flexibility to support different GHCB version
in the same release.
There is no functional change intended.
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Joerg Roedel <jroedel@suse.de>
Link: https://lkml.kernel.org/r/20210427111636.1207-3-brijesh.singh@amd.com
When an SEV-ES guest is running, the GHCB is unmapped as part of the
vCPU run support. However, kvm_vcpu_unmap() triggers an RCU dereference
warning with CONFIG_PROVE_LOCKING=y because the SRCU lock is released
before invoking the vCPU run support.
Move the GHCB unmapping into the prepare_guest_switch callback, which is
invoked while still holding the SRCU lock, eliminating the RCU dereference
warning.
Fixes: 291bd20d5d ("KVM: SVM: Add initial support for a VMGEXIT VMEXIT")
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <b2f9b79d15166f2c3e4375c0d9bc3268b7696455.1620332081.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Invert the user pointer params for SEV's helpers for encrypting and
decrypting guest memory so that they take a pointer and cast to an
unsigned long as necessary, as opposed to doing the opposite. Tagging a
non-pointer as __user is confusing and weird since a cast of some form
needs to occur to actually access the user data. This also fixes Sparse
warnings triggered by directly consuming the unsigned longs, which are
"noderef" due to the __user tag.
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Ashish Kalra <ashish.kalra@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210506231542.2331138-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
syzbot reported a possible deadlock in pvclock_gtod_notify():
CPU 0 CPU 1
write_seqcount_begin(&tk_core.seq);
pvclock_gtod_notify() spin_lock(&pool->lock);
queue_work(..., &pvclock_gtod_work) ktime_get()
spin_lock(&pool->lock); do {
seq = read_seqcount_begin(tk_core.seq)
...
} while (read_seqcount_retry(&tk_core.seq, seq);
While this is unlikely to happen, it's possible.
Delegate queue_work() to irq_work() which postpones it until the
tk_core.seq write held region is left and interrupts are reenabled.
Fixes: 16e8d74d2d ("KVM: x86: notifier for clocksource changes")
Reported-by: syzbot+6beae4000559d41d80f8@syzkaller.appspotmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Message-Id: <87h7jgm1zy.ffs@nanos.tec.linutronix.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Nothing prevents the following:
pvclock_gtod_notify()
queue_work(system_long_wq, &pvclock_gtod_work);
...
remove_module(kvm);
...
work_queue_run()
pvclock_gtod_work() <- UAF
Ditto for any other operation on that workqueue list head which touches
pvclock_gtod_work after module removal.
Cancel the work in kvm_arch_exit() to prevent that.
Fixes: 16e8d74d2d ("KVM: x86: notifier for clocksource changes")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Message-Id: <87czu4onry.ffs@nanos.tec.linutronix.de>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Disallow loading KVM SVM if 5-level paging is supported. In theory, NPT
for L1 should simply work, but there unknowns with respect to how the
guest's MAXPHYADDR will be handled by hardware.
Nested NPT is more problematic, as running an L1 VMM that is using
2-level page tables requires stacking single-entry PDP and PML4 tables in
KVM's NPT for L2, as there are no equivalent entries in L1's NPT to
shadow. Barring hardware magic, for 5-level paging, KVM would need stack
another layer to handle PML5.
Opportunistically rename the lm_root pointer, which is used for the
aforementioned stacking when shadowing 2-level L1 NPT, to pml4_root to
call out that it's specifically for PML4.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210505204221.1934471-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Bus lock debug exception is an ability to notify the kernel by an #DB
trap after the instruction acquires a bus lock and is executed when
CPL>0. This allows the kernel to enforce user application throttling or
mitigations.
Existence of bus lock debug exception is enumerated via
CPUID.(EAX=7,ECX=0).ECX[24]. Software can enable these exceptions by
setting bit 2 of the MSR_IA32_DEBUGCTL. Expose the CPUID to guest and
emulate the MSR handling when guest enables it.
Support for this feature was originally developed by Xiaoyao Li and
Chenyi Qiang, but code has since changed enough that this patch has
nothing in common with theirs, except for this commit message.
Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Message-Id: <20210202090433.13441-4-chenyi.qiang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Bus lock debug exception introduces a new bit DR6_BUS_LOCK (bit 11 of
DR6) to indicate that bus lock #DB exception is generated. The set/clear
of DR6_BUS_LOCK is similar to the DR6_RTM. The processor clears
DR6_BUS_LOCK when the exception is generated. For all other #DB, the
processor sets this bit to 1. Software #DB handler should set this bit
before returning to the interrupted task.
In VMM, to avoid breaking the CPUs without bus lock #DB exception
support, activate the DR6_BUS_LOCK conditionally in DR6_FIXED_1 bits.
When intercepting the #DB exception caused by bus locks, bit 11 of the
exit qualification is set to identify it. The VMM should emulate the
exception by clearing the bit 11 of the guest DR6.
Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Message-Id: <20210202090433.13441-3-chenyi.qiang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If probing MSR_TSC_AUX failed, hide RDTSCP and RDPID, and WARN if either
feature was reported as supported. In theory, such a scenario should
never happen as both Intel and AMD state that MSR_TSC_AUX is available if
RDTSCP or RDPID is supported. But, KVM injects #GP on MSR_TSC_AUX
accesses if probing failed, faults on WRMSR(MSR_TSC_AUX) may be fatal to
the guest (because they happen during early CPU bringup), and KVM itself
has effectively misreported RDPID support in the past.
Note, this also has the happy side effect of omitting MSR_TSC_AUX from
the list of MSRs that are exposed to userspace if probing the MSR fails.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-16-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Squish the Intel and AMD emulation of MSR_TSC_AUX together and tie it to
the guest CPU model instead of the host CPU behavior. While not strictly
necessary to avoid guest breakage, emulating cross-vendor "architecture"
will provide consistent behavior for the guest, e.g. WRMSR fault behavior
won't change if the vCPU is migrated to a host with divergent behavior.
Note, the "new" kvm_is_supported_user_return_msr() checks do not add new
functionality on either SVM or VMX. On SVM, the equivalent was
"tsc_aux_uret_slot < 0", and on VMX the check was buried in the
vmx_find_uret_msr() call at the find_uret_msr label.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-15-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now that SVM and VMX both probe MSRs before "defining" user return slots
for them, consolidate the code for probe+define into common x86 and
eliminate the odd behavior of having the vendor code define the slot for
a given MSR.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-14-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Split out and export the number of configured user return MSRs so that
VMX can iterate over the set of MSRs without having to do its own tracking.
Keep the list itself internal to x86 so that vendor code still has to go
through the "official" APIs to add/modify entries.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-13-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Tag TSX_CTRL as not needing to be loaded when RTM isn't supported in the
host. Crushing the write mask to '0' has the same effect, but requires
more mental gymnastics to understand.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-12-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop VMX's global list of user return MSRs now that VMX doesn't resort said
list to isolate "active" MSRs, i.e. now that VMX's list and x86's list have
the same MSRs in the same order.
In addition to eliminating the redundant list, this will also allow moving
more of the list management into common x86.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-11-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Explicitly flag a uret MSR as needing to be loaded into hardware instead of
resorting the list of "active" MSRs and tracking how many MSRs in total
need to be loaded. The only benefit to sorting the list is that the loop
to load MSRs during vmx_prepare_switch_to_guest() doesn't need to iterate
over all supported uret MRS, only those that are active. But that is a
pointless optimization, as the most common case, running a 64-bit guest,
will load the vast majority of MSRs. Not to mention that a single WRMSR is
far more expensive than iterating over the list.
Providing a stable list order obviates the need to track a given MSR's
"slot" in the per-CPU list of user return MSRs; all lists simply use the
same ordering. Future patches will take advantage of the stable order to
further simplify the related code.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Configure the list of user return MSRs that are actually supported at
module init instead of reprobing the list of possible MSRs every time a
vCPU is created. Curating the list on a per-vCPU basis is pointless; KVM
is completely hosed if the set of supported MSRs changes after module init,
or if the set of MSRs differs per physical PCU.
The per-vCPU lists also increase complexity (see __vmx_find_uret_msr()) and
creates corner cases that _should_ be impossible, but theoretically exist
in KVM, e.g. advertising RDTSCP to userspace without actually being able to
virtualize RDTSCP if probing MSR_TSC_AUX fails.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Allow userspace to enable RDPID for a guest without also enabling RDTSCP.
Aside from checking for RDPID support in the obvious flows, VMX also needs
to set ENABLE_RDTSCP=1 when RDPID is exposed.
For the record, there is no known scenario where enabling RDPID without
RDTSCP is desirable. But, both AMD and Intel architectures allow for the
condition, i.e. this is purely to make KVM more architecturally accurate.
Fixes: 41cd02c6f7 ("kvm: x86: Expose RDPID in KVM_GET_SUPPORTED_CPUID")
Cc: stable@vger.kernel.org
Reported-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Probe MSR_TSC_AUX whether or not RDTSCP is supported in the host, and
if probing succeeds, load the guest's MSR_TSC_AUX into hardware prior to
VMRUN. Because SVM doesn't support interception of RDPID, RDPID cannot
be disallowed in the guest (without resorting to binary translation).
Leaving the host's MSR_TSC_AUX in hardware would leak the host's value to
the guest if RDTSCP is not supported.
Note, there is also a kernel bug that prevents leaking the host's value.
The host kernel initializes MSR_TSC_AUX if and only if RDTSCP is
supported, even though the vDSO usage consumes MSR_TSC_AUX via RDPID.
I.e. if RDTSCP is not supported, there is no host value to leak. But,
if/when the host kernel bug is fixed, KVM would start leaking MSR_TSC_AUX
in the case where hardware supports RDPID but RDTSCP is unavailable for
whatever reason.
Probing MSR_TSC_AUX will also allow consolidating the probe and define
logic in common x86, and will make it simpler to condition the existence
of MSR_TSX_AUX (from the guest's perspective) on RDTSCP *or* RDPID.
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Disable preemption when probing a user return MSR via RDSMR/WRMSR. If
the MSR holds a different value per logical CPU, the WRMSR could corrupt
the host's value if KVM is preempted between the RDMSR and WRMSR, and
then rescheduled on a different CPU.
Opportunistically land the helper in common x86, SVM will use the helper
in a future commit.
Fixes: 4be5341026 ("KVM: VMX: Initialize vmx->guest_msrs[] right after allocation")
Cc: stable@vger.kernel.org
Cc: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-6-seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a dedicated intercept enum for RDPID instead of piggybacking RDTSCP.
Unlike VMX's ENABLE_RDTSCP, RDPID is not bound to SVM's RDTSCP intercept.
Fixes: fb6d4d340e ("KVM: x86: emulate RDPID")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-5-seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Intercept RDTSCP to inject #UD if RDTSC is disabled in the guest.
Note, SVM does not support intercepting RDPID. Unlike VMX's
ENABLE_RDTSCP control, RDTSCP interception does not apply to RDPID. This
is a benign virtualization hole as the host kernel (incorrectly) sets
MSR_TSC_AUX if RDTSCP is supported, and KVM loads the guest's MSR_TSC_AUX
into hardware if RDTSCP is supported in the host, i.e. KVM will not leak
the host's MSR_TSC_AUX to the guest.
But, when the kernel bug is fixed, KVM will start leaking the host's
MSR_TSC_AUX if RDPID is supported in hardware, but RDTSCP isn't available
for whatever reason. This leak will be remedied in a future commit.
Fixes: 46896c73c1 ("KVM: svm: add support for RDTSCP")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-4-seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Do not advertise emulation support for RDPID if RDTSCP is unsupported.
RDPID emulation subtly relies on MSR_TSC_AUX to exist in hardware, as
both vmx_get_msr() and svm_get_msr() will return an error if the MSR is
unsupported, i.e. ctxt->ops->get_msr() will fail and the emulator will
inject a #UD.
Note, RDPID emulation also relies on RDTSCP being enabled in the guest,
but this is a KVM bug and will eventually be fixed.
Fixes: fb6d4d340e ("KVM: x86: emulate RDPID")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-3-seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Clear KVM's RDPID capability if the ENABLE_RDTSCP secondary exec control is
unsupported. Despite being enumerated in a separate CPUID flag, RDPID is
bundled under the same VMCS control as RDTSCP and will #UD in VMX non-root
if ENABLE_RDTSCP is not enabled.
Fixes: 41cd02c6f7 ("kvm: x86: Expose RDPID in KVM_GET_SUPPORTED_CPUID")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-2-seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
While in most cases, when returning to use the VMCB01,
the exit reason stored in it will be SVM_EXIT_VMRUN,
on first VM exit after a nested migration this field
can contain anything since the VM entry did happen
before the migration.
Remove this warning to avoid the false positive.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210504143936.1644378-3-mlevitsk@redhat.com>
Fixes: 9a7de6ecc3 ("KVM: nSVM: If VMRUN is single-stepped, queue the #DB intercept in nested_svm_vmexit()")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>