When Enlightened VMCS is in use by L1 hypervisor we can avoid vmwriting
VMCS fields which did not change.
Our first goal is to achieve minimal impact on traditional VMCS case so
we're not wrapping each vmwrite() with an if-changed checker. We also can't
utilize static keys as Enlightened VMCS usage is per-guest.
This patch implements the simpliest solution: checking fields in groups.
We skip single vmwrite() statements as doing the check will cost us
something even in non-evmcs case and the win is tiny. Unfortunately, this
makes prepare_vmcs02_full{,_full}() code Enlightened VMCS-dependent (and
a bit ugly).
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Per Hyper-V TLFS 5.0b:
"The L1 hypervisor may choose to use enlightened VMCSs by writing 1 to
the corresponding field in the VP assist page (see section 7.8.7).
Another field in the VP assist page controls the currently active
enlightened VMCS. Each enlightened VMCS is exactly one page (4 KB) in
size and must be initially zeroed. No VMPTRLD instruction must be
executed to make an enlightened VMCS active or current.
After the L1 hypervisor performs a VM entry with an enlightened VMCS,
the VMCS is considered active on the processor. An enlightened VMCS
can only be active on a single processor at the same time. The L1
hypervisor can execute a VMCLEAR instruction to transition an
enlightened VMCS from the active to the non-active state. Any VMREAD
or VMWRITE instructions while an enlightened VMCS is active is
unsupported and can result in unexpected behavior."
Keep Enlightened VMCS structure for the current L2 guest permanently mapped
from struct nested_vmx instead of mapping it every time.
Suggested-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Adds hv_evmcs pointer and implement copy_enlightened_to_vmcs12() and
copy_enlightened_to_vmcs12().
prepare_vmcs02()/prepare_vmcs02_full() separation is not valid for
Enlightened VMCS, do full sync for now.
Suggested-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Enlightened VMCS is opt-in. The current version does not contain all
fields supported by nested VMX so we must not advertise the
corresponding VMX features if enlightened VMCS is enabled.
Userspace is given the enlightened VMCS version supported by KVM as
part of enabling KVM_CAP_HYPERV_ENLIGHTENED_VMCS. The version is to
be advertised to the nested hypervisor, currently done via a cpuid
leaf for Hyper-V.
Suggested-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Split off EVMCS1_UNSUPPORTED_* macros so we can re-use them when
enabling Enlightened VMCS for Hyper-V on KVM.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The state related to the VP assist page is still managed by the LAPIC
code in the pv_eoi field.
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The original comment is little hard to understand.
No functional change, just amend the comment a little.
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
rmap_remove() removes the sptep after locating the correct rmap_head but,
in several cases, the caller has already known the correct rmap_head.
This patch introduces a new pte_list_remove(); because it is known that
the spte is present (or it would not have an rmap_head), it is safe
to remove the tracking bits without any previous check.
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Coalesced pio is based on coalesced mmio and can be used for some port
like rtc port, pci-host config port and so on.
Specially in case of rtc as coalesced pio, some versions of windows guest
access rtc frequently because of rtc as system tick. guest access rtc like
this: write register index to 0x70, then write or read data from 0x71.
writing 0x70 port is just as index and do nothing else. So we can use
coalesced pio to handle this scene to reduce VM-EXIT time.
When starting and closing a virtual machine, it will access pci-host config
port frequently. So setting these port as coalesced pio can reduce startup
and shutdown time.
without my patch, get the vm-exit time of accessing rtc 0x70 and piix 0xcf8
using perf tools: (guest OS : windows 7 64bit)
IO Port Access Samples Samples% Time% Min Time Max Time Avg time
0x70:POUT 86 30.99% 74.59% 9us 29us 10.75us (+- 3.41%)
0xcf8:POUT 1119 2.60% 2.12% 2.79us 56.83us 3.41us (+- 2.23%)
with my patch
IO Port Access Samples Samples% Time% Min Time Max Time Avg time
0x70:POUT 106 32.02% 29.47% 0us 10us 1.57us (+- 7.38%)
0xcf8:POUT 1065 1.67% 0.28% 0.41us 65.44us 0.66us (+- 10.55%)
Signed-off-by: Peng Hao <peng.hao2@zte.com.cn>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If ept table pointers are mismatched, flushing tlb for each vcpus via
hv flush interface still helps to reduce vmexits which are triggered
by IPI and INEPT emulation.
Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
x86_64 zero-extends 32bit xor to a full 64bit register. Use %k asm
operand modifier to force 32bit register and save 268 bytes in kvm.o
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Recently the minimum required version of binutils was changed to 2.20,
which supports all VMX instruction mnemonics. The patch removes
all .byte #defines and uses real instruction mnemonics instead.
The compiler is now able to pass memory operand to the instruction,
so there is no need for memory clobber anymore. Also, the compiler
adds CC register clobber automatically to all extended asm clauses,
so the patch also removes explicit CC clobber.
The immediate benefit of the patch is removal of many unnecesary
register moves, resulting in 1434 saved bytes in vmx.o:
text data bss dec hex filename
151257 18246 8500 178003 2b753 vmx.o
152691 18246 8500 179437 2bced vmx-old.o
Some examples of improvement include removal of unneeded moves
of %rsp to %rax in front of invept and invvpid instructions:
a57e: b9 01 00 00 00 mov $0x1,%ecx
a583: 48 89 04 24 mov %rax,(%rsp)
a587: 48 89 e0 mov %rsp,%rax
a58a: 48 c7 44 24 08 00 00 movq $0x0,0x8(%rsp)
a591: 00 00
a593: 66 0f 38 80 08 invept (%rax),%rcx
to:
a45c: 48 89 04 24 mov %rax,(%rsp)
a460: b8 01 00 00 00 mov $0x1,%eax
a465: 48 c7 44 24 08 00 00 movq $0x0,0x8(%rsp)
a46c: 00 00
a46e: 66 0f 38 80 04 24 invept (%rsp),%rax
and the ability to use more optimal registers and memory operands
in the instruction:
8faa: 48 8b 44 24 28 mov 0x28(%rsp),%rax
8faf: 4c 89 c2 mov %r8,%rdx
8fb2: 0f 79 d0 vmwrite %rax,%rdx
to:
8e7c: 44 0f 79 44 24 28 vmwrite 0x28(%rsp),%r8
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Register operand size of invvpid and invept instruction in 64-bit mode
has always 64 bits. Adjust inline function argument type to reflect
correct size.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We don't use root page role for nested_mmu, however, optimizing out
re-initialization in case nothing changed is still valuable as this
is done for every nested vmentry.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
MMU reconfiguration in init_kvm_tdp_mmu()/kvm_init_shadow_mmu() can be
avoided if the source data used to configure it didn't change; enhance
MMU extended role with the required fields and consolidate common code in
kvm_calc_mmu_role_common().
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
MMU re-initialization is expensive, in particular,
update_permission_bitmask() and update_pkru_bitmask() are.
Cache the data used to setup shadow EPT MMU and avoid full re-init when
it is unchanged.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In preparation to MMU reconfiguration avoidance we need a space to
cache source data. As this partially intersects with kvm_mmu_page_role,
create 64bit sized union kvm_mmu_role holding both base and extended data.
No functional change.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Just inline the contents into the sole caller, kvm_init_mmu is now
public.
Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
When EPT is used for nested guest we need to re-init MMU as shadow
EPT MMU (nested_ept_init_mmu_context() does that). When we return back
from L2 to L1 kvm_mmu_reset_context() in nested_vmx_load_cr3() resets
MMU back to normal TDP mode. Add a special 'guest_mmu' so we can use
separate root caches; the improved hit rate is not very important for
single vCPU performance, but it avoids contention on the mmu_lock for
many vCPUs.
On the nested CPUID benchmark, with 16 vCPUs, an L2->L1->L2 vmexit
goes from 42k to 26k cycles.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add an option to specify which MMU root we want to free. This will
be used when nested and non-nested MMUs for L1 are split.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
kvm_init_shadow_ept_mmu() doesn't set get_pdptr() hook and is this
not a problem just because MMU context is already initialized and this
hook points to kvm_pdptr_read(). As we're intended to use a dedicated
MMU for shadow EPT MMU set this hook explicitly.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
As a preparation to full MMU split between L1 and L2 make vcpu->arch.mmu
a pointer to the currently used mmu. For now, this is always
vcpu->arch.root_mmu. No functional change.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
The quote from the comment almost says it all: we are currently zeroing
the guest dr6 in kvm_arch_vcpu_put, because do_debug expects it. However,
the host %dr6 is either:
- zero because the guest hasn't run after kvm_arch_vcpu_load
- written from vcpu->arch.dr6 by vcpu_enter_guest
- written by the guest and copied to vcpu->arch.dr6 by ->sync_dirty_debug_regs().
Therefore, we can skip the write if vcpu->arch.dr6 is already zero. We
may do extra useless writes if vcpu->arch.dr6 is nonzero but the guest
hasn't run; however that is less important for performance.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rewrite kvm_hv_flush_tlb()/send_ipi_vcpus_mask() making them cleaner and
somewhat more optimal.
hv_vcpu_in_sparse_set() is converted to sparse_set_to_vcpu_mask()
which copies sparse banks u64-at-a-time and then, depending on the
num_mismatched_vp_indexes value, returns immediately or does
vp index to vcpu index conversion by walking all vCPUs.
To support the change and make kvm_hv_send_ipi() look similar to
kvm_hv_flush_tlb() send_ipi_vcpus_mask() is introduced.
Suggested-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Regardless of whether your TLB is lush or not it still needs flushing.
Reported-by: Roman Kagan <rkagan@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When early consistency checks are enabled, all VMFail conditions
should be caught by nested_vmx_check_vmentry_hw().
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM defers many VMX consistency checks to the CPU, ostensibly for
performance reasons[1], including checks that result in VMFail (as
opposed to VMExit). This behavior may be undesirable for some users
since this means KVM detects certain classes of VMFail only after it
has processed guest state, e.g. emulated MSR load-on-entry. Because
there is a strict ordering between checks that cause VMFail and those
that cause VMExit, i.e. all VMFail checks are performed before any
checks that cause VMExit, we can detect (almost) all VMFail conditions
via a dry run of sorts. The almost qualifier exists because some
state in vmcs02 comes from L0, e.g. VPID, which means that hardware
will never detect an invalid VPID in vmcs12 because it never sees
said value. Software must (continue to) explicitly check such fields.
After preparing vmcs02 with all state needed to pass the VMFail
consistency checks, optionally do a "test" VMEnter with an invalid
GUEST_RFLAGS. If the VMEnter results in a VMExit (due to bad guest
state), then we can safely say that the nested VMEnter should not
VMFail, i.e. any VMFail encountered in nested_vmx_vmexit() must
be due to an L0 bug. GUEST_RFLAGS is used to induce VMExit as it
is unconditionally loaded on all implementations of VMX, has an
invalid value that is writable on a 32-bit system and its consistency
check is performed relatively early in all implementations (the exact
order of consistency checks is micro-architectural).
Unfortunately, since the "passing" case causes a VMExit, KVM must
be extra diligent to ensure that host state is restored, e.g. DR7
and RFLAGS are reset on VMExit. Failure to restore RFLAGS.IF is
particularly fatal.
And of course the extra VMEnter and VMExit impacts performance.
The raw overhead of the early consistency checks is ~6% on modern
hardware (though this could easily vary based on configuration),
while the added latency observed from the L1 VMM is ~10%. The
early consistency checks do not occur in a vacuum, e.g. spending
more time in L0 can lead to more interrupts being serviced while
emulating VMEnter, thereby increasing the latency observed by L1.
Add a module param, early_consistency_checks, to provide control
over whether or not VMX performs the early consistency checks.
In addition to standard on/off behavior, the param accepts a value
of -1, which is essentialy an "auto" setting whereby KVM does
the early checks only when it thinks it's running on bare metal.
When running nested, doing early checks is of dubious value since
the resulting behavior is heavily dependent on L0. In the future,
the "auto" setting could also be used to default to skipping the
early hardware checks for certain configurations/platforms if KVM
reaches a state where it has 100% coverage of VMFail conditions.
[1] To my knowledge no one has implemented and tested full software
emulation of the VMFail consistency checks. Until that happens,
one can only speculate about the actual performance overhead of
doing all VMFail consistency checks in software. Obviously any
code is slower than no code, but in the grand scheme of nested
virtualization it's entirely possible the overhead is negligible.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
EFER is constant in the host and writing it once during setup means
we can skip writing the host value in add_atomic_switch_msr_special().
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
... as every invocation of nested_vmx_{fail,succeed} is immediately
followed by a call to kvm_skip_emulated_instruction(). This saves
a bit of code and eliminates some silly paths, e.g. nested_vmx_run()
ended up with a goto label purely used to call and return
kvm_skip_emulated_instruction().
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
EFLAGS is set to a fixed value on VMExit, calling nested_vmx_succeed()
is unnecessary and wrong.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
A successful VMEnter is essentially a fancy indirect branch that
pulls the target RIP from the VMCS. Skipping the instruction is
unnecessary (RIP will get overwritten by the VMExit handler) and
is problematic because it can incorrectly suppress a #DB due to
EFLAGS.TF when a VMFail is detected by hardware (happens after we
skip the instruction).
Now that vmx_nested_run() is not prematurely skipping the instr,
use the full kvm_skip_emulated_instruction() in the VMFail path
of nested_vmx_vmexit(). We also need to explicitly update the
GUEST_INTERRUPTIBILITY_INFO when loading vmcs12 host state.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In anticipation of using vmcs02 to do early consistency checks, move
the early preparation of vmcs02 prior to checking the postreqs. The
downside of this approach is that we'll unnecessary load vmcs02 in
the case that check_vmentry_postreqs() fails, but that is essentially
our slow path anyways (not actually slow, but it's the path we don't
really care about optimizing).
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a dedicated flag to track if vmcs02 has been initialized, i.e.
the constant state for vmcs02 has been written to the backing VMCS.
The launched flag (in struct loaded_vmcs) gets cleared on logical
CPU migration to mirror hardware behavior[1], i.e. using the launched
flag to determine whether or not vmcs02 constant state needs to be
initialized results in unnecessarily re-initializing the VMCS when
migrating between logical CPUS.
[1] The active VMCS needs to be VMCLEARed before it can be migrated
to a different logical CPU. Hardware's VMCS cache is per-CPU
and is not coherent between CPUs. VMCLEAR flushes the cache so
that any dirty data is written back to memory. A side effect
of VMCLEAR is that it also clears the VMCS's internal launch
flag, which KVM must mirror because VMRESUME must be used to
run a previously launched VMCS.
Suggested-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add prepare_vmcs02_early() and move pieces of prepare_vmcs02() to the
new function. prepare_vmcs02_early() writes the bits of vmcs02 that
a) must be in place to pass the VMFail consistency checks (assuming
vmcs12 is valid) and b) are needed recover from a VMExit, e.g. host
state that is loaded on VMExit. Splitting the functionality will
enable KVM to leverage hardware to do VMFail consistency checks via
a dry run of VMEnter and recover from a potential VMExit without
having to fully initialize vmcs02.
Add prepare_vmcs02_constant_state() to handle writing vmcs02 state that
comes from vmcs01 and never changes, i.e. we don't need to rewrite any
of the vmcs02 that is effectively constant once defined.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
vmx->pml_pg is allocated by vmx_create_vcpu() and is only nullified
when the vCPU is destroyed by vmx_free_vcpu(). Remove the ASSERTs
on vmx->pml_pg, there is no need to carry debug code that provides
no value to the current code base.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename 'fail' to 'vmentry_fail_vmexit_guest_mode' to make it more
obvious that it's simply a different entry point to the VMExit path,
whose purpose is unwind the updates done prior to calling
prepare_vmcs02().
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Handling all VMExits due to failed consistency checks on VMEnter in
nested_vmx_enter_non_root_mode() consolidates all relevant code into
a single location, and removing nested_vmx_entry_failure() eliminates
a confusing function name and label. For a VMEntry, "fail" and its
derivatives has a very specific meaning due to the different behavior
of a VMEnter VMFail versus VMExit, i.e. it wasn't obvious that
nested_vmx_entry_failure() handled VMExit scenarios.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In preparation of supporting checkpoint/restore for nested state,
commit ca0bde28f2 ("kvm: nVMX: Split VMCS checks from nested_vmx_run()")
modified check_vmentry_postreqs() to only perform the guest EFER
consistency checks when nested_run_pending is true. But, in the
normal nested VMEntry flow, nested_run_pending is only set after
check_vmentry_postreqs(), i.e. the consistency check is being skipped.
Alternatively, nested_run_pending could be set prior to calling
check_vmentry_postreqs() in nested_vmx_run(), but placing the
consistency checks in nested_vmx_enter_non_root_mode() allows us
to split prepare_vmcs02() and interleave the preparation with
the consistency checks without having to change the call sites
of nested_vmx_enter_non_root_mode(). In other words, the rest
of the consistency check code in nested_vmx_run() will be joining
the postreqs checks in future patches.
Fixes: ca0bde28f2 ("kvm: nVMX: Split VMCS checks from nested_vmx_run()")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Jim Mattson <jmattson@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...to be more consistent with the nested VMX nomenclature.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
VM_ENTRY_IA32E_MODE and VM_{ENTRY,EXIT}_LOAD_IA32_EFER will be
explicitly set/cleared as needed by vmx_set_efer(), but attempt
to get the bits set correctly when intializing the control fields.
Setting the value correctly can avoid multiple VMWrites.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Do not unconditionally call clear_atomic_switch_msr() when updating
EFER. This adds up to four unnecessary VMWrites in the case where
guest_efer != host_efer, e.g. if the load_on_{entry,exit} bits were
already set.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reset the vm_{entry,exit}_controls_shadow variables as well as the
segment cache after loading a new VMCS in vmx_switch_vmcs(). The
shadows/cache track VMCS data, i.e. they're stale every time we
switch to a new VMCS regardless of reason.
This fixes a bug where stale control shadows would be consumed after
a nested VMExit due to a failed consistency check.
Suggested-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Write VM_EXIT_CONTROLS using vm_exit_controls_init() when configuring
vmcs02, otherwise vm_exit_controls_shadow will be stale. EFER in
particular can be corrupted if VM_EXIT_LOAD_IA32_EFER is not updated
due to an incorrect shadow optimization, which can crash L0 due to
EFER not being loaded on exit. This does not occur with the current
code base simply because update_transition_efer() unconditionally
clears VM_EXIT_LOAD_IA32_EFER before conditionally setting it, and
because a nested guest always starts with VM_EXIT_LOAD_IA32_EFER
clear, i.e. we'll only ever unnecessarily clear the bit. That is,
until someone optimizes update_transition_efer()...
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
An invalid EPTP causes a VMFail(VMXERR_ENTRY_INVALID_CONTROL_FIELD),
not a VMExit.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Invalid host state related to loading EFER on VMExit causes a
VMFail(VMXERR_ENTRY_INVALID_HOST_STATE_FIELD), not a VMExit.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
update_memslots() is only called by __kvm_set_memory_region(), in which
"change" is calculated and indicates how to adjust slots->used_slots
* increase by one if it is KVM_MR_CREATE
* decrease by one if it is KVM_MR_DELETE
* not change for others
This patch adjusts slots->used_slots in update_memslots() based on "change"
value instead of re-calculate those states again.
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When bit 3 (corresponding to CR0.TS) of the VMCS12 cr0_guest_host_mask
field is clear, the VMCS12 guest_cr0 field does not necessarily hold
the current value of the L2 CR0.TS bit, so the code that checked for
L2's CR0.TS bit being set was incorrect. Moreover, I'm not sure that
the CR0.TS check was adequate. (What if L2's CR0.EM was set, for
instance?)
Fortunately, lazy FPU has gone away, so L0 has lost all interest in
intercepting #NM exceptions. See commit bd7e5b0899 ("KVM: x86:
remove code for lazy FPU handling"). Therefore, there is no longer any
question of which hypervisor gets first dibs. The #NM VM-exit should
always be reflected to L1. (Note that the corresponding bit must be
set in the VMCS12 exception_bitmap field for there to be an #NM
VM-exit at all.)
Fixes: ccf9844e5d ("kvm, vmx: Really fix lazy FPU on nested guest")
Reported-by: Abhiroop Dabral <adabral@paloaltonetworks.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Tested-by: Abhiroop Dabral <adabral@paloaltonetworks.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>