linux/arch
Sean Christopherson 05e788500b KVM: nVMX: restore host state in nested_vmx_vmexit for VMFail
[ Upstream commit bd18bffca3 ]

A VMEnter that VMFails (as opposed to VMExits) does not touch host
state beyond registers that are explicitly noted in the VMFail path,
e.g. EFLAGS.  Host state does not need to be loaded because VMFail
is only signaled for consistency checks that occur before the CPU
starts to load guest state, i.e. there is no need to restore any
state as nothing has been modified.  But in the case where a VMFail
is detected by hardware and not by KVM (due to deferring consistency
checks to hardware), KVM has already loaded some amount of guest
state.  Luckily, "loaded" only means loaded to KVM's software model,
i.e. vmcs01 has not been modified.  So, unwind our software model to
the pre-VMEntry host state.

Not restoring host state in this VMFail path leads to a variety of
failures because we end up with stale data in vcpu->arch, e.g. CR0,
CR4, EFER, etc... will all be out of sync relative to vmcs01.  Any
significant delta in the stale data is all but guaranteed to crash
L1, e.g. emulation of SMEP, SMAP, UMIP, WP, etc... will be wrong.

An alternative to this "soft" reload would be to load host state from
vmcs12 as if we triggered a VMExit (as opposed to VMFail), but that is
wildly inconsistent with respect to the VMX architecture, e.g. an L1
VMM with separate VMExit and VMFail paths would explode.

Note that this approach does not mean KVM is 100% accurate with
respect to VMX hardware behavior, even at an architectural level
(the exact order of consistency checks is microarchitecture specific).
But 100% emulation accuracy isn't the goal (with this patch), rather
the goal is to be consistent in the information delivered to L1, e.g.
a VMExit should not fall-through VMENTER, and a VMFail should not jump
to HOST_RIP.

This technically reverts commit "5af4157388ad (KVM: nVMX: Fix mmu
context after VMLAUNCH/VMRESUME failure)", but retains the core
aspects of that patch, just in an open coded form due to the need to
pull state from vmcs01 instead of vmcs12.  Restoring host state
resolves a variety of issues introduced by commit "4f350c6dbcb9
(kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly)",
which remedied the incorrect behavior of treating VMFail like VMExit
but in doing so neglected to restore arch state that had been modified
prior to attempting nested VMEnter.

A sample failure that occurs due to stale vcpu.arch state is a fault
of some form while emulating an LGDT (due to emulated UMIP) from L1
after a failed VMEntry to L3, in this case when running the KVM unit
test test_tpr_threshold_values in L1.  L0 also hits a WARN in this
case due to a stale arch.cr4.UMIP.

L1:
  BUG: unable to handle kernel paging request at ffffc90000663b9e
  PGD 276512067 P4D 276512067 PUD 276513067 PMD 274efa067 PTE 8000000271de2163
  Oops: 0009 [#1] SMP
  CPU: 5 PID: 12495 Comm: qemu-system-x86 Tainted: G        W         4.18.0-rc2+ #2
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:native_load_gdt+0x0/0x10

  ...

  Call Trace:
   load_fixmap_gdt+0x22/0x30
   __vmx_load_host_state+0x10e/0x1c0 [kvm_intel]
   vmx_switch_vmcs+0x2d/0x50 [kvm_intel]
   nested_vmx_vmexit+0x222/0x9c0 [kvm_intel]
   vmx_handle_exit+0x246/0x15a0 [kvm_intel]
   kvm_arch_vcpu_ioctl_run+0x850/0x1830 [kvm]
   kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm]
   do_vfs_ioctl+0x9f/0x600
   ksys_ioctl+0x66/0x70
   __x64_sys_ioctl+0x16/0x20
   do_syscall_64+0x4f/0x100
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

L0:
  WARNING: CPU: 2 PID: 3529 at arch/x86/kvm/vmx.c:6618 handle_desc+0x28/0x30 [kvm_intel]
  ...
  CPU: 2 PID: 3529 Comm: qemu-system-x86 Not tainted 4.17.2-coffee+ #76
  Hardware name: Intel Corporation Kabylake Client platform/KBL S
  RIP: 0010:handle_desc+0x28/0x30 [kvm_intel]

  ...

  Call Trace:
   kvm_arch_vcpu_ioctl_run+0x863/0x1840 [kvm]
   kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm]
   do_vfs_ioctl+0x9f/0x5e0
   ksys_ioctl+0x66/0x70
   __x64_sys_ioctl+0x16/0x20
   do_syscall_64+0x49/0xf0
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 5af4157388 (KVM: nVMX: Fix mmu context after VMLAUNCH/VMRESUME failure)
Fixes: 4f350c6dbc (kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly)
Cc: Jim Mattson <jmattson@google.com>
Cc: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim KrÄmář <rkrcmar@redhat.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-04-20 09:15:05 +02:00
..
alpha alpha: Fix Eiger NR_IRQS to 128 2019-02-20 10:20:53 +01:00
arc arc: hsdk_defconfig: Enable CONFIG_BLK_DEV_RAM 2019-04-20 09:14:59 +02:00
arm ARM: samsung: Limit SAMSUNG_PM_CHECK config option to non-Exynos platforms 2019-04-20 09:15:05 +02:00
arm64 arm64: dts: rockchip: Fix vcc_host1_5v GPIO polarity on rk3328-rock64 2019-04-17 08:37:55 +02:00
blackfin pinctrl: adi2: Fix Kconfig build problem 2017-12-20 10:10:34 +01:00
c6x License cleanup: add SPDX license identifier to uapi header files with a license 2017-11-02 11:20:11 +01:00
cris bug.h: work around GCC PR82365 in BUG() 2018-05-30 07:52:00 +02:00
frv License cleanup: add SPDX license identifier to uapi header files with a license 2017-11-02 11:20:11 +01:00
h8300 h8300: use cc-cross-prefix instead of hardcoding h8300-unknown-linux- 2019-04-05 22:31:25 +02:00
hexagon hexagon: modify ffs() and fls() to return int 2018-10-10 08:54:25 +02:00
ia64 ia64/err-inject: Use get_user_pages_fast() 2018-05-30 07:52:11 +02:00
m32r m32r: fix endianness constraints 2018-02-28 10:19:44 +01:00
m68k m68k: Add -ffreestanding to CFLAGS 2019-03-23 14:35:21 +01:00
metag .gitignore: move *.dtb and *.dtb.S patterns to the top-level .gitignore 2018-02-13 10:19:46 +01:00
microblaze microblaze: Fix simpleImage format generation 2018-08-03 07:50:40 +02:00
mips MIPS: Fix kernel crash for R6 in jump label branch function 2019-03-27 14:13:52 +09:00
mn10300 mn10300/misalignment: Use SIGSEGV SEGV_MAPERR to report a failed user copy 2018-02-16 20:23:11 +01:00
nios2 .gitignore: move *.dtb and *.dtb.S patterns to the top-level .gitignore 2018-02-13 10:19:46 +01:00
openrisc openrisc: entry: Fix delay slot exception detection 2018-08-24 13:09:11 +02:00
parisc parisc: regs_return_value() should return gpr28 2019-04-17 08:37:52 +02:00
powerpc powerpc/pseries: Remove prrn_work workqueue 2019-04-20 09:15:04 +02:00
s390 s390/setup: fix boot crash for machine without EDAT-1 2019-03-23 14:35:32 +01:00
score License cleanup: add SPDX license identifier to uapi header files with no license 2017-11-02 11:19:54 +01:00
sh sh: fix build failure for J2 cpu with SMP disabled 2018-06-21 04:02:54 +09:00
sparc sparc64: Make proc_id signed. 2018-11-13 11:14:50 -08:00
tile fcntl: don't cap l_start and l_end values for F_GETLK64 in compat syscall 2017-12-17 15:07:59 +01:00
um um: Avoid marking pages with "changed protection" 2019-02-12 19:46:08 +01:00
unicore32 kmemcheck: stop using GFP_NOTRACK and SLAB_NOTRACK 2018-02-22 15:42:23 +01:00
x86 KVM: nVMX: restore host state in nested_vmx_vmexit for VMFail 2019-04-20 09:15:05 +02:00
xtensa xtensa: fix return_address 2019-04-17 08:37:54 +02:00
.gitignore
Kconfig compiler.h: Allow arch-specific asm/compiler.h 2018-11-04 14:52:46 +01:00