- x86: work around two nasty cases where a benign exception occurs while
another is being delivered. The endless stream of exceptions causes an
infinite loop in the processor, which not even NMIs or SMIs can interrupt;
in the virt case, there is no possibility to exit to the host either.
- x86: support for Skylake per-guest TSC rate. Long supported by AMD,
the patches mostly move things from there to common arch/x86/kvm/ code.
- generic: remove local_irq_save/restore from the guest entry and exit
paths when context tracking is enabled. The patches are a few months
old, but we discussed them again at kernel summit. Andy will pick up
from here and, in 4.5, try to remove it from the user entry/exit paths.
- PPC: Two bug fixes, see merge commit 370289756b for details.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJWRFb0AAoJEL/70l94x66DjjMH/31jr8d119MW0uv2x+03+wRq
6dbJ8tjQ8grvBRExKvLsUVjDmHlhCa1BQl5qjCsyYhX9UeAf4NQOmoEFpq+YTLxh
Ctveyn+yiZWC7qxbQDmauiQ4JCOp+W9ial782iqw5+ouQMajGOffq5WrojCa2ZNF
jI278JgdHJLrKj/uie//WBu3V7MJY5Apc3p4zatnSYFSQ3MA0sxl4r4zIrwOa5qs
23ZeeoqbP4sHh4X5wL/30Y6XFSCHj0qoYHHyAgzLi0PCMvBdt4DrAFUPDG/Rhlv6
o1WB/kcUfcz3DtBX85wfSOMuw0nF6patWhWv07R/3EIbYoz3dKvp9d6ORYgXqlY=
=Um9M
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull second batch of kvm updates from Paolo Bonzini:
"Four changes:
- x86: work around two nasty cases where a benign exception occurs
while another is being delivered. The endless stream of exceptions
causes an infinite loop in the processor, which not even NMIs or
SMIs can interrupt; in the virt case, there is no possibility to
exit to the host either.
- x86: support for Skylake per-guest TSC rate. Long supported by
AMD, the patches mostly move things from there to common
arch/x86/kvm/ code.
- generic: remove local_irq_save/restore from the guest entry and
exit paths when context tracking is enabled. The patches are a few
months old, but we discussed them again at kernel summit. Andy
will pick up from here and, in 4.5, try to remove it from the user
entry/exit paths.
- PPC: Two bug fixes, see merge commit 370289756b for details"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (21 commits)
KVM: x86: rename update_db_bp_intercept to update_bp_intercept
KVM: svm: unconditionally intercept #DB
KVM: x86: work around infinite loop in microcode when #AC is delivered
context_tracking: avoid irq_save/irq_restore on guest entry and exit
context_tracking: remove duplicate enabled check
KVM: VMX: Dump TSC multiplier in dump_vmcs()
KVM: VMX: Use a scaled host TSC for guest readings of MSR_IA32_TSC
KVM: VMX: Setup TSC scaling ratio when a vcpu is loaded
KVM: VMX: Enable and initialize VMX TSC scaling
KVM: x86: Use the correct vcpu's TSC rate to compute time scale
KVM: x86: Move TSC scaling logic out of call-back read_l1_tsc()
KVM: x86: Move TSC scaling logic out of call-back adjust_tsc_offset()
KVM: x86: Replace call-back compute_tsc_offset() with a common function
KVM: x86: Replace call-back set_tsc_khz() with a common function
KVM: x86: Add a common TSC scaling function
KVM: x86: Add a common TSC scaling ratio field in kvm_vcpu_arch
KVM: x86: Collect information for setting TSC scaling ratio
KVM: x86: declare a few variables as __read_mostly
KVM: x86: merge handle_mmio_page_fault and handle_mmio_page_fault_common
KVM: PPC: Book3S HV: Don't dynamically split core when already split
...
Because #DB is now intercepted unconditionally, this callback
only operates on #BP for both VMX and SVM.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch makes KVM use virtual_tsc_khz rather than the host TSC rate
as vcpu's TSC rate to compute the time scale if TSC scaling is enabled.
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Both VMX and SVM scales the host TSC in the same way in call-back
read_l1_tsc(), so this patch moves the scaling logic from call-back
read_l1_tsc() to a common function kvm_read_l1_tsc().
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For both VMX and SVM, if the 2nd argument of call-back
adjust_tsc_offset() is the host TSC, then adjust_tsc_offset() will scale
it first. This patch moves this common TSC scaling logic to its caller
adjust_tsc_offset_host() and rename the call-back adjust_tsc_offset() to
adjust_tsc_offset_guest().
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Both VMX and SVM calculate the tsc-offset in the same way, so this
patch removes the call-back compute_tsc_offset() and replaces it with a
common function kvm_compute_tsc_offset().
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Both VMX and SVM propagate virtual_tsc_khz in the same way, so this
patch removes the call-back set_tsc_khz() and replaces it with a common
function.
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
VMX and SVM calculate the TSC scaling ratio in a similar logic, so this
patch generalizes it to a common TSC scaling function.
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
[Inline the multiplication and shift steps into mul_u64_u64_shr. Remove
BUG_ON. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch moves the field of TSC scaling ratio from the architecture
struct vcpu_svm to the common struct kvm_vcpu_arch.
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The number of bits of the fractional part of the 64-bit TSC scaling
ratio in VMX and SVM is different. This patch makes the architecture
code to collect the number of fractional bits and other related
information into variables that can be accessed in the common code.
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
handling.
PPC: Mostly bug fixes.
ARM: No big features, but many small fixes and prerequisites including:
- a number of fixes for the arch-timer
- introducing proper level-triggered semantics for the arch-timers
- a series of patches to synchronously halt a guest (prerequisite for
IRQ forwarding)
- some tracepoint improvements
- a tweak for the EL2 panic handlers
- some more VGIC cleanups getting rid of redundant state
x86: quite a few changes:
- support for VT-d posted interrupts (i.e. PCI devices can inject
interrupts directly into vCPUs). This introduces a new component (in
virt/lib/) that connects VFIO and KVM together. The same infrastructure
will be used for ARM interrupt forwarding as well.
- more Hyper-V features, though the main one Hyper-V synthetic interrupt
controller will have to wait for 4.5. These will let KVM expose Hyper-V
devices.
- nested virtualization now supports VPID (same as PCID but for vCPUs)
which makes it quite a bit faster
- for future hardware that supports NVDIMM, there is support for clflushopt,
clwb, pcommit
- support for "split irqchip", i.e. LAPIC in kernel + IOAPIC/PIC/PIT in
userspace, which reduces the attack surface of the hypervisor
- obligatory smattering of SMM fixes
- on the guest side, stable scheduler clock support was rewritten to not
require help from the hypervisor.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJWO2IQAAoJEL/70l94x66D/K0H/3AovAgYmJQToZlimsktMk6a
f2xhdIqfU5lIQQh5uNBCfL3o9o8H9Py1ym7aEw3fmztPHHJYc91oTatt2UEKhmEw
VtZHp/dFHt3hwaIdXmjRPEXiYctraKCyrhaUYdWmUYkoKi7lW5OL5h+S7frG2U6u
p/hFKnHRZfXHr6NSgIqvYkKqtnc+C0FWY696IZMzgCksOO8jB1xrxoSN3tANW3oJ
PDV+4og0fN/Fr1capJUFEc/fejREHneANvlKrLaa8ht0qJQutoczNADUiSFLcMPG
iHljXeDsv5eyjMtUuIL8+MPzcrIt/y4rY41ZPiKggxULrXc6H+JJL/e/zThZpXc=
=iv2z
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"First batch of KVM changes for 4.4.
s390:
A bunch of fixes and optimizations for interrupt and time handling.
PPC:
Mostly bug fixes.
ARM:
No big features, but many small fixes and prerequisites including:
- a number of fixes for the arch-timer
- introducing proper level-triggered semantics for the arch-timers
- a series of patches to synchronously halt a guest (prerequisite
for IRQ forwarding)
- some tracepoint improvements
- a tweak for the EL2 panic handlers
- some more VGIC cleanups getting rid of redundant state
x86:
Quite a few changes:
- support for VT-d posted interrupts (i.e. PCI devices can inject
interrupts directly into vCPUs). This introduces a new
component (in virt/lib/) that connects VFIO and KVM together.
The same infrastructure will be used for ARM interrupt
forwarding as well.
- more Hyper-V features, though the main one Hyper-V synthetic
interrupt controller will have to wait for 4.5. These will let
KVM expose Hyper-V devices.
- nested virtualization now supports VPID (same as PCID but for
vCPUs) which makes it quite a bit faster
- for future hardware that supports NVDIMM, there is support for
clflushopt, clwb, pcommit
- support for "split irqchip", i.e. LAPIC in kernel +
IOAPIC/PIC/PIT in userspace, which reduces the attack surface of
the hypervisor
- obligatory smattering of SMM fixes
- on the guest side, stable scheduler clock support was rewritten
to not require help from the hypervisor"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (123 commits)
KVM: VMX: Fix commit which broke PML
KVM: x86: obey KVM_X86_QUIRK_CD_NW_CLEARED in kvm_set_cr0()
KVM: x86: allow RSM from 64-bit mode
KVM: VMX: fix SMEP and SMAP without EPT
KVM: x86: move kvm_set_irq_inatomic to legacy device assignment
KVM: device assignment: remove pointless #ifdefs
KVM: x86: merge kvm_arch_set_irq with kvm_set_msi_inatomic
KVM: x86: zero apic_arb_prio on reset
drivers/hv: share Hyper-V SynIC constants with userspace
KVM: x86: handle SMBASE as physical address in RSM
KVM: x86: add read_phys to x86_emulate_ops
KVM: x86: removing unused variable
KVM: don't pointlessly leave KVM_COMPAT=y in non-KVM configs
KVM: arm/arm64: Merge vgic_set_lr() and vgic_sync_lr_elrsr()
KVM: arm/arm64: Clean up vgic_retire_lr() and surroundings
KVM: arm/arm64: Optimize away redundant LR tracking
KVM: s390: use simple switch statement as multiplexer
KVM: s390: drop useless newline in debugging data
KVM: s390: SCA must not cross page boundaries
KVM: arm: Do not indent the arguments of DECLARE_BITMAP
...
Commit b18d5431ac ("KVM: x86: fix CR0.CD virtualization") was
technically correct, but it broke OVMF guests by slowing down various
parts of the firmware.
Commit fb279950ba ("KVM: vmx: obey KVM_QUIRK_CD_NW_CLEARED") quirked the
first function modified by b18d5431ac, vmx_get_mt_mask(), for OVMF's
sake. This restored the speed of the OVMF code that runs before
PlatformPei (including the memory intensive LZMA decompression in SEC).
This patch extends the quirk to the second function modified by
b18d5431ac, kvm_set_cr0(). It eliminates the intrusive slowdown that
hits the EFI_MP_SERVICES_PROTOCOL implementation of edk2's
UefiCpuPkg/CpuDxe -- which is built into OVMF --, when CpuDxe starts up
all APs at once for initialization, in order to count them.
We also carry over the kvm_arch_has_noncoherent_dma() sub-condition from
the other half of the original commit b18d5431ac.
Fixes: b18d5431ac
Cc: stable@vger.kernel.org
Cc: Jordan Justen <jordan.l.justen@intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Tested-by: Janusz Mocek <januszmk6@gmail.com>
Signed-off-by: Laszlo Ersek <lersek@redhat.com>#
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We want to read the physical memory when emulating RSM.
X86EMUL_IO_NEEDED is returned on all errors for consistency with other
helpers.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Tested-by: Laszlo Ersek <lersek@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
removing unused variables, found by coccinelle
Signed-off-by: Saurabh Sengar <saurabh.truth@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pull x86 fpu changes from Ingo Molnar:
"There are two main areas of changes:
- Rework of the extended FPU state code to robustify the kernel's
usage of cpuid provided xstate sizes - and related changes (Dave
Hansen)"
- math emulation enhancements: new modern FPU instructions support,
with testcases, plus cleanups (Denys Vlasnko)"
* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
x86/fpu: Fixup uninitialized feature_name warning
x86/fpu/math-emu: Add support for FISTTP instructions
x86/fpu/math-emu, selftests: Add test for FISTTP instructions
x86/fpu/math-emu: Add support for FCMOVcc insns
x86/fpu/math-emu: Add support for F[U]COMI[P] insns
x86/fpu/math-emu: Remove define layer for undocumented opcodes
x86/fpu/math-emu, selftests: Add tests for FCMOV and FCOMI insns
x86/fpu/math-emu: Remove !NO_UNDOC_CODE
x86/fpu: Check CPU-provided sizes against struct declarations
x86/fpu: Check to ensure increasing-offset xstate offsets
x86/fpu: Correct and check XSAVE xstate size calculations
x86/fpu: Add C structures for AVX-512 state components
x86/fpu: Rework YMM definition
x86/fpu/mpx: Rework MPX 'xstate' types
x86/fpu: Add xfeature_enabled() helper instead of test_bit()
x86/fpu: Remove 'xfeature_nr'
x86/fpu: Rework XSTATE_* macros to remove magic '2'
x86/fpu: Rename XFEATURES_NR_MAX
x86/fpu: Rename XSAVE macros
x86/fpu: Remove partial LWP support definitions
...
As reported at https://bugs.launchpad.net/qemu/+bug/1494350,
it is possible to have vcpu->arch.st.last_steal initialized
from a thread other than vcpu thread, say the iothread, via
KVM_SET_MSRS.
Which can cause an overflow later (when subtracting from vcpu threads
sched_info.run_delay).
To avoid that, move steal time accumulation to vcpu entry time,
before copying steal time data to guest.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM uses eoi_exit_bitmap to track vectors that need an action on EOI.
The problem is that IOAPIC can be reconfigured while an interrupt with
old configuration is pending and eoi_exit_bitmap only remembers the
newest configuration; thus EOI from the pending interrupt is not
recognized.
(Reconfiguration is not a problem for level interrupts, because IOAPIC
sends interrupt with the new configuration.)
For an edge interrupt with ACK notifiers, like i8254 timer; things can
happen in this order
1) IOAPIC inject a vector from i8254
2) guest reconfigures that vector's VCPU and therefore eoi_exit_bitmap
on original VCPU gets cleared
3) guest's handler for the vector does EOI
4) KVM's EOI handler doesn't pass that vector to IOAPIC because it is
not in that VCPU's eoi_exit_bitmap
5) i8254 stops working
A simple solution is to set the IOAPIC vector in eoi_exit_bitmap if the
vector is in PIR/IRR/ISR.
This creates an unwanted situation if the vector is reused by a
non-IOAPIC source, but I think it is so rare that we don't want to make
the solution more sophisticated. The simple solution also doesn't work
if we are reconfiguring the vector. (Shouldn't happen in the wild and
I'd rather fix users of ACK notifiers instead of working around that.)
The are no races because ioapic injection and reconfig are locked.
Fixes: b053b2aef2 ("KVM: x86: Add EOI exit bitmap inference")
[Before b053b2aef2, this bug happened only with APICv.]
Fixes: c7c9c56ca2 ("x86, apicv: add virtual interrupt delivery support")
Cc: <stable@vger.kernel.org>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This merge brings in a couple important SMM fixes, which makes it
easier to test latest KVM with unrestricted_guest=0 and to test
the in-progress work on SMM support in the firmware.
Conflicts:
arch/x86/kvm/x86.c
An SMI to a halted VCPU must wake it up, hence a VCPU with a pending
SMI must be considered runnable.
Fixes: 64d6067057
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Split the huge conditional in two functions.
Fixes: 64d6067057
Cc: stable@vger.kernel.org
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Otherwise, two copies (one of them never populated and thus bogus)
are allocated for the regular and SMM address spaces. This breaks
SMM with EPT but without unrestricted guest support, because the
SMM copy of the identity page map is all zeros.
By moving the allocation to the caller we also remove the last
vestiges of kernel-allocated memory regions (not accessible anymore
in userspace since commit b74a07beed, "KVM: Remove kernel-allocated
memory regions", 2010-06-21); that is a nice bonus.
Reported-by: Alexandre DERUMIER <aderumier@odiso.com>
Cc: stable@vger.kernel.org
Fixes: 9da0e4d5ac
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The next patch will make x86_set_memory_region fill the
userspace_addr. Since the struct is not used untouched
anymore, it makes sense to build it in x86_set_memory_region
directly; it also simplifies the callers.
Reported-by: Alexandre DERUMIER <aderumier@odiso.com>
Cc: stable@vger.kernel.org
Fixes: 9da0e4d5ac
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch updates the Posted-Interrupts Descriptor when vCPU
is blocked.
pre-block:
- Add the vCPU to the blocked per-CPU list
- Set 'NV' to POSTED_INTR_WAKEUP_VECTOR
post-block:
- Remove the vCPU from the per-CPU list
Signed-off-by: Feng Wu <feng.wu@intel.com>
[Concentrate invocation of pre/post-block hooks to vcpu_block. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Select IRQ_BYPASS_MANAGER for x86 when CONFIG_KVM is set
Signed-off-by: Feng Wu <feng.wu@intel.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch adds the routine to update IRTE for posted-interrupts
when guest changes the interrupt configuration.
Signed-off-by: Feng Wu <feng.wu@intel.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
[Squashed in automatically generated patch from the build robot
"KVM: x86: vcpu_to_pi_desc() can be static" - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
HV_X64_MSR_VP_RUNTIME msr used by guest to get
"the time the virtual processor consumes running guest code,
and the time the associated logical processor spends running
hypervisor code on behalf of that guest."
Calculation of this time is performed by task_cputime_adjusted()
for vcpu task.
Necessary to support loading of winhv.sys in guest, which in turn is
required to support Windows VMBus.
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Gleb Natapov <gleb@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Insert Hyper-V HV_X64_MSR_VP_INDEX into msr's emulated list,
so QEMU can set Hyper-V features cpuid HV_X64_MSR_VP_INDEX_AVAILABLE
bit correctly. KVM emulation part is in place already.
Necessary to support loading of winhv.sys in guest, which in turn is
required to support Windows VMBus.
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Gleb Natapov <gleb@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
HV_X64_MSR_RESET msr is used by Hyper-V based Windows guest
to reset guest VM by hypervisor.
Necessary to support loading of winhv.sys in guest, which in turn is
required to support Windows VMBus.
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Gleb Natapov <gleb@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In order to enable userspace PIC support, the userspace PIC needs to
be able to inject local interrupts even when the APICs are in the
kernel.
KVM_INTERRUPT now supports sending local interrupts to an APIC when
APICs are in the kernel.
The ready_for_interrupt_request flag is now only set when the CPU/APIC
will immediately accept and inject an interrupt (i.e. APIC has not
masked the PIC).
When the PIC wishes to initiate an INTA cycle with, say, CPU0, it
kicks CPU0 out of the guest, and renedezvous with CPU0 once it arrives
in userspace.
When the CPU/APIC unmasks the PIC, a KVM_EXIT_IRQ_WINDOW_OPEN is
triggered, so that userspace has a chance to inject a PIC interrupt
if it had been pending.
Overall, this design can lead to a small number of spurious userspace
renedezvous. In particular, whenever the PIC transistions from low to
high while it is masked and whenever the PIC becomes unmasked while
it is low.
Note: this does not buffer more than one local interrupt in the
kernel, so the VMM needs to enter the guest in order to complete
interrupt injection before injecting an additional interrupt.
Compiles for x86.
Can pass the KVM Unit Tests.
Signed-off-by: Steve Rutherford <srutherford@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In order to support a userspace IOAPIC interacting with an in kernel
APIC, the EOI exit bitmaps need to be configurable.
If the IOAPIC is in userspace (i.e. the irqchip has been split), the
EOI exit bitmaps will be set whenever the GSI Routes are configured.
In particular, for the low MSI routes are reservable for userspace
IOAPICs. For these MSI routes, the EOI Exit bit corresponding to the
destination vector of the route will be set for the destination VCPU.
The intention is for the userspace IOAPICs to use the reservable MSI
routes to inject interrupts into the guest.
This is a slight abuse of the notion of an MSI Route, given that MSIs
classically bypass the IOAPIC. It might be worthwhile to add an
additional route type to improve clarity.
Compile tested for Intel x86.
Signed-off-by: Steve Rutherford <srutherford@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Adds KVM_EXIT_IOAPIC_EOI which allows the kernel to EOI
level-triggered IOAPIC interrupts.
Uses a per VCPU exit bitmap to decide whether or not the IOAPIC needs
to be informed (which is identical to the EOI_EXIT_BITMAP field used
by modern x86 processors, but can also be used to elide kvm IOAPIC EOI
exits on older processors).
[Note: A prototype using ResampleFDs found that decoupling the EOI
from the VCPU's thread made it possible for the VCPU to not see a
recent EOI after reentering the guest. This does not match real
hardware.]
Compile tested for Intel x86.
Signed-off-by: Steve Rutherford <srutherford@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
First patch in a series which enables the relocation of the
PIC/IOAPIC to userspace.
Adds capability KVM_CAP_SPLIT_IRQCHIP;
KVM_CAP_SPLIT_IRQCHIP enables the construction of LAPICs without the
rest of the irqchip.
Compile tested for x86.
Signed-off-by: Steve Rutherford <srutherford@google.com>
Suggested-by: Andrew Honig <ahonig@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The interrupt window is currently checked twice, once in vmx.c/svm.c and
once in dm_request_for_irq_injection. The only difference is the extra
check for kvm_arch_interrupt_allowed in dm_request_for_irq_injection,
and the different return value (EINTR/KVM_EXIT_INTR for vmx.c/svm.c vs.
0/KVM_EXIT_IRQ_WINDOW_OPEN for dm_request_for_irq_injection).
However, dm_request_for_irq_injection is basically dead code! Revive it
by removing the checks in vmx.c and svm.c's vmexit handlers, and
fixing the returned values for the dm_request_for_irq_injection case.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Avoid pointer chasing and memory barriers, and simplify the code
when split irqchip (LAPIC in kernel, IOAPIC/PIC in userspace)
is introduced.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We can reuse the algorithm that computes the EOI exit bitmap to figure
out which vectors are handled by the IOAPIC. The only difference
between the two is for edge-triggered interrupts other than IRQ8
that have no notifiers active; however, the IOAPIC does not have to
do anything special for these interrupts anyway.
This again limits the interactions between the IOAPIC and the LAPIC,
making it easier to move the former to userspace.
Inspired by a patch from Steve Rutherford.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Do not compute TMR in advance. Instead, set the TMR just before the interrupt
is accepted into the IRR. This limits the coupling between IOAPIC and LAPIC.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Shifting pvclock_vcpu_time_info.system_time on write to KVM system time
MSR is a change of ABI. Probably only 2.6.16 based SLES 10 breaks due
to its custom enhancements to kvmclock, but KVM never declared the MSR
only for one-shot initialization. (Doc says that only one write is
needed.)
This reverts commit b7e60c5aed.
And adds a note to the definition of PVCLOCK_COUNTS_FROM_ZERO.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Acked-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
These have roughly the same purpose as the SMRR, which we do not need
to implement in KVM. However, Linux accesses MSR_K8_TSEG_ADDR at
boot, which causes problems when running a Xen dom0 under KVM.
Just return 0, meaning that processor protection of SMRAM is not
in effect.
Reported-by: M A Young <m.a.young@durham.ac.uk>
Cc: stable@vger.kernel.org
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This new statistic can help diagnosing VCPUs that, for any reason,
trigger bad behavior of halt_poll_ns autotuning.
For example, say halt_poll_ns = 480000, and wakeups are spaced exactly
like 479us, 481us, 479us, 481us. Then KVM always fails polling and wastes
10+20+40+80+160+320+480 = 1110 microseconds out of every
479+481+479+481+479+481+479 = 3359 microseconds. The VCPU then
is consuming about 30% more CPU than it would use without
polling. This would show as an abnormally high number of
attempted polling compared to the successful polls.
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com<
Reviewed-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There are two concepts that have some confusing naming:
1. Extended State Component numbers (currently called
XFEATURE_BIT_*)
2. Extended State Component masks (currently called XSTATE_*)
The numbers are (currently) from 0-9. State component 3 is the
bounds registers for MPX, for instance.
But when we want to enable "state component 3", we go set a bit
in XCR0. The bit we set is 1<<3. We can check to see if a
state component feature is enabled by looking at its bit.
The current 'xfeature_bit's are at best xfeature bit _numbers_.
Calling them bits is at best inconsistent with ending the enum
list with 'XFEATURES_NR_MAX'.
This patch renames the enum to be 'xfeature'. These also
happen to be what the Intel documentation calls a "state
component".
We also want to differentiate these from the "XSTATE_*" macros.
The "XSTATE_*" macros are a mask, and we rename them to match.
These macros are reasonably widely used so this patch is a
wee bit big, but this really is just a rename.
The only non-mechanical part of this is the
s/XSTATE_EXTEND_MASK/XFEATURE_MASK_EXTEND/
We need a better name for it, but that's another patch.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: dave@sr71.net
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20150902233126.38653250@viggo.jf.intel.com
[ Ported to v4.3-rc1. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The process_smi_save_seg_64() function called only in the
process_smi_save_state_64() if the CONFIG_X86_64 is set. This
patch adds #ifdef CONFIG_X86_64 around process_smi_save_seg_64()
to prevent following warning message:
arch/x86/kvm/x86.c:5946:13: warning: ‘process_smi_save_seg_64’ defined but not used [-Wunused-function]
static void process_smi_save_seg_64(struct kvm_vcpu *vcpu, char *buf, int n)
^
Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pull x86 asm changes from Ingo Molnar:
"The biggest changes in this cycle were:
- Revamp, simplify (and in some cases fix) Time Stamp Counter (TSC)
primitives. (Andy Lutomirski)
- Add new, comprehensible entry and exit handlers written in C.
(Andy Lutomirski)
- vm86 mode cleanups and fixes. (Brian Gerst)
- 32-bit compat code cleanups. (Brian Gerst)
The amount of simplification in low level assembly code is already
palpable:
arch/x86/entry/entry_32.S | 130 +----
arch/x86/entry/entry_64.S | 197 ++-----
but more simplifications are planned.
There's also the usual laudry mix of low level changes - see the
changelog for details"
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (83 commits)
x86/asm: Drop repeated macro of X86_EFLAGS_AC definition
x86/asm/msr: Make wrmsrl() a function
x86/asm/delay: Introduce an MWAITX-based delay with a configurable timer
x86/asm: Add MONITORX/MWAITX instruction support
x86/traps: Weaken context tracking entry assertions
x86/asm/tsc: Add rdtscll() merge helper
selftests/x86: Add syscall_nt selftest
selftests/x86: Disable sigreturn_64
x86/vdso: Emit a GNU hash
x86/entry: Remove do_notify_resume(), syscall_trace_leave(), and their TIF masks
x86/entry/32: Migrate to C exit path
x86/entry/32: Remove 32-bit syscall audit optimizations
x86/vm86: Rename vm86->v86flags and v86mask
x86/vm86: Rename vm86->vm86_info to user_vm86
x86/vm86: Clean up vm86.h includes
x86/vm86: Move the vm86 IRQ definitions to vm86.h
x86/vm86: Use the normal pt_regs area for vm86
x86/vm86: Eliminate 'struct kernel_vm86_struct'
x86/vm86: Move fields from 'struct kernel_vm86_struct' to 'struct vm86'
x86/vm86: Move vm86 fields out of 'thread_struct'
...
s390: timekeeping changes, cleanups and fixes
x86: support for Hyper-V MSRs to report crashes, and a bunch of cleanups.
One interesting feature that was planned for 4.3 (emulating the local
APIC in kernel while keeping the IOAPIC and 8254 in userspace) had to
be delayed because Intel complained about my reading of the manual.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJVznW4AAoJEL/70l94x66Dt+gH/3vydhh6kv+mKhnR+kADaGfM
gaunw0CUpJLU6gkOkYOm5M32WGhsT9Hd3WtRTJO6PhSo7cQ88hMx24u4XAffoewo
Os5tDwAaHeV2enVSTri6xX8e2F2mgPDghGcYJPUBwnmMjRzZ8tj2VHUcbxqVT6Pb
pX3V8ZxOZ81+ACZU2tdNRzLUd2H1v4d74gtVS7ove1Vb0CvPOBdHf1KQuUCUa2Pi
73fvnaEuSaFYtSWZIP1PYxLnsQHpApH3Kco/5kHeqUPpYaGa/g2bnfncHRw20Svr
gb3opwbfyiq91xfGbRVR3+E63Cw4G6aTl5MDNv9UFJ+xFKuj8WJ72xXXTSwzUi4=
=HgT+
-----END PGP SIGNATURE-----
Merge tag 'kvm-4.3-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"A very small release for x86 and s390 KVM.
- s390: timekeeping changes, cleanups and fixes
- x86: support for Hyper-V MSRs to report crashes, and a bunch of
cleanups.
One interesting feature that was planned for 4.3 (emulating the local
APIC in kernel while keeping the IOAPIC and 8254 in userspace) had to
be delayed because Intel complained about my reading of the manual"
* tag 'kvm-4.3-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (42 commits)
x86/kvm: Rename VMX's segment access rights defines
KVM: x86/vPMU: Fix unnecessary signed extension for AMD PERFCTRn
kvm: x86: Fix error handling in the function kvm_lapic_sync_from_vapic
KVM: s390: Fix assumption that kvm_set_irq_routing is always run successfully
KVM: VMX: drop ept misconfig check
KVM: MMU: fully check zero bits for sptes
KVM: MMU: introduce is_shadow_zero_bits_set()
KVM: MMU: introduce the framework to check zero bits on sptes
KVM: MMU: split reset_rsvds_bits_mask_ept
KVM: MMU: split reset_rsvds_bits_mask
KVM: MMU: introduce rsvd_bits_validate
KVM: MMU: move FNAME(is_rsvd_bits_set) to mmu.c
KVM: MMU: fix validation of mmio page fault
KVM: MTRR: Use default type for non-MTRR-covered gfn before WARN_ON
KVM: s390: host STP toleration for VMs
KVM: x86: clean/fix memory barriers in irqchip_in_kernel
KVM: document memory barriers for kvm->vcpus/kvm->online_vcpus
KVM: x86: remove unnecessary memory barriers for shared MSRs
KVM: move code related to KVM_SET_BOOT_CPU_ID to x86
KVM: s390: log capability enablement and vm attribute changes
...
When kvm_set_msr_common() handles a guest's write to
MSR_IA32_TSC_ADJUST, it will calcuate an adjustment based on the data
written by guest and then use it to adjust TSC offset by calling a
call-back adjust_tsc_offset(). The 3rd parameter of adjust_tsc_offset()
indicates whether the adjustment is in host TSC cycles or in guest TSC
cycles. If SVM TSC scaling is enabled, adjust_tsc_offset()
[i.e. svm_adjust_tsc_offset()] will first scale the adjustment;
otherwise, it will just use the unscaled one. As the MSR write here
comes from the guest, the adjustment is in guest TSC cycles. However,
the current kvm_set_msr_common() uses it as a value in host TSC
cycles (by using true as the 3rd parameter of adjust_tsc_offset()),
which can result in an incorrect adjustment of TSC offset if SVM TSC
scaling is enabled. This patch fixes this problem.
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Cc: stable@vger.linux.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>