This code has gone to the wrong place in the file. Move it back to
the right location.
Signed-off-by: Nitin A Kamble <nitin.a.kamble@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
If we defer updating rip until pio instructions are executed, we have a
problem with reset: a pio reset updates rip, and when the instruction
completes we skip the emulated instruction, pointing rip somewhere completely
unrelated.
Fix by updating rip when we decode the instruction, not after emulation.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Some operand fetches are less than the machine word size and can result in
stale bits if used together with operands of different sizes.
Signed-off-by: Nitin A Kamble <nitin.a.kamble@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Implement emulation of instruction
lea r16/r32, m
opcode: 0x8d
Signed-off-by: Nitin A Kamble <nitin.a.kamble@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
According to the Intel Software Developer's Manual, Vol. 3B, Appendix H.4.2,
the exit qualification should be of natural width. However, the current code
uses u64 as the data type for this register, which occasionally
introduces an invalid value into the VM exit handling logic. This patch
fixes this bug.
I have tested Windows and Linux guests on an i386 host, and they boot
successfully with this patch.
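As a rough sketch of the fix (vmcs_readl() and EXIT_QUALIFICATION are the
names vmx.c uses; the wrapper function here is only illustrative):

/* Read the exit qualification at its natural width instead of forcing
 * a 64-bit read into a u64; on i386 the upper half is not meaningful. */
static unsigned long get_exit_qualification(void)
{
        return vmcs_readl(EXIT_QUALIFICATION);
}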
Signed-off-by: Qing He <qing.he@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Before preempt notifiers, kvm needed to allocate memory with GFP_NOWAIT so
as not to have to enable preemption and take a heavyweight exit. On oom, we'd
fall back to a GFP_KERNEL allocation.
With preemption notifiers, we can do a GFP_KERNEL allocation, and perform
the heavyweight exit only if the kernel decides to put us to sleep.
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch just renames the current (misnamed) _arch namings to _x86 to
ensure better readability when a real arch layer takes place.
Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
The mutex->spinlock conversion allows us to make some code simplifications.
As we can hold the lock longer, we no longer have to release it and then
check whether the environment has been modified before re-taking it. We
can remove kvm->busy and kvm->memory_config_version.
Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
SVM gets the DB and L bits for the cs by decoding the segment. This
is in fact completely generic code, so hoist it for kvm-lite to use.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
We don't update the vcpu control registers in various places. We
should do so.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
invlpg shouldn't fetch the "src" address, since it may not be valid;
however SVM's "solution", which neuters emulation of all group 7
instructions, is horrible and breaks kvm-lite. The simplest fix is to
put in a special check for invlpg.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This was missed when moving stuff around in fbc4f2e
Fixes Solaris guests and bug #1773613
Signed-off-by: Amit Shah <amit.shah@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch enables INIT/SIPI handling using in-kernel APIC by
introducing a ->mp_state field to emulate the SMP state transition.
[avi: remove smp_processor_id() warning]
Signed-off-by: Qing He <qing.he@intel.com>
Signed-off-by: Xin Li <xin.b.li@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch changes PIC interrupt delivery. An interrupt is now delivered
to vcpu0 only when one of the following conditions is met (on vcpu0):
1. the local APIC is hardware disabled
2. LVT0 is unmasked and configured for delivery mode ExtInt
This fixes the 2x faster wall clock on x86_64 and SMP i386 Linux guests.
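Roughly, the acceptance check looks like this (a sketch; the helper names
follow the apic definitions but are not the exact kvm code):

/* Should vcpu0 accept a PIC (ExtInt) interrupt?  Only if its local APIC
 * is hardware disabled, or LVT0 is unmasked and set to ExtInt delivery. */
static int pic_irq_accepted_by_vcpu0(struct kvm_lapic *apic)
{
        u32 lvt0;

        if (!apic || !apic_hw_enabled(apic))
                return 1;                               /* condition 1 */

        lvt0 = apic_get_reg(apic, APIC_LVT0);
        if (!(lvt0 & APIC_LVT_MASKED) &&
            (lvt0 & APIC_MODE_MASK) == APIC_DM_EXTINT)
                return 1;                               /* condition 2 */

        return 0;
}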
Signed-off-by: Eddie (Yaozu) Dong <eddie.dong@intel.com>
Signed-off-by: Qing He <qing.he@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This reduces the overhead of accessing cachelines from the wrong node, as well
as simplifying locking.
[Qing: fix for inactive or expired one-shot timer]
Signed-off-by: Yaozu (Eddie) Dong <Eddie.Dong@intel.com>
Signed-off-by: Qing He <qing.he@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
The APIC timer IRQ is set every time a period expires in host time, but
the guest may be descheduled at that moment, so the irq can be
overwritten by a later firing. This patch keeps track of the number of
fired irqs and decreases it only when an IRQ is injected into the guest
or buffered in the APIC.
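The bookkeeping is roughly this (a sketch with illustrative names):

struct kvm_lapic {
        /* ... */
        atomic_t timer_pending;         /* periods fired but not delivered */
};

static void timer_fired(struct kvm_lapic *apic)
{
        atomic_inc(&apic->timer_pending);       /* from the timer callback */
}

static void timer_irq_delivered(struct kvm_lapic *apic)
{
        /* decrement only when the IRQ is injected into the guest or
         * buffered in the APIC, so a descheduled guest does not lose
         * timer ticks */
        if (atomic_read(&apic->timer_pending) > 0)
                atomic_dec(&apic->timer_pending);
}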
Signed-off-by: Yaozu (Eddie) Dong <Eddie.Dong@intel.com>
Signed-off-by: Qing He <qing.he@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch enables the VMX TPR shadow on CR8 access. 64-bit Windows
accesses the TPR through CR8 frequently. The TPR shadow improves the
performance of TPR accesses by not causing a vmexit.
Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com>
Signed-off-by: Qing He <qing.he@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch adds a new vcpu-based IOCTL to save and restore the local
apic registers for a single vcpu. The kernel only copies the apic page as
a whole; extraction of registers is left to the userspace side. On restore, the
APIC timer is restarted from the initial count; this introduces a little
delay, but works fine.
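From userspace the new ioctls are used roughly like this (a sketch; the
state structure is just the raw APIC register page):

struct kvm_lapic_state state;

/* save: the kernel copies the apic register page as a whole */
if (ioctl(vcpu_fd, KVM_GET_LAPIC, &state) < 0)
        perror("KVM_GET_LAPIC");

/* ... pick apart or edit individual registers in userspace ... */

/* restore: the APIC timer restarts from its initial count */
if (ioctl(vcpu_fd, KVM_SET_LAPIC, &state) < 0)
        perror("KVM_SET_LAPIC");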
Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com>
Signed-off-by: Qing He <qing.he@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch adds support for in-kernel ioapic save and restore (to
and from userspace). It uses the same get/set_irqchip ioctl as
in-kernel PIC.
Signed-off-by: Qing He <qing.he@intel.com>
Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
vcpu->irq_pending is saved in get/set_sreg IOCTL, but when in-kernel
local APIC is used, doing this may occasionally overwrite vcpu->apic with
an invalid value, as in the vm restore path.
Signed-off-by: Qing He <qing.he@intel.com>
This patch adds two new ioctls to dump and write kernel irqchips for
save/restore and live migration. PIC s/r and l/m are implemented in this
patch.
Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com>
Signed-off-by: Qing He <qing.he@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
The pio operation and the IRQ_LINE kvm_vm_ioctl are not kvm->lock
protected. Add locking to match the IOAPIC MMIO operations.
Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
By sleeping in the kernel when hlt is executed, we simplify the in-kernel
guest interrupt path considerably.
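Conceptually the in-kernel hlt path becomes a wait on the vcpu (a sketch;
names are illustrative):

/* Emulated hlt: sleep in the kernel until an interrupt or a signal
 * makes the vcpu runnable again, instead of exiting to userspace. */
static void vcpu_block_on_hlt(struct kvm_vcpu *vcpu)
{
        DECLARE_WAITQUEUE(wait, current);

        add_wait_queue(&vcpu->wq, &wait);
        while (!kvm_cpu_has_interrupt(vcpu) && !signal_pending(current)) {
                set_current_state(TASK_INTERRUPTIBLE);
                schedule();
        }
        set_current_state(TASK_RUNNING);
        remove_wait_queue(&vcpu->wq, &wait);
}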
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Because lightweight exits (exits which don't involve userspace) are many
times faster than heavyweight exits, it makes sense to emulate high usage
devices in the kernel. The local APIC is one such device, especially for
Windows and for SMP, so we add an APIC model to kvm.
It also allows in-kernel host-side drivers to inject interrupts without
going through userspace.
[compile fix on i386 from Jindrich Makovicka]
Signed-off-by: Yaozu (Eddie) Dong <Eddie.Dong@intel.com>
Signed-off-by: Qing He <qing.he@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch wraps the APIC base register and CR8 operations, providing a
single API for the user-level irqchip and the kernel irqchip.
This is a preparation for merging the lapic/ioapic patch.
Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
vmx_load_host_state() bundles fs, gs, ldt, and tss reloading into
one in the hope that it is infrequent. With smp guests, fs reloading is
frequent due to fs being used by threads.
Unbundle the reloads to reduce expensive gs reloads.
Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Implement emulation of instruction
and al imm8 (opcode 0x24)
and ax/eax imm16/imm32 (opcode 0x25)
Signed-off-by: Nitin A Kamble <nitin.a.kamble@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
We need to check for signals inside the critical section, otherwise a
signal can be sent which we will not notice. Also move the check
before entry, so that if the signal happens before the first entry,
we exit immediately instead of waiting for something to happen to the
guest.
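The shape of the fix, roughly (sketch only):

/* Check for pending signals inside the critical section, before the
 * guest entry, so a signal that arrives early is not missed. */
local_irq_disable();
if (signal_pending(current)) {
        local_irq_enable();
        kvm_run->exit_reason = KVM_EXIT_INTR;
        return -EINTR;          /* let userspace handle the signal */
}
/* ... enter guest mode ... */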
Signed-off-by: Avi Kivity <avi@qumranet.com>
Split kvm_setup_pio() into two functions, one to setup in/out pio
(kvm_emulate_pio()) and one to setup ins/outs pio (kvm_emulate_pio_string()).
Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Both vmx and svm decode the I/O instructions, and both botch the job,
requiring the instruction prefixes to be fetched in order to completely
decode the instruction.
So, if we see a string I/O instruction, use the x86 emulator to decode it,
as it already has all the prefix decoding machinery.
This patch defines ins/outs opcodes in x86_emulate.c and calls
emulate_instruction() from io_interception() (svm.c) and from handle_io()
(vmx.c). It removes all vmx/svm prefix instruction decoders
(get_addr_size(), io_get_override(), io_address(), get_io_count())
Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Line 1809 of kvm_main.c is useless; its value is overwritten at line 1815:
1809 now = min(count, PAGE_SIZE / size);
1810
1811 if (!down)
1812 in_page = PAGE_SIZE - offset_in_page(address);
1813 else
1814 in_page = offset_in_page(address) + size;
1815 now = min(count, (unsigned long)in_page / size);
1816 if (!now) {
Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Remove a duplicated ia32e mode VM Entry control definition and use the
proper one.
Signed-off-by: Xin Li <xin.b.li@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
We use kfree in svm.c and vmx.c, and this works, but it could break at
any time. kfree() is supposed to match up with kmalloc().
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
All guest-invokable printks should be ratelimited to prevent malicious
guests from flooding logs. This is a start.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
We shouldn't define stat_set on the debug attributes, since that will
cause silent failure on writing: without a set argument, userspace
will get -EACCES.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
move_msr_up() is used only on X86_64 and generates a warning on !X86_64
Signed-off-by: Gabriel Craciunescu <nix.or.die@googlemail.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
alloc_vmcs_cpu is already declared (static) above, no need to
redeclare.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
set_msr_interception() is used by svm to set up which MSRs should be
intercepted. It can only fail if someone has changed the code to try
to intercept an MSR without updating the array of ranges.
The return value is ignored anyway: it should just BUG() if it doesn't
work. (A build-time failure would be better, but that's tricky).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
For some reason, mark_page_dirty open-codes __gfn_to_memslot().
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
All the physical CPUs on the board should support the same VMX feature
set. Add check_processor_compatibility to kvm_arch_ops for the consistency
check.
Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
kvm_vm_ioctl_get_dirty_log scans the bitmap to see if it's all zero, but
doesn't use that information.
Avi says:
Looks like it was used to guard kvm_mmu_slot_remove_write_access();
optimizing the case where the guest just leaves the screen alone (which
it usually does, especially in benchmarks).
I'd rather reinstate that optimization. See
90cb0529dd where the damage was done.
It's pretty simple: if the bitmap is all zero, we don't need to do anything to
clean it.
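The reinstated optimization is essentially (a sketch of the dirty-log path):

/* Only write-protect the slot again if the guest actually dirtied
 * something since the last KVM_GET_DIRTY_LOG call. */
any = 0;
for (i = 0; i < n / sizeof(long); ++i)
        any |= memslot->dirty_bitmap[i];

if (any) {
        kvm_mmu_slot_remove_write_access(kvm, log->slot);
        memset(memslot->dirty_bitmap, 0, n);
}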
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Now that we use a kmem cache for allocating vcpus, we can get the 16-byte
alignment required by fxsave & fxrstor instructions, and avoid
manually aligning the buffer.
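A sketch of the cache creation (the exact kmem_cache_create() arguments
have varied across kernel versions):

/* Let the slab allocator provide the 16-byte alignment fxsave/fxrstor
 * need, instead of over-allocating and aligning by hand. */
kvm_vcpu_cache = kmem_cache_create("kvm_vcpu", vcpu_size, 16,
                                   SLAB_HWCACHE_ALIGN, NULL);
if (!kvm_vcpu_cache)
        return -ENOMEM;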
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Avi wants the allocations of vcpus centralized again. The easiest way
is to add a "size" arg to kvm_init_arch, and expose the thus-prepared
cache to the modules.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
... in favor of the more general emulator_{read,write}_*.
Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
... instead of a x86_emulate_ctxt, so that other callers can use it easily.
Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Changes some svm.c internal function names:
1) io_adress -> io_address (de-germanify the spelling)
2) kvm_reput_irq -> reput_irq (it's not a generic kvm function)
3) kvm_do_inject_irq -> (it's not a generic kvm function)
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
container_of is wonderful, but not casting at all is better. This
patch changes svm.c's internal functions to pass "struct vcpu_svm"
instead of "struct kvm_vcpu" and using container_of.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
There are several places where hardcoded numbers are used in place of
the easily-available constant, which is poor form.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
container_of is wonderful, but not casting at all is better. This
patch changes vmx.c's internal functions to pass "struct vcpu_vmx"
instead of "struct kvm_vcpu" and using container_of.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Now that kvm generally runs with preemption enabled, we need to protect
the fpu initialization sequence.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This allows the kvm mmu to perform sleepy operations, such as memory
allocation.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Current kvm disables preemption while the new virtualization registers are
in use. This of course is not very good for latency sensitive workloads (one
use of virtualization is to offload user interface and other latency
insensitive stuff to a container, so that it is easier to analyze the
remaining workload). This patch re-enables preemption for kvm; preemption
is now only disabled when switching the registers in and out, and during
the switch to guest mode and back.
Contains fixes from Shaohua Li <shaohua.li@intel.com>.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Add the hypercall number to kvm_run and initialize it. This changes the ABI,
but as this particular ABI was unusable before, no users are affected.
Signed-off-by: Jeff Dike <jdike@linux.intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Put the cpu feature detection part in hardware_setup, and store the vmcs
configuration in a global variable for later checks.
[glommer: fix for some i386-only machines not supporting CR8 load/store
exiting]
Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch converts the vcpus array in "struct kvm" to a pointer
array, and changes the "vcpu_create" and "vcpu_setup" hooks into one
"vcpu_create" call which does the allocation and initialization of the
vcpu (calling back into the kvm_vcpu_init core helper).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
struct kvm_vcpu has vmx-specific members; remove them to a private structure.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
load_pdptrs can be handed an invalid cr3, and it should not oops.
This can happen because we injected #gp in set_cr3() after we set
vcpu->cr3 to the invalid value, or from kvm_vcpu_ioctl_set_sregs(), or
memory configuration changes after the guest did set_cr3().
We should also copy the pdpte array once, before checking and
assigning, otherwise an SMP guest can potentially alter the values
between the check and the set.
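The copy-before-check idea, sketched (mapped_guest_page and
PDPTE_RESERVED_BITS are placeholders, not the real identifiers):

/* Read the guest pdpte array once into a local copy, validate it, and
 * only then publish it, so an SMP guest cannot change the entries
 * between the check and the assignment. */
u64 pdpte[4];
int i;

memcpy(pdpte, mapped_guest_page + offset, sizeof(pdpte));
for (i = 0; i < ARRAY_SIZE(pdpte); ++i)
        if ((pdpte[i] & 1) && (pdpte[i] & PDPTE_RESERVED_BITS))
                return 0;                       /* invalid: reserved bits set */
memcpy(vcpu->pdptrs, pdpte, sizeof(vcpu->pdptrs));
return 1;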
Finally one nitpick: ret = 1 should be done as late as possible: this
allows GCC to check for unset "ret" should the function change in
future.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
The writeback fixes (02c03a326a) left
some dead code in the cmpxchg instruction emulation. Remove it.
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch mainly imports some vmcs constants and renames two existing
ones according to the IA32 SDM.
It also adds two constants for the Lock bit and the Enable bit in
MSR_IA32_FEATURE_CONTROL, and replaces the hardcoded 5 with these two
bits.
Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
gfn_to_page might sleep with swap support. Move it out of the kmap calls.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
vmx_cpu_run doesn't handle errors correctly, and kvm_mmu_reload might
sleep with the mutex changes, so move it above.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Right now, the bug is harmless as we never emulate one-byte 0xb6 or 0xb7.
But things may change.
Noted by the mysterious Gabriel C.
Signed-off-by: Avi Kivity <avi@qumranet.com>
The Intel manual (and the KVM definition) says the TPR is 4 bits wide. Also fix
CR8_RESEVED_BITS typo.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Creating one's own BITMAP macro seems suboptimal: if we use manual
arithmetic in the one place exposed to userspace, we can use standard
macros elsewhere.
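For reference, the standard macro this leans on in kernel code (the bitmap
name and size here are only illustrative):

#include <linux/types.h>

/* DECLARE_BITMAP(name, bits) expands to an unsigned long array big
 * enough to hold 'bits' bits: */
DECLARE_BITMAP(irq_pending, 256);

/* ...instead of a private BITMAP() macro doing the same arithmetic.
 * The one userspace-visible structure keeps the manual arithmetic,
 * since kernel macros are not exported there. */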
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
On this machine (Intel), writing to CR4 bits 0x00000800 and
0x00001000 causes a GPF. The Intel manual is a little unclear, but
AFAICT they're reserved, too.
Also fix spelling of CR4_RESEVED_BITS.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
The kernel now has asm/cpu-features.h: use those macros instead of inventing
our own.
Also spell out definition of CR3_RESEVED_BITS, fix spelling and
tighten it for the non-PAE case.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
The kernel now has asm/cpu-features.h: use those macros instead of
inventing our own.
Also spell out definition of CR0_RESEVED_BITS (no code change) and fix typo.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
I have shied away from touching x86_emulate.c (it could definitely use
some love, but it is forked from the Xen code, and it would be more
productive to cross-merge fixes).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Add string pio write support to support some version of Windows.
Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch adds a `vcpu_id' field in `struct vcpu', so we can
differentiate BSP and APs without pointer comparison or arithmetic.
Signed-off-by: Qing He <qing.he@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
*nopage() in kvm_main.c should only store the type of mmap() fault if
the pointers are not NULL. This patch fixes the problem.
Signed-off-by: Nguyen Anh Quynh <aquynh@gmail.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
What guest drivers?
Cc: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A guest context switch to an uncached cr3 can require allocation of
shadow pages, but we only recycle shadow pages in kvm_mmu_page_fault().
Move shadow page recycling to mmu_topup_memory_caches(), which is called
from both the page fault handler and from guest cr3 reload.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When taking a cpu down, we need to hardware_disable() it.
Unfortunately, the CPU_DYING notifier is called with interrupts
disabled, which means we can't use smp_call_function_single().
Fortunately, the CPU_DYING notifier is always called on the dying cpu,
so we don't need to use the function at all and can simply call
hardware_disable() directly.
Tested-by: Paolo Ornati <ornati@fastwebnet.it>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
More fallout from the writeback fixes: debug register transfer
instructions do their own writeback and thus need to disable the general
writeback mechanism.
This fixes oopses and some guest failures on AMD machines (the Intel
variant decodes the instruction in hardware and thus does not need
emulation).
Cc: Alistair John Strachan <alistair@devzero.co.uk>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
0x0f 0x01 instructions (i.e. lgdt, lidt, smsw, lmsw and invlpg) do
not use writeback. This patch sets no_wb=1 when emulating those
instructions.
This fixes a regression booting the FreeBSD kernel on AMD.
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Testing the wrong bit caused kvm not to disable nx on the guest when it is
disabled on the host (an mmu optimization relies on the nx bits being the
same in the guest and host).
This allows Windows to boot when nx is disabled on the host (e.g. when
host pae is disabled).
Signed-off-by: Avi Kivity <avi@qumranet.com>
This reverts commit a3c870bdce. While it
does save useless updates, it (probably) defeats the fork detector, causing
a massive performance loss.
Signed-off-by: Avi Kivity <avi@qumranet.com>
We add the kvm to the vm_list before initializing the vcpu mutexes,
which can be mutex_trylock()'ed by decache_vcpus_on_cpu().
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Writes that are contiguous in virtual memory may not be contiguous in
physical memory, so split writes that straddle a page boundary.
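The splitting logic is conceptually (a sketch; the one-page helper name
follows the kvm emulator but treat it as an assumption):

/* A virtually contiguous write may span two physical pages; split it so
 * each piece stays within one page. */
static int emulator_write_emulated(unsigned long addr, const void *val,
                                   unsigned int bytes, struct kvm_vcpu *vcpu)
{
        unsigned int in_page = PAGE_SIZE - offset_in_page(addr);

        if (bytes > in_page) {                  /* straddles a page boundary */
                int rc = emulator_write_emulated_onepage(addr, val,
                                                         in_page, vcpu);
                if (rc)
                        return rc;
                addr  += in_page;
                val   += in_page;               /* gcc void * arithmetic */
                bytes -= in_page;
        }
        return emulator_write_emulated_onepage(addr, val, bytes, vcpu);
}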
Thanks to Aurelien for reporting the bug, patient testing, and a fix
to this very patch.
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
__free_page() wants a struct page, not a virtual address.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kvm mmu uses page->private on shadow page tables; so does slub, and
an oops results. Fix by allocating regular pages for shadows instead of
using slub.
Tested-by: S.Çağlar Onur <caglar@pardus.org.tr>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Allow real-mode emulation of rdmsr and wrmsr. This allows smp Windows to
boot, presumably for its sipi trampoline.
Signed-off-by: Avi Kivity <avi@qumranet.com>
The memory slot management functions were oriented against vcpu 0, where
they should be kvm-wide. This causes hangs starting X on guest smp.
Fix by making the functions (and resultant tail in the mmu) non-vcpu-specific.
Unfortunately this reduces the efficiency of the mmu object cache a bit. We
may have to revisit this later.
Signed-off-by: Avi Kivity <avi@qumranet.com>
We need to distinguish between large page shadows which have the nx bit set
and those which don't. The problem shows up when booting a newer smp Linux
kernel, where the trampoline page (which is in real mode, which uses the
same shadow pages as large pages) is using the same mapping as a kernel data
page, which is mapped using nx, causing kvm to spin on that page.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Slab destructors were no longer supported after Christoph's
c59def9f22 change. They've been
BUGs for both slab and slub, and slob never supported them
either.
This rips out support for the dtor pointer from kmem_cache_create()
completely and fixes up every single callsite in the kernel (there were
about 224, not including the slab allocator definitions themselves,
or the documentation references).
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Currently, CONFIG_X86_CMPXCHG64 both enables boot-time checking of
the cmpxchg64b feature and enables compilation of the set_64bit() family.
Since the option is dependent on PAE, and since KVM depends on set_64bit(),
this effectively disables KVM on i386 nopae.
Simplify by removing the config option altogether: the boot check is made
dependent on CONFIG_X86_PAE directly, and the set_64bit() family is exposed
without constraints. It is up to users to check for the feature flag (KVM
does not, as virtualization extensions imply its existence).
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Only at the CPU_DYING stage can we be sure that no user process will
be scheduled onto the cpu and oops when trying to use virtualization
extensions.
Signed-off-by: Avi Kivity <avi@qumranet.com>
The hotplug IPIs can be called from the cpu we are currently
running on, so use on_cpu(). Similarly, drop on_each_cpu() for the
suspend/resume callbacks, as we're in atomic context here and only one
cpu is up anyway.
Signed-off-by: Avi Kivity <avi@qumranet.com>
By keeping track of which cpus have virtualization enabled, we
prevent double-enable or double-disable during hotplug, which is a
very fatal oops.
Signed-off-by: Avi Kivity <avi@qumranet.com>
kvm uses a pseudo filesystem, kvmfs, to generate inodes, a job that the
new anonymous inodes source does much better.
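In today's kernels the replacement boils down to something like this (the
anon_inode_getfd() signature shown is the current one and is an assumption
for the original patch):

/* Hand out a file descriptor backed by the anonymous inode code instead
 * of creating inodes on a private kvmfs mount. */
fd = anon_inode_getfd("kvm-vm", &kvm_vm_fops, kvm, O_RDWR);
if (fd < 0)
        kvm_destroy_vm(kvm);            /* illustrative error handling */
return fd;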
Cc: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch adds an implementation to the svm is_disabled function to
detect reliably if the BIOS disabled the SVM feature in the CPU. This
fixes the issues with kernel panics when loading the kvm-amd module on
machines where SVM is available but disabled.
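The reliable check reads the VM_CR MSR and tests its SVM-disable bit,
roughly (constant names as in the AMD manual; treat the exact identifiers
as assumptions):

/* The BIOS disables SVM by setting VM_CR.SVMDIS; cpuid may still report
 * the SVM feature bit, so check the MSR too. */
static int is_disabled(void)
{
        u64 vm_cr;

        rdmsrl(MSR_VM_CR, vm_cr);
        if (vm_cr & (1ULL << SVM_VM_CR_SVM_DISABLE))
                return 1;

        return 0;
}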
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Protected mode code may have corrupted the real-mode tss, so re-initialize
it when switching to real mode.
Signed-off-by: Avi Kivity <avi@qumranet.com>
When writing to normal memory and the memory area is unchanged the write
can be safely skipped, avoiding the costly kvm_mmu_pte_write.
Signed-Off-By: Luca Tettamanti <kronos.it@gmail.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
When the old value and new one are the same the emulator skips the
write; this is undesirable when the destination is an MMIO area and the
write shall be performed regardless of the previous value. This
optimization breaks e.g. a Linux guest APIC compiled without
X86_GOOD_APIC.
Remove the check and perform the writeback stage in the emulation unless
it's explicitly disabled (currently push and some two-byte instructions
may disable the writeback).
Signed-Off-By: Luca Tettamanti <kronos.it@gmail.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
With kernel-injected interrupts, we need to check for interrupts on
lightweight exits too.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
If the time stamp counter goes backwards, a guest delay loop can become
infinite. This can happen if a vcpu is migrated to another cpu, where
the counter has a lower value than the first cpu.
Since we're doing an IPI to the first cpu anyway, we can use that to pick
up the old tsc, and use that to calculate the adjustment we need to make
to the tsc offset.
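The adjustment is conceptually (a sketch; field and accessor names follow
vmx but are not guaranteed to match the patch exactly):

/* When migrating a vcpu to a new physical cpu, fold the tsc delta into
 * the guest's tsc offset so the guest never sees the counter go
 * backwards. */
if (vcpu->cpu != cpu) {
        u64 tsc_this, delta;

        rdtscll(tsc_this);
        delta = vcpu->host_tsc - tsc_this;      /* host_tsc was sampled on
                                                   the old cpu via the IPI */
        vmcs_write64(TSC_OFFSET, vmcs_read64(TSC_OFFSET) + delta);
}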
Signed-off-by: Avi Kivity <avi@qumranet.com>
When a vcpu causes a shadow tlb entry to have reduced permissions, it
must also clear the tlb on remote vcpus. We do that by:
- setting a bit on the vcpu that requests a tlb flush before the next entry
- if the vcpu is currently executing, we send an ipi to make sure it
exits before we continue
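In outline (a sketch with illustrative request-bit and callback names):

static void ack_flush(void *arg)
{
        /* nothing to do - the IPI itself forces the vmexit we need */
}

static void flush_remote_tlbs(struct kvm *kvm)
{
        int i;

        for (i = 0; i < KVM_MAX_VCPUS; ++i) {
                struct kvm_vcpu *vcpu = kvm->vcpus[i];

                if (!vcpu)
                        continue;
                /* flush before the next guest entry */
                set_bit(KVM_TLB_FLUSH, &vcpu->requests);
                /* force currently-running vcpus to exit first */
                if (vcpu->guest_mode)
                        smp_call_function_single(vcpu->cpu, ack_flush,
                                                 NULL, 1);
        }
}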
Signed-off-by: Avi Kivity <avi@qumranet.com>
A vcpu can pin up to four mmu shadow pages, which means the freeing
loop will never terminate. Fix by first unpinning shadow pages on
all vcpus, then freeing shadow pages.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Switching the guest paging context may require us to allocate memory, which
might fail. Instead of wiring up error paths everywhere, make context
switching lazy and actually do the switch before the next guest entry,
where we can return an error if allocation fails.
Signed-off-by: Avi Kivity <avi@qumranet.com>
This was once used to avoid accessing the guest pte when upgrading
the shadow pte from read-only to read-write. But usually we need
to set the guest pte dirty or accessed bits anyway, so this wasn't
really exploited.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Always set the accessed and dirty bit (since having them cleared causes
a read-modify-write cycle), always set the present bit, and copy the
nx bit from the guest.
Signed-off-by: Avi Kivity <avi@qumranet.com>
With guest smp, a second vcpu might see partial updates when the first
vcpu services a page fault. So delay all updates until we have figured
out what the pte should look like.
Note that on i386, this is still not completely atomic as a 64-bit write
will be split into two on a 32-bit machine.
Signed-off-by: Avi Kivity <avi@qumranet.com>
This prevents some work from being performed twice, and, more importantly,
reduces the number of places where we modify shadow ptes.
Signed-off-by: Avi Kivity <avi@qumranet.com>
We will need the accessed bit (in addition to the dirty bit) and
also write access (for setting the dirty bit) in a future patch.
Signed-off-by: Avi Kivity <avi@qumranet.com>
KVM compilation fails for some .configs. This fixes it.
Signed-off-by: Markus Rechberger <markus.rechberger@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Make a "menuconfig" out of the Kconfig objects "menu, ..., endmenu",
so that the user can disable all the options in that menu at once
instead of having to disable each option separately.
Signed-off-by: Jan Engelhardt <jengelh@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Avi Kivity <avi@qumranet.com>
The MSR_EFER.LME/LMA bits are automatically saved/restored by VMX
hardware; KVM only needs to save the NX/SCE bits at the time of a
heavyweight VM exit. But clearing the NX bit in the host environment
may hang the system if the host page tables use EXB bits, so we leave
the NX bit as it is. If host NX=1 and guest NX=0, we can check the
guest page table EXB bits before inserting a shadow pte (though no
guest is expecting to see this kind of gp fault). If host NX=0, we
present no Execute-Disable feature to the guest, so there is no host
NX=0, guest NX=1 combination.
This patch reduces raw vmexit time by ~27%.
Me: fix compile warnings on i386.
Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
In a lightweight exit (where we exit and reenter the guest without
scheduling or exiting to userspace in between), we don't need various
msrs on the host, and avoiding shuffling them around reduces raw exit
time by 8%.
i386 compile fix by Daniel Hecken <dh@bahntechnik.de>.
Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Instructions with the address size override prefix (opcode 0x67)
cause a #SS fault with error code 0 in VM86 mode. Forward
them to the emulator.
Signed-Off-By: Nitin A Kamble <nitin.a.kamble@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
kunmap() expects a struct page, not a virtual address. Fixes an oops loading
kvm-intel.ko on i386 with CONFIG_HIGHMEM.
Thanks to Michael Ivanov <deruhu@peterstar.ru> for reporting.
Signed-off-by: Avi Kivity <avi@qumranet.com>
The real mode tr needs to be set to a specific tss so that I/O
instructions can function. Divert the new tr values to the real
mode save area from where they will be restored on transition to
protected mode.
This fixes some crashes on reboot when the bios accesses an I/O
instruction.
Signed-off-by: Avi Kivity <avi@qumranet.com>
If we set an msr via an ioctl() instead of by handling a guest exit, we
have the host state loaded, so reloading the msrs would clobber host
state instead of guest state.
This fixes a host oops (and loss of a cpu) on a guest reboot.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Attempting to boot the default 'bsd' kernel of OpenBSD 4.1 i386 in a guest
fails early in the kernel init inside p3_get_bus_clock while trying to read
the IA32_EBL_CR_POWERON MSR. KVM logs an 'unhandled MSR' message and the
guest kernel faults.
This patch is sufficient to allow OpenBSD to boot, after which it seems to
run fine. I'm not sure if this is the correct solution for dealing with
this particular MSR, but it works for me.
Signed-off-by: Matthew Gregan <kinetik@flim.org>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Everyone owns a piece of the exception bitmap, but they happily write to
the entire thing like there's no tomorrow. Centralize handling in
update_exception_bitmap() and have everyone call that.
Signed-off-by: Avi Kivity <avi@qumranet.com>
The lightweight vmexit path avoids saving and reloading certain host
state. However, in certain cases lightweight vmexit handling can
schedule(), which requires reloading the host state.
So we store the host state in the vcpu structure, and reload it when we
relinquish the vcpu.
Signed-off-by: Avi Kivity <avi@qumranet.com>
A typical demand page/copy on write pattern is:
- page fault on vaddr
- kvm propagates fault to guest
- guest handles fault, updates pte
- kvm traps write, clears shadow pte, resumes guest
- guest returns to userspace, re-faults on same vaddr
- kvm installs shadow pte, resumes guest
- guest continues
So, three vmexits for a single guest page fault. But if instead of clearing
the page table entry, we update it to correspond to the value that the guest
has just written, we eliminate the third vmexit.
This patch does exactly that, reducing kbuild time by about 10%.
Signed-off-by: Avi Kivity <avi@qumranet.com>
When a guest writes to a page that has an mmu shadow, we have to clear
the shadow pte corresponding to the memory location touched by the guest.
Now, in nonpae mode, a single guest page may have two or four shadow
pages (because a nonpae page maps 4MB or 4GB, whereas the pae shadow maps
2MB or 1GB), so when we look up the page we find up to three additional
aliases for the page. Since we _clear_ the shadow pte, it doesn't matter
except for a slight performance penalty, but if we want to _update_ the
shadow pte instead of clearing it, it is vital that we don't modify the
aliases.
Fortunately, exactly which page is needed (the "quadrant") is easily
computed, and is accessible in the shadow page header. All we need is
to ignore shadow pages from the wrong quadrants.
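In rough terms (a sketch; only the pte-page case is shown and the helper
is illustrative):

/* A nonpae guest page table page (1024 4-byte entries) is shadowed by
 * two pae shadow pages (512 8-byte entries each); a nonpae page
 * directory by four.  role.quadrant records which piece a shadow page
 * covers, so a guest pte write is applied only to the matching one. */
static int write_hits_this_shadow_page(struct kvm_mmu_page *sp,
                                       unsigned int offset_in_guest_page)
{
        /* guest entries are 4 bytes, shadow entries 8, so the write lands
         * in quadrant offset*2 / PAGE_SIZE of the guest page */
        unsigned int quadrant = (offset_in_guest_page * 2) / PAGE_SIZE;

        return quadrant == sp->role.quadrant;
}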
Signed-off-by: Avi Kivity <avi@qumranet.com>
Instead of calling two functions and repeating expensive checks, call one
function and provide it with before/after information.
Signed-off-by: Avi Kivity <avi@qumranet.com>
i386 wants fs for accessing the pda even on a lightweight exit, so ensure
we can always restore it. This fixes a regression on i386 introduced by
the lightweight vmexit patch.
Signed-off-by: Avi Kivity <avi@qumranet.com>
The kvm mmu tries to detects forks by looking for repeated writes to a
page table. If it sees a fork, it unshadows the page table so the page
table copying can proceed at native speed instead of being emulated.
However, the detector also triggered on simple demand paging access patterns:
a linear walk of memory would of course cause repeated writes to the same
pagetable page, causing it to be unshadowed prematurely.
Fix by resetting the fork detector if we detect a demand fault.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Many msrs and the like will only be used by the host if we schedule() or
return to userspace. Therefore, we avoid saving them if we handle the
exit within the kernel, and if a reschedule is not requested.
Based on a patch from Eddie Dong <eddie.dong@intel.com> with a couple of
fixes by me.
Signed-off-by: Yaozu(Eddie) Dong <eddie.dong@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This allows us to remove write protection earlier than otherwise. Should
some mad OS choose to use byte writes to update pagetables, it will suffer
a performance hit, but still work correctly.
Signed-off-by: Avi Kivity <avi@qumranet.com>
The PC debug port is used for IO delay and does not require emulation.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch enables I/O bitmap control on vmx and unmasks port 0x80 to
avoid VM exits caused by accessing it. Port 0x80 is used for delays (see
include/asm/io.h), and handling VM exits on its access is unnecessary
and slows things down. This patch improves the kernel build test by
around 3%~5%.
Because every VM uses the same io bitmap, it is shared between
all VMs rather than being a per-VM data structure.
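The setup is conceptually (a sketch; the real code allocates two bitmap
pages and points the VMCS at them):

/* Intercept all ports by default, then clear the bit for port 0x80 so
 * the io-delay writes do not cause a vmexit. */
unsigned long *io_bitmap_a = (unsigned long *)__get_free_page(GFP_KERNEL);
if (!io_bitmap_a)
        return -ENOMEM;
memset(io_bitmap_a, 0xff, PAGE_SIZE);   /* trap everything... */
clear_bit(0x80, io_bitmap_a);           /* ...except the delay port */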
Signed-off-by: Qing He <qing.he@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
The lazy fpu changes did not take into account that some vmexit handlers
can sleep. Move loading the guest state into the inner loop so that it
can be reloaded if necessary, and move loading the host state into
vmx_vcpu_put() so it can be performed whenever we relinquish the vcpu.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Fix following section mismatch warning in kvm-intel.o:
WARNING: o-i386/drivers/kvm/kvm-intel.o(.init.text+0xbd): Section mismatch: reference to .exit.text: (between 'hardware_setup' and 'vmx_disabled_by_bios')
The function free_kvm_area is used in the function alloc_kvm_area which
is marked __init.
The __exit area is discarded by some archs during link-time if a
module is built-in resulting in an oops.
Note: This warning is only seen by my local copy of modpost
but the change will soon hit upstream.
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Cc: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The first thing mm.h does is include sched.h, solely for the
can_do_mlock() inline function, which dereferences "current". By
dealing with can_do_mlock(), mm.h can be detached from sched.h, which
is good. See below why.
This patch
a) removes unconditional inclusion of sched.h from mm.h
b) makes can_do_mlock() a normal function in mm/mlock.c (see the sketch after this list)
c) exports can_do_mlock() to not break compilation
d) adds sched.h inclusions back to files that were getting it indirectly.
e) adds less bloated headers to some files (asm/signal.h, jiffies.h) that were
getting them indirectly
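The out-of-line version is small; roughly (a sketch of the moved function
as of this era):

/* mm/mlock.c - no longer inline, so mm.h no longer needs sched.h just
 * for the "current" dereference */
int can_do_mlock(void)
{
        if (capable(CAP_IPC_LOCK))
                return 1;
        if (current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur != 0)
                return 1;
        return 0;
}
EXPORT_SYMBOL(can_do_mlock);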
Net result is:
a) mm.h users would get less code to open, read, preprocess, parse, ... if
they don't need sched.h
b) sched.h stops being dependency for significant number of files:
on x86_64 allmodconfig touching sched.h results in recompile of 4083 files,
after patch it's only 3744 (-8.3%).
Cross-compile tested on
all arm defconfigs, all mips defconfigs, all powerpc defconfigs,
alpha alpha-up
arm
i386 i386-up i386-defconfig i386-allnoconfig
ia64 ia64-up
m68k
mips
parisc parisc-up
powerpc powerpc-up
s390 s390-up
sparc sparc-up
sparc64 sparc64-up
um-x86_64
x86_64 x86_64-up x86_64-defconfig x86_64-allnoconfig
as well as my two usual configs.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Refine some depends statements to limit their visibility to the
environments that are actually supported.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>