Otherwise, the core module thinks the arch module is loaded, and won't
let you reload it after you've fixed the bug.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Use the standard magic.h for kvmfs.
Cc: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Avi Kivity <avi@qumranet.com>
A bogus 'return r' can cause an otherwise successful module load to fail.
This both denies users the use of kvm, and it also denies them the use of
their machine, as it leaves a filesystem registered with its callbacks
pointing into now-freed module memory.
Fix by returning a zero like a good module.
Thanks to Richard Lucassen <mailinglists@lucassen.org> (?) for reporting
the problem and for providing access to a machine which exhibited it.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Enabling dirty page logging is done using KVM_SET_MEMORY_REGION ioctl.
If the memory region already exists, we need to remove write accesses,
so writes will be caught, and dirty pages will be logged.
Signed-off-by: Uri Lublin <uril@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Since dirty_bitmap is an unsigned long array, the alignment and size need
to take that into account.
Signed-off-by: Uri Lublin <uril@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
A few places where we modify guest memory fail to call mark_page_dirty(),
causing live migration to fail. This adds the missing calls.
Signed-off-by: Uri Lublin <uril@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Allocate a distinct inode for every vcpu in a VM. This has the following
benefits:
- the filp cachelines are no longer bounced when f_count is incremented on
every ioctl()
- the API and internal code are distinctly clearer; for example, on the
KVM_GET_REGS ioctl, there is no need to copy the vcpu number from
userspace and then copy the registers back; the vcpu identity is derived
from the fd used to make the call
Right now the performance benefits are completely theoretical since (a) we
don't support more than one vcpu per VM and (b) virtualization hardware
inefficiencies completely everwhelm any cacheline bouncing effects. But
both of these will change, and we need to prepare the API today.
Signed-off-by: Avi Kivity <avi@qumranet.com>
This reflects the changed scope, from device-wide to single vm (previously
every device open created a virtual machine).
Signed-off-by: Avi Kivity <avi@qumranet.com>
This avoids having filp->f_op and the corresponding inode->i_fop different,
which is a little unorthodox.
The ioctl list is split into two: global kvm ioctls and per-vm ioctls. A new
ioctl, KVM_CREATE_VM, is used to create VMs and return the VM fd.
Signed-off-by: Avi Kivity <avi@qumranet.com>
The kvmfs inodes will represent virtual machines and vcpus, as necessary,
reducing cacheline bouncing due to inodes and filps being shared.
Signed-off-by: Avi Kivity <avi@qumranet.com>
This adds a special MSR based hypercall API to KVM. This is to be
used by paravirtual kernels and virtual drivers.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Besides using an established api, this allows using kvm in older kernels.
Signed-off-by: Markus Rechberger <markus.rechberger@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Add the necessary callbacks to suspend and resume a host running kvm. This is
just a repeat of the cpu hotplug/unplug work.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On hotplug, we execute the hardware extension enable sequence. On unplug, we
decache any vcpus that last ran on the exiting cpu, and execute the hardware
extension disable sequence.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This will allow us to iterate over all vcpus and see which cpus they are
running on.
[akpm@osdl.org: use standard (ugly) initialisers]
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
vcpu_load() can return NULL and it sometimes does in failure paths (for
example when the userspace ABI version is too old) - causing a preemption
count underflow in the ->vcpu_free() later on. So check for NULL.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We report the value of cr8 to userspace on an exit. Also let userspace change
cr8 when we re-enter the guest. The lets 64-bit guest code maintain the tpr
correctly.
Thanks for Yaniv Kamay for the idea.
Signed-off-by: Dor Laor <dor.laor@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This allows netbsd 3.1 i386 to get further along installing.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This makes the vmwrite errors on vm shutdown go away.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Prevent the guest's loading of a corrupt cr3 (pointing at no guest phsyical
page) from crashing the host.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fixes oops on early close of /dev/kvm.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The mmu sometimes needs memory for reverse mapping and parent pte chains.
however, we can't allocate from within the mmu because of the atomic context.
So, move the allocations to a central place that can be executed before the
main mmu machinery, where we can bail out on failure before any damage is
done.
(error handling is deffered for now, but the basic structure is there)
Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
cmpxchg8b uses edx:eax as the compare operand, not edi:eax.
cmpxchg8b is used by 32-bit pae guests to set page table entries atomically,
and this is emulated touching shadowed guest page tables.
Also, implement it for 32-bit hosts.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Since we write protect shadowed guest page tables, there is no need to trap
page invalidations (the guest will always change the mapping before issuing
the invlpg instruction).
Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
A page table may have been recycled into a regular page, and so any
instruction can be executed on it. Unprotect the page and let the cpu do its
thing.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
As the mmu write protects guest page table, we emulate those writes. Since
they are not mmio, there is no need to go to userspace to perform them.
So, perform the writes in the kernel if possible, and notify the mmu about
them so it can take the approriate action.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This lets us not write protect a partial page, and is anyway what a real
processor does.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In pae mode, a load of cr3 loads the four third-level page table entries in
addition to cr3 itself.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Keep in each host page frame's page->private a pointer to the shadow pte which
maps it. If there are multiple shadow ptes mapping the page, set bit 0 of
page->private, and use the rest as a pointer to a linked list of all such
mappings.
Reverse mappings are needed because we when we cache shadow page tables, we
must protect the guest page tables from being modified by the guest, as that
would invalidate the cached ptes.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Hardware virtualization implementations allow the guests to freely change some
of the bits in cr0 and cr4, but trap when changing the other bits. This is
useful to avoid excessive exits due to changing, for example, the ts flag.
It also means the kvm's copy of cr0 and cr4 may be stale with respect to these
bits. most of the time this doesn't matter as these bits are not very
interesting. Other times, however (for example when returning cr0 to
userspace), they are, so get the fresh contents of these bits from the guest
by means of a new arch operation.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The current interrupt injection mechanism might delay an interrupt under
the following circumstances:
- if injection fails because the guest is not interruptible (rflags.IF clear,
or after a 'mov ss' or 'sti' instruction). Userspace can check rflags,
but the other cases or not testable under the current API.
- if injection fails because of a fault during delivery. This probably
never happens under normal guests.
- if injection fails due to a physical interrupt causing a vmexit so that
it can be handled by the host.
In all cases the guest proceeds without processing the interrupt, reducing
the interactive feel and interrupt throughput of the guest.
This patch fixes the situation by allowing userspace to request an exit
when the 'interrupt window' opens, so that it can re-inject the interrupt
at the right time. Guest interactivity is very visibly improved.
Signed-off-by: Dor Laor <dor.laor@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If we load the wrong arch module, it leaves behind kvm_arch_ops set, which
prevents loading of the correct arch module later.
Fix be not setting kvm_arch_ops until we're sure it's good.
Signed-off-by: Yoshimi Ichiyanagi <ichiyanagi.yoshimi@lab.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
fix an GFP_KERNEL allocation in atomic section: kvm_dev_ioctl_create_vcpu()
called kvm_mmu_init(), which calls alloc_pages(), while holding the vcpu.
The fix is to set up the MMU state in two phases: kvm_mmu_create() and
kvm_mmu_setup().
(NOTE: free_vcpus does an kvm_mmu_destroy() call so there's no need for any
extra teardown branch on allocation/init failure here.)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
__free_page() doesn't like a NULL argument, so check before calling it. A
NULL can only happen if memory is exhausted during allocation of a memory
slot.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
These msrs are referenced by benchmarking software when pretending to be an
Intel cpu.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The latest version of kvm doesn't initialize kvm_arch_ops in kvm_init(), which
causes an error with the following sequence.
1. Load the supported arch's module.
2. Load the unsupported arch's module.$B!!(B(loading error)
3. Unload the unsupported arch's module.
You'll get the following error message after step 3. "BUG: unable to handle
to handle kernel paging request at virtual address xxxxxxxx"
The problem here is that the unsupported arch's module overwrites kvm_arch_ops
of the supported arch's module at step 2.
This patch initializes kvm_arch_ops upon loading architecture specific kvm
module, and prevents overwriting kvm_arch_ops when kvm_arch_ops is already set
correctly.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Instead of doing tricky stuff with the arch dependent virtualization
registers, take a peek at the guest's efer.
This simlifies some code, and fixes some confusion in the mmu branch.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add compile-time and run-time API versioning.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Some msrs, such as MSR_STAR, are not available on all processors. Exporting
them causes qemu to try to fetch them, which will fail.
So, check all msrs for validity at module load time.
Signed-off-by: Michael Riepe <michael@mr511.de>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>