The tracepoint is only used to audit mmu code, it should not be exposed to
user, let us replace it with jump-label.
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Export these two symbols, they will be used by KVM mmu audit
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch cleans and simplifies kvm_dev_ioctl_get_supported_cpuid by using a table
instead of duplicating code as Avi suggested.
This patch also fixes a bug where kvm_dev_ioctl_get_supported_cpuid would return
-E2BIG when amount of entries passed was just right.
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Intel latest cpu add 6 new features, refer http://software.intel.com/file/36945
The new feature cpuid listed as below:
1. FMA CPUID.EAX=01H:ECX.FMA[bit 12]
2. MOVBE CPUID.EAX=01H:ECX.MOVBE[bit 22]
3. BMI1 CPUID.EAX=07H,ECX=0H:EBX.BMI1[bit 3]
4. AVX2 CPUID.EAX=07H,ECX=0H:EBX.AVX2[bit 5]
5. BMI2 CPUID.EAX=07H,ECX=0H:EBX.BMI2[bit 8]
6. LZCNT CPUID.EAX=80000001H:ECX.LZCNT[bit 5]
This patch expose these features to guest.
Among them, FMA/MOVBE/LZCNT has already been defined, MOVBE/LZCNT has
already been exposed.
This patch defines BMI1/AVX2/BMI2, and exposes FMA/BMI1/AVX2/BMI2 to guest.
Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
INSB : 6C
INSW/INSD : 6D
OUTSB : 6E
OUTSW/OUTSD: 6F
The I/O port address is read from the DX register when we decode the
operand because we see the SrcDX/DstDX flag is set.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
This fixes byte accesses to IOAPIC_REG_SELECT as mandated by at least the
ICH10 and Intel Series 5 chipset specs. It also makes ioapic_mmio_write
consistent with ioapic_mmio_read, which also allows byte and word accesses.
Signed-off-by: Julian Stecklina <js@alien8.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
There is the same struct definition in ia64 and kvm common code:
arch/ia64/kvm//kvm-ia64.c: At top level:
arch/ia64/kvm//kvm-ia64.c:777:8: error: redefinition of ‘struct kvm_io_range’
include/linux/kvm_host.h:62:8: note: originally defined here
So, rename kvm_io_range to kvm_ia64_io_range in ia64 code
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
The operation of getting dirty log is frequent when framebuffer-based
displays are used(for example, Xwindow), so, we introduce a mapping table
to speed up id_to_memslot()
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Sort memslots base on its size and use line search to find it, so that the
larger memslots have better fit
The idea is from Avi
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Introduce id_to_memslot to get memslot by slot id
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Introduce kvm_for_each_memslot to walk all valid memslot
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Introduce update_memslots to update slot which will be update to
kvm->memslots
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
vmx_load_host_state() does not handle msrs switching (except
MSR_KERNEL_GS_BASE) since commit 26bb0981b3. Remove call to it
where it is no longer make sense.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Currently, write protecting a slot needs to walk all the shadow pages
and checks ones which have a pte mapping a page in it.
The walk is overly heavy when dirty pages in that slot are not so many
and checking the shadow pages would result in unwanted cache pollution.
To mitigate this problem, we use rmap_write_protect() and check only
the sptes which can be reached from gfns marked in the dirty bitmap
when the number of dirty pages are less than that of shadow pages.
This criterion is reasonable in its meaning and worked well in our test:
write protection became some times faster than before when the ratio of
dirty pages are low and was not worse even when the ratio was near the
criterion.
Note that the locking for this write protection becomes fine grained.
The reason why this is safe is descripted in the comments.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
Needed for the next patch which uses this number to decide how to write
protect a slot.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
rmap_write_protect() calls gfn_to_rmap() for each level with gfn fixed.
This results in calling gfn_to_memslot() repeatedly with that gfn.
This patch introduces __gfn_to_rmap() which takes the slot as an
argument to avoid this.
This is also needed for the following dirty logging optimization.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
Remove redundant checks and use is_large_pte() macro.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
Use kmemdup rather than duplicating its implementation
The semantic patch that makes this change is available
in scripts/coccinelle/api/memdup.cocci.
More information about semantic patching is available at
http://coccinelle.lip6.fr/
Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
The host side pv mmu support has been marked for feature removal in
January 2011. It's not in use, is slower than shadow or hardware
assisted paging, and a maintenance burden. It's November 2011, time to
remove it.
Signed-off-by: Chris Wright <chrisw@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This has not been used for some years now. It's time to remove it.
Signed-off-by: Chris Wright <chrisw@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
My testing version of Smatch complains that addr and len come from
the user and they can wrap. The path is:
-> kvm_vm_ioctl()
-> kvm_vm_ioctl_unregister_coalesced_mmio()
-> coalesced_mmio_in_range()
I don't know what the implications are of wrapping here, but we may
as well fix it, if only to silence the warning.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
The vcpu reference of a kvm_timer can't become NULL while the timer is
valid, so drop this redundant test. This also makes it pointless to
carry a separate __kvm_timer_fn, fold it into kvm_timer_fn.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
The kvm_host struct can include an mmu_notifier struct but mmu_notifier.h is
not included directly.
Signed-off-by: Eric B Munson <emunson@mgebm.net>
Signed-off-by: Avi Kivity <avi@redhat.com>
Detecting write-flooding does not work well, when we handle page written, if
the last speculative spte is not accessed, we treat the page is
write-flooding, however, we can speculative spte on many path, such as pte
prefetch, page synced, that means the last speculative spte may be not point
to the written page and the written page can be accessed via other sptes, so
depends on the Accessed bit of the last speculative spte is not enough
Instead of detected page accessed, we can detect whether the spte is accessed
after it is written, if the spte is not accessed but it is written frequently,
we treat is not a page table or it not used for a long time
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Sometimes, we only modify the last one byte of a pte to update status bit,
for example, clear_bit is used to clear r/w bit in linux kernel and 'andb'
instruction is used in this function, in this case, kvm_mmu_pte_write will
treat it as misaligned access, and the shadow page table is zapped
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
kvm_mmu_pte_write is too long, we split it for better readable
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
In kvm_mmu_pte_write, we do not need to alloc shadow page, so calling
kvm_mmu_free_some_pages is really unnecessary
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Fast prefetch spte for the unsync shadow page on invlpg path
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Directly Use mmu_page_zap_pte to zap spte in FNAME(invlpg), also remove the
same code between FNAME(invlpg) and FNAME(sync_page)
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
In current code, the accessed bit is always set when page fault occurred,
do not need to set it on pte write path
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Remove the same code between emulator_pio_in_emulated and
emulator_pio_out_emulated
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
If the emulation is caused by #PF and it is non-page_table writing instruction,
it means the VM-EXIT is caused by shadow page protected, we can zap the shadow
page and retry this instruction directly
The idea is from Avi
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
The idea is from Avi:
| tag instructions that are typically used to modify the page tables, and
| drop shadow if any other instruction is used.
| The list would include, I'd guess, and, or, bts, btc, mov, xchg, cmpxchg,
| and cmpxchg8b.
This patch is used to tag the instructions and in the later path, shadow page
is dropped if it is written by other instructions
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
kvm_mmu_pte_write is unsafe since we need to alloc pte_list_desc in the
function when spte is prefetched, unfortunately, we can not know how many
spte need to be prefetched on this path, that means we can use out of the
free pte_list_desc object in the cache, and BUG_ON() is triggered, also some
path does not fill the cache, such as INS instruction emulated that does not
trigger page fault
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
When L0 wishes to inject an interrupt while L2 is running, it emulates an exit
to L1 with EXIT_REASON_EXTERNAL_INTERRUPT. This was explained in the original
nVMX patch 23, titled "Correct handling of interrupt injection".
Unfortunately, it is possible (though rare) that at this point there is valid
idt_vectoring_info in vmcs02. For example, L1 injected some interrupt to L2,
and when L2 tried to run this interrupt's handler, it got a page fault - so
it returns the original interrupt vector in idt_vectoring_info. The problem
is that if this is the case, we cannot exit to L1 with EXTERNAL_INTERRUPT
like we wished to, because the VMX spec guarantees that idt_vectoring_info
and exit_reason_external_interrupt can never happen together. This is not
just specified in the spec - a KVM L1 actually prints a kernel warning
"unexpected, valid vectoring info" if we violate this guarantee, and some
users noticed these warnings in L1's logs.
In order to better emulate a processor, which would never return the external
interrupt and the idt-vectoring-info together, we need to separate the two
injection steps: First, complete L1's injection into L2 (i.e., enter L2,
injecting to it the idt-vectoring-info); Second, after entry into L2 succeeds
and it exits back to L0, exit to L1 with the EXIT_REASON_EXTERNAL_INTERRUPT.
Most of this is already in the code - the only change we need is to remain
in L2 (and not exit to L1) in this case.
Note that the previous patch ensures (by using KVM_REQ_IMMEDIATE_EXIT) that
although we do enter L2 first, it will exit immediately after processing its
injection, allowing us to promptly inject to L1.
Note how we test vmcs12->idt_vectoring_info_field; This isn't really the
vmcs12 value (we haven't exited to L1 yet, so vmcs12 hasn't been updated),
but rather the place we save, at the end of vmx_vcpu_run, the vmcs02 value
of this field. This was explained in patch 25 ("Correct handling of idt
vectoring info") of the original nVMX patch series.
Thanks to Dave Allan and to Federico Simoncelli for reporting this bug,
to Abel Gordon for helping me figure out the solution, and to Avi Kivity
for helping to improve it.
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch adds a new vcpu->requests bit, KVM_REQ_IMMEDIATE_EXIT.
This bit requests that when next entering the guest, we should run it only
for as little as possible, and exit again.
We use this new option in nested VMX: When L1 launches L2, but L0 wishes L1
to continue running so it can inject an event to it, we unfortunately cannot
just pretend to have run L2 for a little while - We must really launch L2,
otherwise certain one-off vmcs12 parameters (namely, L1 injection into L2)
will be lost. So the existing code runs L2 in this case.
But L2 could potentially run for a long time until it exits, and the
injection into L1 will be delayed. The new KVM_REQ_IMMEDIATE_EXIT allows us
to request that L2 will be entered, as necessary, but will exit as soon as
possible after entry.
Our implementation of this request uses smp_send_reschedule() to send a
self-IPI, with interrupts disabled. The interrupts remain disabled until the
guest is entered, and then, after the entry is complete (often including
processing an injection and jumping to the relevant handler), the physical
interrupt is noticed and causes an exit.
On recent Intel processors, we could have achieved the same goal by using
MTF instead of a self-IPI. Another technique worth considering in the future
is to use VM_EXIT_ACK_INTR_ON_EXIT and a highest-priority vector IPI - to
slightly improve performance by avoiding the useless interrupt handler
which ends up being called when smp_send_reschedule() is used.
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
RC6 fails again.
> I found my system freeze mostly during starting up X and KDE. Sometimes it
> works for some minutes, sometimes it freezes immediatly. When the freeze
> happens, everything is dead (even the reset button does not work, I need to
> power cycle).
> I disabled RC6, and my system runs wonderfully.
> The system is a Z68 Pro board with Sandybridge i5-2500K processor, 8
> GB of RAM and UEFI firmware.
Reported-by: Kai Krakow <hurikhan77@gmail.com>
Signed-off-by: Keith Packard <keithp@keithp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Semaphores still cause problems on some machines:
> From Udo Steinberg:
>
> With Linux-3.2-rc6 I'm frequently seeing GPU hangs when large amounts of
> text scroll in an xterm, such as when extracting a tar archive. Such as this
> one (note the timestamps):
>
> I can reproduce it fairly easily with something
> as simple as:
>
> while true; do dmesg; done
This patch turns them off on SNB while leaving them on for IVB.
Reported-by: Udo Steinberg <udo@hypervisor.org>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Eugeni Dodonov <eugeni@dodonov.net>
Signed-off-by: Keith Packard <keithp@keithp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'kvm-updates/3.2' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: PPC: e500: include linux/export.h
KVM: PPC: fix kvmppc_start_thread() for CONFIG_SMP=N
KVM: PPC: protect use of kvmppc_h_pr
KVM: PPC: move compute_tlbie_rb to book3s_64 common header
KVM: Don't automatically expose the TSC deadline timer in cpuid
KVM: Device assignment permission checks
KVM: Remove ability to assign a device without iommu support
KVM: x86: Prevent starting PIT timers in the absence of irqchip support
Bruce Fields notes that commit 778fc546f7 ("locks: fix tracking of
inprogress lease breaks") introduced a possible error pointer
dereference on failure to allocate memory. locks_conflict() will
dereference the passed-in new lease lock structure that may be an error pointer.
This means an open (without O_NONBLOCK set) on a file with a lease
applied (generally only done when Samba or nfsd (with v4) is running)
could crash if a kmalloc() fails.
So instead of playing games with IS_ERROR() all over the place, just
check the allocation failure early. That makes the code more
straightforward, and avoids this possible bad pointer dereference.
Based-on-patch-by: J. Bruce Fields <bfields@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>