Changes include:
- Using the common sysreg definitions between KVM and arm64
- Improved hyp-stub implementation with support for kexec and kdump on the 32-bit side
- Proper PMU exception handling
- Performance improvements of our GIC handling
- Support for irqchip in userspace with in-kernel arch-timers and PMU support
- A fix for a race condition in our PSCI code
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJY/IasAAoJEEtpOizt6ddyd7gH/2N3BIMxi/Uqigx0e0byA43s
f+8gNq8A71VBTERGW2l9QP1/AZAXpQYNWdWmN2jn+91x2yoVL7AT00gEsliSLEZv
tqZaTGFXKi1vNihYrxEWm1mfVNzhRrnbW6vjLrO4J5Advq7T3OWhNuVt2BLTxz3Y
h0iqOWNVrUD9h3QSBFH8tz7yXhguDTSppAcXbE0tACdRu4vN50wqEWokHJG5TsMG
Tl3KYWrcc3YCKlAJGuJi7t5rMrXk+g1q6HnxlIN6OSk0POC2Vmw9/Gigtltj1Qwh
ZEAwsnka/U8ak8WaWeZa3EsGTSFSoAk/+pKv2FB8mFN+uOmWDqVlEiol4dW49AY=
=mEOk
-----END PGP SIGNATURE-----
Merge tag 'kvm-arm-for-v4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/ARM Changes for v4.12.
Changes include:
- Using the common sysreg definitions between KVM and arm64
- Improved hyp-stub implementation with support for kexec and kdump on the 32-bit side
- Proper PMU exception handling
- Performance improvements of our GIC handling
- Support for irqchip in userspace with in-kernel arch-timers and PMU support
- A fix for a race condition in our PSCI code
Conflicts:
Documentation/virtual/kvm/api.txt
include/uapi/linux/kvm.h
According to the Intel SDM, "Certain exceptions have priority over VM
exits. These include invalid-opcode exceptions, faults based on
privilege level*, and general-protection exceptions that are based on
checking I/O permission bits in the task-state segment (TSS)."
There is no need to check for faulting conditions that the hardware
has already checked.
* These include faults generated by attempts to execute, in
virtual-8086 mode, privileged instructions that are not recognized
in that mode.
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
On AMD, the effect of set_nmi_mask called by emulate_iret_real and em_rsm
on hflags is reverted later on in x86_emulate_instruction where hflags are
overwritten with ctxt->emul_flags (the kvm_set_hflags call). This manifests
as a hang when rebooting Windows VMs with QEMU, OVMF, and >1 vcpu.
Instead of trying to merge ctxt->emul_flags into vcpu->arch.hflags after
an instruction is emulated, this commit deletes emul_flags altogether and
makes the emulator access vcpu->arch.hflags using two new accessors. This
way all changes, on the emulator side as well as in functions called from
the emulator and accessing vcpu state with emul_to_vcpu, are preserved.
More details on the bug and its manifestation with Windows and OVMF:
It's a KVM bug in the interaction between SMI/SMM and NMI, specific to AMD.
I believe that the SMM part explains why we started seeing this only with
OVMF.
KVM masks and unmasks NMI when entering and leaving SMM. When KVM emulates
the RSM instruction in em_rsm, the set_nmi_mask call doesn't stick because
later on in x86_emulate_instruction we overwrite arch.hflags with
ctxt->emul_flags, effectively reverting the effect of the set_nmi_mask call.
The AMD-specific hflag of interest here is HF_NMI_MASK.
When rebooting the system, Windows sends an NMI IPI to all but the current
cpu to shut them down. Only after all of them are parked in HLT will the
initiating cpu finish the restart. If NMI is masked, other cpus never get
the memo and the initiating cpu spins forever, waiting for
hal!HalpInterruptProcessorsStarted to drop. That's the symptom we observe.
Fixes: a584539b24 ("KVM: x86: pass the whole hflags field to emulator and back")
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_make_all_requests() provides a synchronization that waits until all
kicked VCPUs have acknowledged the kick. This is important for
KVM_REQ_MMU_RELOAD as it prevents freeing while lockless paging is
underway.
This patch adds the synchronization property into all requests that are
currently being used with kvm_make_all_requests() in order to preserve
the current behavior and only introduce a new framework. Removing it
from requests where it is not necessary is left for future patches.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
No need to kick a VCPU that we have just woken up.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_vcpu_kick() must issue a general memory barrier prior to reading
vcpu->mode in order to ensure correctness of the mutual-exclusion
memory barrier pattern used with vcpu->requests. While the cmpxchg
called from kvm_vcpu_kick():
kvm_vcpu_kick
kvm_arch_vcpu_should_kick
kvm_vcpu_exiting_guest_mode
cmpxchg
implies general memory barriers before and after the operation, that
implication is only valid when cmpxchg succeeds. We need an explicit
barrier for when it fails, otherwise a VCPU thread on its entry path
that reads zero for vcpu->requests does not exclude the possibility
the requesting thread sees !IN_GUEST_MODE when it reads vcpu->mode.
kvm_make_all_cpus_request already had a barrier, so we remove it, as
now it would be redundant.
Signed-off-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We want to have kvm_make_all_cpus_request() to be an optmized version of
kvm_for_each_vcpu(i, vcpu, kvm) {
kvm_make_request(vcpu, request);
kvm_vcpu_kick(vcpu);
}
and kvm_vcpu_kick() wakes up the target vcpu. We know which requests do
not need the wake up and use it to optimize the loop.
Thanks to that, this patch doesn't change the behavior of current users
(the all don't need the wake up) and only prepares for future where the
wake up is going to be needed.
I think that most requests do not need the wake up, so we would flip the
bit then.
Later on, kvm_make_request() will take care of kicking too, using this
bit to make the decision whether to kick or not.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Some operations must ensure that the guest is not running with stale
data, but if the guest is halted, then the update can wait until another
event happens. kvm_make_all_requests() currently doesn't wake up, so we
can mark all requests used with it.
First 8 bits were arbitrarily reserved for request numbers.
Most uses of requests have the request type as a constant, so a compiler
will optimize the '&'.
An alternative would be to have an inline function that would return
whether the request needs a wake-up or not, but I like this one better
even though it might produce worse assembly.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The #ifndef was protecting a missing halt_wakeup stat, but that is no
longer necessary.
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Users were expected to use kvm_check_request() for testing and clearing,
but request have expanded their use since then and some users want to
only test or do a faster clear.
Make sure that requests are not directly accessed with bit operations.
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- Detect all function codes for KMA and export the features
for use in the cpu model
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)
iQIcBAABAgAGBQJZAJMcAAoJEBF7vIC1phx8KhYP/3mLY9I/SIJDNDge2Jyu59Mf
8VH6UhwViVV2xq0IJvQFuMzlJZ7CoIrdPGTg+mCkzl14fseJOXsS70H4S+stq/nQ
CtMPe5SHz5TkKa4OMcFuR4dLk4bWNt+eNwsfe8fy5FAihTSuQXslOAmE7f7kWIWn
kRzzAoK595AtCnpdO2tqnJHLCby9OxLVAfg58vCnILKrb7PfU8pK5QVQ9LJPNxLX
L0Sf0jWkv+cAGlxD4Z5C/So1YRYjVsHDfQUaExQNiL4jlYJTuOOQ9jD2dZ1SPfWF
oImEeIc9ykDaLEG6fsBvqOWPMSVmA4ErppXn9Zq11NfUCxVBHI1zerJzfI5RrU7w
+IkoCHkOmlzUY/OgixejcVyh48iMrmgTbpyI/idhgGNgCHjZ5E2AcFNfxJW4n7Sj
XvuUvpi9RLddvySdAnYVE+y+KJgPxc33Rkw1+g02YrkQxvT2JVmGA9sgFA/wN5zM
rNtwXpi9lY8uza4/AGUCNN3+lmZW/cMCWA33oApIjVogFwJikafrpqNlNv3clkiw
EZKgaKnPlSCGJey+5SNxNGcPnip9OT/zIbRtpyFvLmeaYaABiVIJ2unSfdPaAyMn
zW6grB+ZzSoKnN514k5rw9smEYUOYYtnVkYEdTtq6LBnRRptR1S/bVkQt2v1aTRn
ZP3nJm4CZ+SUjioTWHon
=elz2
-----END PGP SIGNATURE-----
Merge tag 'kvm-s390-next-4.12-3' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
KVM: s390: MSA8 feature for guests
- Detect all function codes for KMA and export the features
for use in the cpu model
msa6 and msa7 require no changes.
msa8 adds kma instruction and feature area.
Signed-off-by: Jason J. Herne <jjherne@linux.vnet.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Provide a kma instruction definition for use by callers of __cpacf_query.
Signed-off-by: Jason J. Herne <jjherne@linux.vnet.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
The new KMA instruction requires unique parameters. Update __cpacf_query to
generate a compatible assembler instruction.
Signed-off-by: Jason J. Herne <jjherne@linux.vnet.ibm.com>
Acked-by: Harald Freudenberger <freude@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
The disablement of interrupts at KVM_SET_CLOCK/KVM_GET_CLOCK
attempts to disable software suspend from causing "non atomic behaviour" of
the operation:
Add a helper function to compute the kernel time and convert nanoseconds
back to CPU specific cycles. Note that these must not be called in preemptible
context, as that would mean the kernel could enter software suspend state,
which would cause non-atomic operation.
However, assume the kernel can enter software suspend at the following 2 points:
ktime_get_ts(&ts);
1.
hypothetical_ktime_get_ts(&ts)
monotonic_to_bootbased(&ts);
2.
monotonic_to_bootbased() should be correct relative to a ktime_get_ts(&ts)
performed after point 1 (that is after resuming from software suspend),
hypothetical_ktime_get_ts()
Therefore it is also correct for the ktime_get_ts(&ts) before point 1,
which is
ktime_get_ts(&ts) = hypothetical_ktime_get_ts(&ts) + time-to-execute-suspend-code
Note CLOCK_MONOTONIC does not count during suspension.
So remove the irq disablement, which causes the following warning on
-RT kernels:
With this reasoning, and the -RT bug that the irq disablement causes
(because spin_lock is now a sleeping lock), remove the IRQ protection as it
causes:
[ 1064.668109] in_atomic(): 0, irqs_disabled(): 1, pid: 15296, name:m
[ 1064.668110] INFO: lockdep is turned off.
[ 1064.668110] irq event stamp: 0
[ 1064.668112] hardirqs last enabled at (0): [< (null)>] )
[ 1064.668116] hardirqs last disabled at (0): [] c0
[ 1064.668118] softirqs last enabled at (0): [] c0
[ 1064.668118] softirqs last disabled at (0): [< (null)>] )
[ 1064.668121] CPU: 13 PID: 15296 Comm: qemu-kvm Not tainted 3.10.0-1
[ 1064.668121] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 5
[ 1064.668123] ffff8c1796b88000 00000000afe7344c ffff8c179abf3c68 f3
[ 1064.668125] ffff8c179abf3c90 ffffffff930ccb3d ffff8c1b992b3610 f0
[ 1064.668126] 00007ffc1a26fbc0 ffff8c179abf3cb0 ffffffff9375f694 f0
[ 1064.668126] Call Trace:
[ 1064.668132] [] dump_stack+0x19/0x1b
[ 1064.668135] [] __might_sleep+0x12d/0x1f0
[ 1064.668138] [] rt_spin_lock+0x24/0x60
[ 1064.668155] [] __get_kvmclock_ns+0x36/0x110 [k]
[ 1064.668159] [] ? futex_wait_queue_me+0x103/0x10
[ 1064.668171] [] kvm_arch_vm_ioctl+0xa2/0xd70 [k]
[ 1064.668173] [] ? futex_wait+0x1ac/0x2a0
v2: notice get_kvmclock_ns with the same problem (Pankaj).
v3: remove useless helper function (Pankaj).
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Guests that are heavy on futexes end up IPI'ing each other a lot. That
can lead to significant slowdowns and latency increase for those guests
when running within KVM.
If only a single guest is needed on a host, we have a lot of spare host
CPU time we can throw at the problem. Modern CPUs implement a feature
called "MWAIT" which allows guests to wake up sleeping remote CPUs without
an IPI - thus without an exit - at the expense of never going out of guest
context.
The decision whether this is something sensible to use should be up to the
VM admin, so to user space. We can however allow MWAIT execution on systems
that support it properly hardware wise.
This patch adds a CAP to user space and a KVM cpuid leaf to indicate
availability of native MWAIT execution. With that enabled, the worst a
guest can do is waste as many cycles as a "jmp ." would do, so it's not
a privilege problem.
We consciously do *not* expose the feature in our CPUID bitmap, as most
people will want to benefit from sleeping vCPUs to allow for over commit.
Reported-by: "Gabriel L. Somlo" <gsomlo@gmail.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
[agraf: fix amd, change commit message]
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Hardware support for faulting on the cpuid instruction is not required to
emulate it, because cpuid triggers a VM exit anyways. KVM handles the relevant
MSRs (MSR_PLATFORM_INFO and MSR_MISC_FEATURES_ENABLE) and upon a
cpuid-induced VM exit checks the cpuid faulting state and the CPL.
kvm_require_cpl is even kind enough to inject the GP fault for us.
Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Reviewed-by: David Matlack <dmatlack@google.com>
[Return "1" from kvm_emulate_cpuid, it's not void. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- detect and use the keyless subset mode (guests without
storage keys)
- fix vSIE support for sdnxc
- fix machine check data for guarded storage
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)
iQIcBAABAgAGBQJY+dkNAAoJEBF7vIC1phx8HC4P+gJtKOhNT9+mAkI3ZEuIH+UY
OB+2XIkvJpw1F47ZpL9dReEUtOYdF+r0sBmeel+zWjzfwLFBrZqgVRLG18St1JkQ
i4jGkW7oQyZ7Lc9fhZA5hcRpaN8zN571+wAiYPNtg13oTzsm2N1gmMVBKVC6LMWi
aoNGornkow9g6VhQiA9yl5ZHtbWEHT2hTE7CmUKPSAqe/SsdTSZ6svSCzWeRsUmT
+/bmgZuy07Uol6N99xQWGWtYSKQML32THA/b14i14GEy2c7dUnMyE3PQ0jnOi4zd
Jg9xr2+VOgcZvubhXrslVOQr34qZkAkuCWFC8hqgc6LYJjKXimW/9MCOVDqsLHD8
gIFXVHaIX30HwUEK4A3pDByBRVYyZ5RtdB/LcIoslbYathig+JYaXEPw2PHmEi2K
IwUK+dBJpQUpjh2djZBxmC2zMFjzDqQ0lB1y1UkZUeHFVTYSzsi/TI+ovrg6rNJ5
WQLJIOBY0jNEWatM5ZNUfXdiCO8na0bVJlZU+YUoVFXu0DNmotf8gwfuqj3V2j7v
JoHGGpjZyw6wwi0DPGo2tJ7h78s5psRGJIuBmRl9FaTn8KavZBLE+/s7TNO+SOTH
Q08Qvowoz7eVZlKWtF3XG6FgrptNKpXO3YPj/c+kHW+b39JUr6t/1uoKWw7PrP8d
IIXwghvNwgq8hWgt+zD6
=Z5HR
-----END PGP SIGNATURE-----
Merge tag 'kvm-s390-next-4.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
KVM: s390: Guarded storage fixup and keyless subset mode
- detect and use the keyless subset mode (guests without
storage keys)
- fix vSIE support for sdnxc
- fix machine check data for guarded storage
vmm_exclusive=0 leads to KVM setting X86_CR4_VMXE always and calling
VMXON only when the vcpu is loaded. X86_CR4_VMXE is used as an
indication in cpu_emergency_vmxoff() (called on kdump) if VMXOFF has to be
called. This is obviously not the case if both are used independtly.
Calling VMXOFF without a previous VMXON will result in an exception.
In addition, X86_CR4_VMXE is used as a mean to test if VMX is already in
use by another VMM in hardware_enable(). So there can't really be
co-existance. If the other VMM is prepared for co-existance and does a
similar check, only one VMM can exist. If the other VMM is not prepared
and blindly sets/clears X86_CR4_VMXE, we will get inconsistencies with
X86_CR4_VMXE.
As we also had bug reports related to clearing of vmcs with vmm_exclusive=0
this seems to be pretty much untested. So let's better drop it.
While at it, directly move setting/clearing X86_CR4_VMXE into
kvm_cpu_vmxon/off.
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If the KSS facility is available on the machine, we also make it
available for our KVM guests.
The KSS facility bypasses storage key management as long as the guest
does not issue a related instruction. When that happens, the control is
returned to the host, which has to turn off KSS for a guest vcpu
before retrying the instruction.
Signed-off-by: Corey S. McQuay <csmcquay@linux.vnet.ibm.com>
Signed-off-by: Farhan Ali <alifm@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
When entering the hyp stub implemented in the idmap, we try to
be mindful of the fact that we could be running a Thumb-2 kernel
by adding 1 to the address we compute. Unfortunately, the assembler
also knows about this trick, and has already generated an address
that has bit 0 set in the litteral pool.
Our superfluous correction ends up confusing the CPU entierely,
as we now branch to the stub in ARM mode instead of Thumb, and on
a possibly unaligned address for good measure. From that point,
nothing really good happens.
The obvious fix in to remove this stupid target PC correction.
Fixes: 6bebcecb6c ("ARM: KVM: Allow the main HYP code to use the init hyp stub implementation")
Reported-by: Christoffer Dall <cdall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <cdall@linaro.org>
The assembler defaults to emiting the short form of ADR, leading
to an out-of-range immediate. Using the wide version solves this
issue.
Fixes: bc845e4fbb ("ARM: KVM: Implement HVC_RESET_VECTORS stub hypercall in the init code")
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <cdall@linaro.org>
According to the PowerISA 2.07, mtspr and mfspr should not always
generate an illegal instruction exception when being used with an
undefined SPR, but rather treat the instruction as a NOP or inject a
privilege exception in some cases, too - depending on the SPR number.
Also turn the printk here into a ratelimited print statement, so that
the guest can not flood the dmesg log of the host by issueing lots of
illegal mtspr/mfspr instruction here.
Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
and H_STUFF_TCE requests targeted an IOMMU TCE table used for VFIO
without passing them to user space which saves time on switching
to user space and back.
This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
KVM tries to handle a TCE request in the real mode, if failed
it passes the request to the virtual mode to complete the operation.
If it a virtual mode handler fails, the request is passed to
the user space; this is not expected to happen though.
To avoid dealing with page use counters (which is tricky in real mode),
this only accelerates SPAPR TCE IOMMU v2 clients which are required
to pre-register the userspace memory. The very first TCE request will
be handled in the VFIO SPAPR TCE driver anyway as the userspace view
of the TCE table (iommu_table::it_userspace) is not allocated till
the very first mapping happens and we cannot call vmalloc in real mode.
If we fail to update a hardware IOMMU table unexpected reason, we just
clear it and move on as there is nothing really we can do about it -
for example, if we hot plug a VFIO device to a guest, existing TCE tables
will be mirrored automatically to the hardware and there is no interface
to report to the guest about possible failures.
This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
and associates a physical IOMMU table with the SPAPR TCE table (which
is a guest view of the hardware IOMMU table). The iommu_table object
is cached and referenced so we do not have to look up for it in real mode.
This does not implement the UNSET counterpart as there is no use for it -
once the acceleration is enabled, the existing userspace won't
disable it unless a VFIO container is destroyed; this adds necessary
cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
space.
This adds real mode version of WARN_ON_ONCE() as the generic version
causes problems with rcu_sched. Since we testing what vmalloc_to_phys()
returns in the code, this also adds a check for already existing
vmalloc_to_phys() call in kvmppc_rm_h_put_tce_indirect().
This finally makes use of vfio_external_user_iommu_id() which was
introduced quite some time ago and was considered for removal.
Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This reworks helpers for checking TCE update parameters in way they
can be used in KVM.
This should cause no behavioral change.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
VFIO on sPAPR already implements guest memory pre-registration
when the entire guest RAM gets pinned. This can be used to translate
the physical address of a guest page containing the TCE list
from H_PUT_TCE_INDIRECT.
This makes use of the pre-registrered memory API to access TCE list
pages in order to avoid unnecessary locking on the KVM memory
reverse map as we know that all of guest memory is pinned and
we have a flat array mapping GPA to HPA which makes it simpler and
quicker to index into that array (even with looking up the
kernel page tables in vmalloc_to_phys) than it is to find the memslot,
lock the rmap entry, look up the user page tables, and unlock the rmap
entry. Note that the rmap pointer is initialized to NULL
where declared (not in this patch).
If a requested chunk of memory has not been preregistered, this will
fall back to non-preregistered case and lock rmap.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The guest view TCE tables are per KVM anyway (not per VCPU) so pass kvm*
there. This will be used in the following patches where we will be
attaching VFIO containers to LIOBNs via ioctl() to KVM (rather than
to VCPU).
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
It does not make much sense to have KVM in book3s-64 and
not to have IOMMU bits for PCI pass through support as it costs little
and allows VFIO to function on book3s KVM.
Having IOMMU_API always enabled makes it unnecessary to have a lot of
"#ifdef IOMMU_API" in arch/powerpc/kvm/book3s_64_vio*. With those
ifdef's we could have only user space emulated devices accelerated
(but not VFIO) which do not seem to be very useful.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds a capability number for in-kernel support for VFIO on
SPAPR platform.
The capability will tell the user space whether in-kernel handlers of
H_PUT_TCE can handle VFIO-targeted requests or not. If not, the user space
must not attempt allocating a TCE table in the host kernel via
the KVM_CREATE_SPAPR_TCE KVM ioctl because in that case TCE requests
will not be passed to the user space which is desired action in
the situation like that.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This merges in the commits in the topic/ppc-kvm branch of the powerpc
tree to get the changes to arch/powerpc which subsequent patches will
rely on.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
At the moment the userspace can request a table smaller than a page size
and this value will be stored as kvmppc_spapr_tce_table::size.
However the actual allocated size will still be aligned to the system
page size as alloc_page() is used there.
This aligns the table size up to the system page size. It should not
change the existing behaviour but when in-kernel TCE acceleration patchset
reaches the upstream kernel, this will allow small TCE tables be
accelerated as well: PCI IODA iommu_table allocator already aligns
the size and, without this patch, an IOMMU group won't attach to LIOBN
due to the mismatching table size.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
PR KVM page fault handler performs eaddr to pte translation for a guest,
however kvmppc_mmu_book3s_64_xlate() does not preserve WIMG bits
(storage control) in the kvmppc_pte struct. If PR KVM is running as
a second level guest under HV KVM, and PR KVM tries inserting HPT entry,
this fails in HV KVM if it already has this mapping.
This preserves WIMG bits between kvmppc_mmu_book3s_64_xlate() and
kvmppc_mmu_map_page().
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
At the moment kvmppc_mmu_map_page() returns -1 if
mmu_hash_ops.hpte_insert() fails for any reason so the page fault handler
resumes the guest and it faults on the same address again.
This adds distinction to kvmppc_mmu_map_page() to return -EIO if
mmu_hash_ops.hpte_insert() failed for a reason other than full pteg.
At the moment only pSeries_lpar_hpte_insert() returns -2 if
plpar_pte_enter() failed with a code other than H_PTEG_FULL.
Other mmu_hash_ops.hpte_insert() instances can only fail with
-1 "full pteg".
With this change, if PR KVM fails to update HPT, it can signal
the userspace about this instead of returning to guest and having
the very same page fault over and over again.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
@is_mmio has never been used since introduction in
commit 2f4cf5e42d ("Add book3s.c") from 2009.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
* A multiplication for the size determination of a memory allocation
indicated that an array data structure should be processed.
Thus use the corresponding function "kcalloc".
This issue was detected by using the Coccinelle software.
* Replace the specification of a data type by a pointer dereference
to make the corresponding size determination a bit safer according to
the Linux coding style convention.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Add a jump target so that a bit of exception handling can be better reused
at the end of this function.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
For completeness, this adds emulation of the lfiwax and lfiwzx
instructions. With this, all floating-point load and store instructions
as of Power ISA V2.07 are emulated.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds emulation for the following integer loads and stores,
thus enabling them to be used in a guest for accessing emulated
MMIO locations.
- lhaux
- lwaux
- lwzux
- ldu
- lwa
- stdux
- stwux
- stdu
- ldbrx
- stdbrx
Previously, most of these would cause an emulation failure exit to
userspace, though ldu and lwa got treated incorrectly as ld, and
stdu got treated incorrectly as std.
This also tidies up some of the formatting and updates the comment
listing instructions that still need to be implemented.
With this, all integer loads and stores that are defined in the Power
ISA v2.07 are emulated, except for those that are permitted to trap
when used on cache-inhibited or write-through mappings (and which do
in fact trap on POWER8), that is, lmw/stmw, lswi/stswi, lswx/stswx,
lq/stq, and l[bhwdq]arx/st[bhwdq]cx.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds missing stdx emulation for emulated MMIO accesses by KVM
guests. This allows the Mellanox mlx5_core driver from recent kernels
to work when MMIO emulation is enforced by userspace.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This patch provides the MMIO load/store emulation for instructions
of 'double & vector unsigned char & vector signed char & vector
unsigned short & vector signed short & vector unsigned int & vector
signed int & vector double '.
The instructions that this adds emulation for are:
- ldx, ldux, lwax,
- lfs, lfsx, lfsu, lfsux, lfd, lfdx, lfdu, lfdux,
- stfs, stfsx, stfsu, stfsux, stfd, stfdx, stfdu, stfdux, stfiwx,
- lxsdx, lxsspx, lxsiwax, lxsiwzx, lxvd2x, lxvw4x, lxvdsx,
- stxsdx, stxsspx, stxsiwx, stxvd2x, stxvw4x
[paulus@ozlabs.org - some cleanups, fixes and rework, make it
compile for Book E, fix build when PR KVM is built in]
Signed-off-by: Bin Lu <lblulb@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This provides functions that can be used for generating interrupts
indicating that a given functional unit (floating point, vector, or
VSX) is unavailable. These functions will be used in instruction
emulation code.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
When iterating over the used LRs, be careful not to try to access
an unused LR, or even an unimplemented one if you're unlucky...
Reviewed-by: Christoffer Dall <cdall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <cdall@linaro.org>
When emulating a GICv2-on-GICv3, special care must be taken to only
save/restore VMCR_EL2 when ICC_SRE_EL1.SRE is cleared. Otherwise,
all Group-0 interrupts end-up being delivered as FIQ, which is
probably not what the guest expects, as demonstrated here with
an unhappy EFI:
FIQ Exception at 0x000000013BD21CC4
This means that we cannot perform the load/put trick when dealing
with VMCR_EL2 (because the host has SRE set), and we have to deal
with it in the world-switch.
Fortunately, this is not the most common case (modern guests should
be able to deal with GICv3 directly), and the performance is not worse
than what it was before the VMCR optimization.
Reviewed-by: Christoffer Dall <cdall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <cdall@linaro.org>
Fix potential races in kvm_psci_vcpu_on() by taking the kvm->lock
mutex. In general, it's a bad idea to allow more than one PSCI_CPU_ON
to process the same target VCPU at the same time. One such problem
that may arise is that one PSCI_CPU_ON could be resetting the target
vcpu, which fills the entire sys_regs array with a temporary value
including the MPIDR register, while another looks up the VCPU based
on the MPIDR value, resulting in no target VCPU found. Resolves both
races found with the kvm-unit-tests/arm/psci unit test.
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Christoffer Dall <cdall@linaro.org>
Reported-by: Levente Kurusa <lkurusa@redhat.com>
Suggested-by: Christoffer Dall <cdall@linaro.org>
Signed-off-by: Andrew Jones <drjones@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Christoffer Dall <cdall@linaro.org>
I have introduced this bug when applying and simplifying Paolo's patch
as we agreed on the list. The original was "x &= ~y; if (z) x |= y;".
Here is the story of a bad workflow:
A maintainer was already testing with the intended change, but it was
applied only to a testing repo on a different machine. When the time
to push tested patches to kvm/next came, he realized that this change
was missing and quickly added it to the maintenance repo, didn't test
again (because the change is trivial, right), and pushed the world to
fire.
Fixes: ae1e2d1082 ("kvm: nVMX: support EPT accessed/dirty bits")
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>