Commit Graph

9047 Commits

Author SHA1 Message Date
Sean Christopherson
0bba8fc24c KVM: SVM: Make an event request if INIT or SIPI is pending when GIF is set
Set KVM_REQ_EVENT if INIT or SIPI is pending when the guest enables GIF.
INIT in particular is blocked when GIF=0 and needs to be processed when
GIF is toggled to '1'.  This bug has been masked by (a) KVM calling
->check_nested_events() in the core run loop and (b) hypervisors toggling
GIF from 0=>1 only when entering guest mode (L1 entering L2).
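
A rough sketch of the idea (simplified; other GIF bookkeeping in
svm_set_gif() omitted, and the INIT/SIPI helper is the one introduced
earlier in this series):

  void svm_set_gif(struct vcpu_svm *svm, bool value)
  {
          if (value) {
                  enable_gif(svm);
                  /* INIT/SIPI are blocked while GIF=0; now that GIF=1,
                   * force a re-evaluation of pending events. */
                  if (kvm_apic_has_pending_init_or_sipi(&svm->vcpu))
                          kvm_make_request(KVM_REQ_EVENT, &svm->vcpu);
          } else {
                  disable_gif(svm);
          }
  }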

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220921003201.1441511-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:37:19 -04:00
Paolo Bonzini
bf7f9352af KVM: x86: lapic does not have to process INIT if it is blocked
Do not return true from kvm_vcpu_has_events() if the vCPU isn't going to
immediately process a pending INIT/SIPI.  INIT/SIPI shouldn't be treated
as wake events if they are blocked.
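
A sketch of the resulting wake-event check in kvm_vcpu_has_events()
(using the INIT/SIPI helpers named elsewhere in this series):

  /* A pending INIT/SIPI only makes the vCPU runnable if the vCPU can
   * actually take it right now. */
  if (kvm_apic_has_pending_init_or_sipi(vcpu) &&
      kvm_apic_init_sipi_allowed(vcpu))
          return true;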

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[sean: rebase onto refactored INIT/SIPI helpers, massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220921003201.1441511-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:37:19 -04:00
Sean Christopherson
a61353acc5 KVM: x86: Rename kvm_apic_has_events() to make it INIT/SIPI specific
Rename kvm_apic_has_events() to kvm_apic_has_pending_init_or_sipi() so
that it's more obvious that "events" really just means "INIT or SIPI".

Opportunistically clean up a weirdly worded comment that referenced
kvm_apic_has_events() instead of kvm_apic_accept_events().

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220921003201.1441511-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:37:18 -04:00
Sean Christopherson
1b7a1b78d6 KVM: x86: Rename and expose helper to detect if INIT/SIPI are allowed
Rename and invert kvm_vcpu_latch_init() to kvm_apic_init_sipi_allowed()
so as to match the behavior of {interrupt,nmi,smi}_allowed(), and expose
the helper so that it can be used by kvm_vcpu_has_events() to determine
whether or not an INIT or SIPI is pending _and_ can be taken immediately.
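
A sketch of the renamed/inverted helper (assuming the existing is_smm()
and apic_init_signal_blocked hooks; not the exact diff):

  static inline bool kvm_apic_init_sipi_allowed(struct kvm_vcpu *vcpu)
  {
          return !is_smm(vcpu) &&
                 !static_call(kvm_x86_apic_init_signal_blocked)(vcpu);
  }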

Opportunistically replace usage of the "latch" terminology with "blocked"
and/or "allowed", again to align with KVM's terminology used for all other
event types.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220921003201.1441511-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:37:18 -04:00
Sean Christopherson
2ea89c7f7f KVM: nVMX: Make an event request when pending an MTF nested VM-Exit
Set KVM_REQ_EVENT when MTF becomes pending to ensure that KVM will run
through inject_pending_event() and thus vmx_check_nested_events() prior
to re-entering the guest.
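
Illustrative snippet of the change, at the point where KVM pends the MTF
(in vmx_update_emulated_instruction()):

  /* Pend the MTF VM-Exit and force a trip through
   * inject_pending_event() before re-entering the guest. */
  vmx->nested.mtf_pending = true;
  kvm_make_request(KVM_REQ_EVENT, vcpu);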

MTF currently works by virtue of KVM's hack that calls
kvm_check_nested_events() from kvm_vcpu_running(), but that hack will
be removed in the near future.  Until that call is removed, the patch
introduces no real functional change.

Fixes: 5ef8acbdd6 ("KVM: nVMX: Emulate MTF when performing instruction emulation")
Cc: stable@vger.kernel.org
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220921003201.1441511-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:37:18 -04:00
Paolo Bonzini
5b4ac1a1b7 KVM: x86: make vendor code check for all nested events
Interrupts, NMIs etc. sent while in guest mode are already handled
properly by the *_interrupt_allowed callbacks, but there are other events
specific to guest mode that can cause a vCPU to become runnable.

In the case of VMX there are two, the preemption timer and the
monitor trap.  The VMX preemption timer is already special cased via
the hv_timer_pending callback, but the purpose of the callback can be
easily extended to MTF or in fact any other event that can occur only
in guest mode.

Rename the callback and add an MTF check; kvm_arch_vcpu_runnable()
now can return true if an MTF is pending, without relying on
kvm_vcpu_running()'s call to kvm_check_nested_events().  Until that call
is removed, however, the patch introduces no functional change.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220921003201.1441511-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:37:17 -04:00
Sean Christopherson
40aaa5b6da KVM: x86: Allow force_emulation_prefix to be written without a reload
Allow force_emulation_prefix to be written by privileged userspace
without reloading KVM.  The param does not have any persistent effects
and is trivial to snapshot.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20220830231614.3580124-28-seanjc@google.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:03:12 -04:00
Sean Christopherson
e746c1f1b9 KVM: x86: Rename inject_pending_events() to kvm_check_and_inject_events()
Rename inject_pending_events() to kvm_check_and_inject_events() in order
to capture the fact that it handles more than just pending events, and to
(mostly) align with kvm_check_nested_events(), which omits the "inject"
for brevity.

Add a comment above kvm_check_and_inject_events() to provide a high-level
synopsis, and to document a virtualization hole (KVM erratum if you will)
that exists due to KVM not strictly tracking instruction boundaries with
respect to coincident instruction restarts and asynchronous events.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20220830231614.3580124-25-seanjc@google.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:03:11 -04:00
Sean Christopherson
65ec8f01be KVM: VMX: Update MTF and ICEBP comments to document KVM's subtle behavior
Document the oddities of ICEBP interception (trap-like #DB is intercepted
as a fault-like exception), and how using VMX's inner "skip" helper
deliberately bypasses the pending MTF and single-step #DB logic.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20220830231614.3580124-24-seanjc@google.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:03:11 -04:00
Sean Christopherson
7055fb1131 KVM: x86: Treat pending TRIPLE_FAULT requests as pending exceptions
Treat pending TRIPLE_FAULTS as pending exceptions.  A triple fault is an
exception for all intents and purposes, it's just not tracked as such
because there's no vector associated with the exception.  E.g. if userspace
were to set vcpu->request_interrupt_window while running L2 and L2 hit a
triple fault, a triple fault nested VM-Exit should be synthesized to L1
before exiting to userspace with KVM_EXIT_IRQ_WINDOW_OPEN.
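
A sketch of the resulting "is an exception pending" check (helper name
and exact fields are illustrative):

  static bool kvm_is_exception_pending(struct kvm_vcpu *vcpu)
  {
          return vcpu->arch.exception.pending ||
                 vcpu->arch.exception_vmexit.pending ||
                 kvm_test_request(KVM_REQ_TRIPLE_FAULT, vcpu);
  }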

Link: https://lore.kernel.org/all/YoVHAIGcFgJit1qp@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20220830231614.3580124-23-seanjc@google.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:03:11 -04:00
Sean Christopherson
7709aba8f7 KVM: x86: Morph pending exceptions to pending VM-Exits at queue time
Morph pending exceptions to pending VM-Exits (due to interception) when
the exception is queued instead of waiting until nested events are
checked at VM-Entry.  This fixes a longstanding bug where KVM fails to
handle an exception that occurs during delivery of a previous exception,
KVM (L0) and L1 both want to intercept the exception (e.g. #PF for shadow
paging), and KVM determines that the exception is in the guest's domain,
i.e. queues the new exception for L2.  Deferring the interception check
causes KVM to escalate various combinations of injected+pending exceptions
to double fault (#DF) without consulting L1's interception desires, and
ends up injecting a spurious #DF into L2.

KVM has fudged around the issue for #PF by special casing emulated #PF
injection for shadow paging, but the underlying issue is not unique to
shadow paging in L0, e.g. if KVM is intercepting #PF because the guest
has a smaller maxphyaddr and L1 (but not L0) is using shadow paging.
Other exceptions are affected as well, e.g. if KVM is intercepting #GP
for one of SVM's workarounds or for the VMware backdoor emulation stuff.
The other cases have gone unnoticed because the #DF is spurious if and
only if L1 resolves the exception, e.g. KVM's goofs go unnoticed if L1
would have injected #DF anyways.

The hack-a-fix has also led to ugly code, e.g. bailing from the emulator
if #PF injection forced a nested VM-Exit and the emulator finds itself
back in L1.  Allowing for direct-to-VM-Exit queueing also neatly solves
the async #PF in L2 mess; no need to set a magic flag and token, simply
queue a #PF nested VM-Exit.

Deal with event migration by flagging that a pending exception was queued
by userspace and check for interception at the next KVM_RUN, e.g. so that
KVM does the right thing regardless of the order in which userspace
restores nested state vs. event state.

When "getting" events from userspace, simply drop any pending excpetion
that is destined to be intercepted if there is also an injected exception
to be migrated.  Ideally, KVM would migrate both events, but that would
require new ABI, and practically speaking losing the event is unlikely to
be noticed, let alone fatal.  The injected exception is captured, RIP
still points at the original faulting instruction, etc...  So either the
injection on the target will trigger the same intercepted exception, or
the source of the intercepted exception was transient and/or
non-deterministic, thus dropping it is ok-ish.

Fixes: a04aead144 ("KVM: nSVM: fix running nested guests when npt=0")
Fixes: feaf0c7dc4 ("KVM: nVMX: Do not generate #DF if #PF happens during exception delivery into L2")
Cc: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20220830231614.3580124-22-seanjc@google.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:03:10 -04:00
Sean Christopherson
f43f8a3ba9 KVM: nVMX: Document priority of all known events on Intel CPUs
Add a gigantic comment above vmx_check_nested_events() to document the
priorities of all known events on Intel CPUs.  Intel's SDM doesn't
include VMX-specific events in its "Priority Among Concurrent Events",
which makes it painfully difficult to suss out the correct priority
between things like Monitor Trap Flag VM-Exits and pending #DBs.

Kudos to Jim Mattson for doing the hard work of collecting and
interpreting the priorities from various locations throughout the SDM
(because putting them all in one place in the SDM would be too easy).

Cc: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20220830231614.3580124-21-seanjc@google.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:03:10 -04:00
Sean Christopherson
2b384165f4 KVM: nVMX: Add a helper to identify low-priority #DB traps
Add a helper to identify "low"-priority #DB traps, i.e. trap-like #DBs
that aren't TSS T flag #DBs, and tweak the related code to operate on any
queued exception.  A future commit will separate exceptions that are
intercepted by L1, i.e. cause nested VM-Exit, from those that do NOT
trigger nested VM-Exit.  I.e. there will be multiple exception structs
and multiple invocations of the helpers.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20220830231614.3580124-20-seanjc@google.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:03:10 -04:00
Sean Christopherson
28360f8870 KVM: x86: Evaluate ability to inject SMI/NMI/IRQ after potential VM-Exit
Determine whether or not new events can be injected after checking nested
events.  If a VM-Exit occurred during nested event handling, any previous
event that needed re-injection is gone from KVM's perspective; the event
is captured in the vmc*12 VM-Exit information, but doesn't exist in terms
of what needs to be done for entry to L1.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20220830231614.3580124-19-seanjc@google.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:03:09 -04:00
Sean Christopherson
6c593b5276 KVM: x86: Hoist nested event checks above event injection logic
Perform nested event checks before re-injecting exceptions/events into
L2.  If a pending exception causes VM-Exit to L1, re-injecting events
into vmcs02 is premature and wasted effort.  Take care to ensure events
that need to be re-injected are still re-injected if checking for nested
events "fails", i.e. if KVM needs to force an immediate entry+exit to
complete the to-be-re-injected event.

Keep the "can_inject" logic the same for now; it too can be pushed below
the nested checks, but is a slightly riskier change (see past bugs about
events not being properly purged on nested VM-Exit).

Add and/or modify comments to better document the various interactions.
Of note is the comment regarding "blocking" previously injected NMIs and
IRQs if an exception is pending.  The old comment isn't wrong strictly
speaking, but it failed to capture the reason why the logic even exists.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20220830231614.3580124-18-seanjc@google.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:03:09 -04:00
Sean Christopherson
81601495c5 KVM: x86: Use kvm_queue_exception_e() to queue #DF
Queue #DF by recursing on kvm_multiple_exception() by way of
kvm_queue_exception_e() instead of open coding the behavior.  This will
allow KVM to Just Work when a future commit moves exception interception
checks (for L2 => L1) into kvm_multiple_exception().
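
I.e. the open-coded #DF queueing collapses to (illustrative):

  /* Queue the #DF by recursing through the normal exception path. */
  kvm_queue_exception_e(vcpu, DF_VECTOR, 0);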

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20220830231614.3580124-17-seanjc@google.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:03:08 -04:00
Sean Christopherson
72c14e00bd KVM: x86: Formalize blocking of nested pending exceptions
Capture nested_run_pending as block_pending_exceptions so that the logic
of why exceptions are blocked only needs to be documented once instead of
at every place that employs the logic.
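
A sketch of the pattern in *_check_nested_events() (variable name per the
changelog; surrounding logic elided):

  /* Document the "why" once: exceptions queued before the nested run
   * completes belong to the prior context and must stay blocked. */
  bool block_pending_exceptions = vmx->nested.nested_run_pending;

  if (vcpu->arch.exception.pending) {
          if (block_pending_exceptions)
                  return -EBUSY;
          /* ... forward or inject the exception ... */
  }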

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20220830231614.3580124-16-seanjc@google.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:03:08 -04:00
Sean Christopherson
d4963e319f KVM: x86: Make kvm_queued_exception a properly named, visible struct
Move the definition of "struct kvm_queued_exception" out of kvm_vcpu_arch
in anticipation of adding a second instance in kvm_vcpu_arch to handle
exceptions that occur when vectoring an injected exception and are
morphed to VM-Exit instead of leading to #DF.

Opportunistically take advantage of the churn to rename "nr" to "vector".
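
Roughly, the struct as it ends up (field list abbreviated, shown for
orientation only):

  struct kvm_queued_exception {
          bool pending;
          bool injected;
          bool has_error_code;
          u8 vector;              /* was "nr" */
          u32 error_code;
          unsigned long payload;
          bool has_payload;
  };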

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20220830231614.3580124-15-seanjc@google.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:03:08 -04:00
Sean Christopherson
6ad75c5c99 KVM: x86: Rename kvm_x86_ops.queue_exception to inject_exception
Rename the kvm_x86_ops hook for exception injection to better reflect
reality, and to align with pretty much every other related function name
in KVM.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20220830231614.3580124-14-seanjc@google.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:03:07 -04:00
Sean Christopherson
bfcb08a0b9 KVM: VMX: Inject #PF on ENCLS as "emulated" #PF
Treat #PFs that occur during emulation of ENCLS as, wait for it, emulated
page faults.  Practically speaking, this is a glorified nop as the
exception is never of the nested flavor, and it's extremely unlikely the
guest is relying on the side effect of an implicit INVLPG on the faulting
address.

Fixes: 70210c044b ("KVM: VMX: Add SGX ENCLS[ECREATE] handler to enforce CPUID restrictions")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20220830231614.3580124-13-seanjc@google.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:03:07 -04:00
Sean Christopherson
593a5c2e3c KVM: nVMX: Unconditionally clear mtf_pending on nested VM-Exit
Clear mtf_pending on nested VM-Exit instead of handling the clear on a
case-by-case basis in vmx_check_nested_events().  The pending MTF should
never survive nested VM-Exit, as it is a property of KVM's run of the
current L2, i.e. should never affect the next L2 run by L1.  In practice,
this is likely a nop as getting to L1 with nested_run_pending is
impossible, and KVM doesn't correctly handle morphing a pending exception
that occurs on a prior injected exception (need for re-injected exception
being the other case where MTF isn't cleared).  However, KVM will
hopefully soon correctly deal with a pending exception on top of an
injected exception.

Add a TODO to document that KVM has a priority inversion bug between
SMIs and MTF (and trap-like #DBs), and that KVM also doesn't properly
save/restore MTF across SMI/RSM.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20220830231614.3580124-12-seanjc@google.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:03:07 -04:00
Sean Christopherson
c2086eca86 KVM: nVMX: Ignore SIPI that arrives in L2 when vCPU is not in WFS
Fall through to handling other pending exception/events for L2 if SIPI
is pending while the CPU is not in Wait-for-SIPI.  KVM correctly ignores
the event, but incorrectly returns immediately, e.g. a SIPI coincident
with another event could lead to KVM incorrectly routing the event to L1
instead of L2.

Fixes: bf0cd88ce3 ("KVM: x86: emulate wait-for-SIPI and SIPI-VMExit")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20220830231614.3580124-11-seanjc@google.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:03:06 -04:00
Sean Christopherson
0701ec903e KVM: x86: Use DR7_GD macro instead of open coding check in emulator
Use DR7_GD in the emulator instead of open coding the check, and drop a
comically wrong comment.
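
I.e. the emulator's General Detect check becomes (illustrative):

  static bool check_dr7_gd(struct x86_emulate_ctxt *ctxt)
  {
          ulong dr7;

          ctxt->ops->get_dr(ctxt, 7, &dr7);

          /* #DB if DR7.GD (General Detect enable) is set. */
          return dr7 & DR7_GD;
  }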

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20220830231614.3580124-10-seanjc@google.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:03:06 -04:00
Sean Christopherson
5623f751bd KVM: x86: Treat #DBs from the emulator as fault-like (code and DR7.GD=1)
Add a dedicated "exception type" for #DBs, as #DBs can be fault-like or
trap-like depending on the sub-type of #DB, and effectively defer the
decision of what to do with the #DB to the caller.

For the emulator's two calls to exception_type(), treat the #DB as
fault-like, as the emulator handles only code breakpoint and general
detect #DBs, both of which are fault-like.

For event injection, which uses exception_type() to determine whether to
set EFLAGS.RF=1 on the stack, keep the current behavior of not setting
RF=1 for #DBs.  Intel and AMD explicitly state RF isn't set on code #DBs,
so exempting by failing the "== EXCPT_FAULT" check is correct.  The only
other fault-like #DB is General Detect, and despite Intel and AMD both
strongly implying (through omission) that General Detect #DBs should set
RF=1, hardware (multiple generations of both Intel and AMD), in fact does
not.  Through insider knowledge, extreme foresight, sheer dumb luck, or
some combination thereof, KVM correctly handled RF for General Detect #DBs.

Fixes: 38827dbd3f ("KVM: x86: Do not update EFLAGS on faulting emulation")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20220830231614.3580124-9-seanjc@google.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:03:06 -04:00
Sean Christopherson
b9d44f9091 KVM: nVMX: Prioritize TSS T-flag #DBs over Monitor Trap Flag
Service TSS T-flag #DBs prior to pending MTFs, as such #DBs are higher
priority than MTF.  KVM itself doesn't emulate TSS #DBs, and any such
exceptions injected from L1 will be handled by hardware (or morphed to
a fault-like exception if injection fails), but theoretically userspace
could pend a TSS T-flag #DB in conjunction with a pending MTF.

Note, there's no known use case this fixes, it's purely to be technically
correct with respect to Intel's SDM.

Cc: Oliver Upton <oupton@google.com>
Cc: Peter Shier <pshier@google.com>
Fixes: 5ef8acbdd6 ("KVM: nVMX: Emulate MTF when performing instruction emulation")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20220830231614.3580124-8-seanjc@google.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:03:05 -04:00
Sean Christopherson
8d178f4607 KVM: nVMX: Treat General Detect #DB (DR7.GD=1) as fault-like
Exclude General Detect #DBs, which have fault-like behavior but also have
a non-zero payload (DR6.BD=1), from nVMX's handling of pending debug
traps.  Opportunistically rewrite the comment to better document what is
being checked, i.e. "has a non-zero payload" vs. "has a payload", and to
call out the many caveats surrounding #DBs that KVM dodges one way or
another.

Cc: Oliver Upton <oupton@google.com>
Cc: Peter Shier <pshier@google.com>
Fixes: 684c0422da ("KVM: nVMX: Handle pending #DB when injecting INIT VM-exit")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20220830231614.3580124-7-seanjc@google.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:03:05 -04:00
Sean Christopherson
baf67ca8e5 KVM: x86: Suppress code #DBs on Intel if MOV/POP SS blocking is active
Suppress code breakpoints if MOV/POP SS blocking is active and the guest
CPU is Intel, i.e. if the guest thinks it's running on an Intel CPU.
Intel CPUs inhibit code #DBs when MOV/POP SS blocking is active, whereas
AMD (and its descendants) do not.
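
A sketch of the suppression check (assuming the existing
guest_cpuid_is_intel() and interrupt-shadow helpers):

  /* Intel inhibits code #DBs while MOV/POP SS blocking is active, AMD
   * does not, so only suppress them for "Intel" guests. */
  if (guest_cpuid_is_intel(vcpu) &&
      (static_call(kvm_x86_get_interrupt_shadow)(vcpu) &
       KVM_X86_SHADOW_INT_MOV_SS))
          return true;    /* code breakpoints are inhibited */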

Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20220830231614.3580124-6-seanjc@google.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:03:05 -04:00
Sean Christopherson
d500e1ed3d KVM: x86: Allow clearing RFLAGS.RF on forced emulation to test code #DBs
Extend force_emulation_prefix to an 'int' and use bit 1 as a flag to
indicate that KVM should clear RFLAGS.RF before emulating, e.g. to allow
tests to force emulation of code breakpoints in conjunction with MOV/POP
SS blocking, which is impossible without KVM intervention as VMX
unconditionally sets RFLAGS.RF on intercepted #UD.

Make the behavior controllable so that tests can also test RFLAGS.RF=1
(again in conjunction with code #DBs).

Note, clearing RFLAGS.RF won't create an infinite #DB loop as the guest's
IRET from the #DB handler will return to the instruction and not the
prefix, i.e. the restart won't force emulation.

Opportunistically convert the permissions to the preferred octal format.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20220830231614.3580124-5-seanjc@google.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:03:04 -04:00
Sean Christopherson
750f8fcb26 KVM: x86: Don't check for code breakpoints when emulating on exception
Don't check for code breakpoints during instruction emulation if the
emulation was triggered by exception interception.  Code breakpoints are
the highest priority fault-like exception, and KVM only emulates on
exceptions that are fault-like.  Thus, if hardware signaled a different
exception, then the vCPU is already past the stage of checking for
hardware breakpoints.

This is likely a glorified nop in terms of functionality, and serves more
as clarification, though it is technically an optimization.  Intel's SDM explicitly
states vmcs.GUEST_RFLAGS.RF on exception interception is the same as the
value that would have been saved on the stack had the exception not been
intercepted, i.e. will be '1' due to all fault-like exceptions setting RF
to '1'.  AMD says "guest state saved ... is the processor state as of the
moment the intercept triggers", but that begs the question, "when does
the intercept trigger?".

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20220830231614.3580124-4-seanjc@google.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:03:04 -04:00
Sean Christopherson
eba9799b5a KVM: VMX: Drop bits 31:16 when shoving exception error code into VMCS
Deliberately truncate the exception error code when shoving it into the
VMCS (VM-Entry field for vmcs01 and vmcs02, VM-Exit field for vmcs12).
Intel CPUs are incapable of handling 32-bit error codes and will never
generate an error code with bits 31:16, but userspace can provide an
arbitrary error code via KVM_SET_VCPU_EVENTS.  Failure to drop the bits
on exception injection results in failed VM-Entry, as VMX disallows
setting bits 31:16.  Setting the bits on VM-Exit would at best confuse
L1, and at worst induce a nested VM-Entry failure, e.g. if L1 decided to
reinject the exception back into L2.
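
The fix boils down to truncating before the VMCS write, e.g. for
VM-Entry (illustrative):

  /* Hardware never generates error codes with bits 31:16 set, and
   * VM-Entry fails if they're set; drop them in case userspace provided
   * a 32-bit value via KVM_SET_VCPU_EVENTS. */
  vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE, (u16)error_code);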

Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20220830231614.3580124-3-seanjc@google.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:03:04 -04:00
Sean Christopherson
d953540430 KVM: nVMX: Unconditionally purge queued/injected events on nested "exit"
Drop pending exceptions and events queued for re-injection when leaving
nested guest mode, even if the "exit" is due to VM-Fail, SMI, or forced
by host userspace.  Failure to purge events could result in an event
belonging to L2 being injected into L1.

This _should_ never happen for VM-Fail as all events should be blocked by
nested_run_pending, but it's possible if KVM, not the L1 hypervisor, is
the source of VM-Fail when running vmcs02.

SMI is a nop (barring unknown bugs) as recognition of SMI and thus entry
to SMM is blocked by pending exceptions and re-injected events.

Forced exit is definitely buggy, but has likely gone unnoticed because
userspace probably follows the forced exit with KVM_SET_VCPU_EVENTS (or
some other ioctl() that purges the queue).

Fixes: 4f350c6dbc ("kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20220830231614.3580124-2-seanjc@google.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:03:03 -04:00
Hou Wenlong
794663e13f KVM: x86: Add missing trace points for RDMSR/WRMSR in emulator path
Since the RDMSR/WRMSR emulation uses a separate emulator interface,
the trace points for RDMSR/WRMSR can be added in the emulator path just
like in the normal path.

Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Link: https://lore.kernel.org/r/39181a9f777a72d61a4d0bb9f6984ccbd1de2ea3.1661930557.git.houwenlong.hwl@antgroup.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:03:03 -04:00
Hou Wenlong
36d546d59a KVM: x86: Return emulator error if RDMSR/WRMSR emulation failed
The return value of emulator_{get|set}_msr_with_filter() is confusing,
since MSR access errors and emulator errors are mixed. Although
KVM_MSR_RET_* doesn't conflict with X86EMUL_IO_NEEDED at present, it is
better to convert MSR access errors to emulator errors if an error value
is needed.

So move the "r < 0" handling for WRMSR emulation into the set helper
function, so that only X86EMUL_* values are returned by the helper
functions. Also add an "r < 0" check in the get helper function; KVM
doesn't return -errno there today, but assuming that will always hold
true is unnecessarily risky.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Link: https://lore.kernel.org/r/09b2847fc3bcb8937fb11738f0ccf7be7f61d9dd.1661930557.git.houwenlong.hwl@antgroup.com
[sean: wrap changelog less aggressively]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:03:03 -04:00
Jilin Yuan
b85a97b851 KVM: x86/mmu: fix repeated words in comments
Delete the redundant word 'to'.

Signed-off-by: Jilin Yuan <yuanjilin@cdjrlc.com>
Link: https://lore.kernel.org/r/20220831125217.12313-1-yuanjilin@cdjrlc.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:03:02 -04:00
Vitaly Kuznetsov
37d145ef62 KVM: nVMX: Use cached host MSR_IA32_VMX_MISC value for setting up nested MSR
vmcs_config has cached host MSR_IA32_VMX_MISC value, use it for setting
up nested MSR_IA32_VMX_MISC in nested_vmx_setup_ctls_msrs() and avoid the
redundant rdmsr().

No (real) functional change intended.

Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20220830133737.1539624-34-vkuznets@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:03:02 -04:00
Vitaly Kuznetsov
0809d9b05a KVM: VMX: Cache MSR_IA32_VMX_MISC in vmcs_config
Like other host VMX control MSRs, MSR_IA32_VMX_MISC can be cached in
vmcs_config to avoid the need to re-read it later, e.g. from
cpu_has_vmx_intel_pt() or cpu_has_vmx_shadow_vmcs().

No (real) functional change intended.

Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20220830133737.1539624-33-vkuznets@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:03:01 -04:00
Vitaly Kuznetsov
bcdf201f8a KVM: nVMX: Use sanitized allowed-1 bits for VMX control MSRs
Using raw host MSR values for setting up nested VMX control MSRs is
incorrect as some features need to be disabled, e.g. when KVM runs as
a nested hypervisor on Hyper-V and uses Enlightened VMCS or when a
workaround for IA32_PERF_GLOBAL_CTRL is applied. For non-nested VMX, this
is done in setup_vmcs_config() and the result is stored in vmcs_config.
Use it for setting up allowed-1 bits in nested VMX MSRs too.

Suggested-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20220830133737.1539624-32-vkuznets@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:03:00 -04:00
Vitaly Kuznetsov
66a329be4b KVM: nVMX: Always set required-1 bits of pinbased_ctls to PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR
Similar to exit_ctls_low, entry_ctls_low, and procbased_ctls_low,
pinbased_ctls_low should be set to PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR
and not the host's MSR_IA32_VMX_PINBASED_CTLS value OR'ed with
PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR.
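
I.e. (illustrative, using the nested_vmx_msrs field names):

  /* before (accidental): starts from the raw host MSR value */
  rdmsr(MSR_IA32_VMX_PINBASED_CTLS,
        msrs->pinbased_ctls_low, msrs->pinbased_ctls_high);
  msrs->pinbased_ctls_low |= PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR;

  /* after: required-1 bits are exactly the always-on set */
  msrs->pinbased_ctls_low = PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR;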

Commit eabeaaccfc ("KVM: nVMX: Clean up and fix pin-based
execution controls"), which introduced the '|=', doesn't explain why it
is needed; the change seems rather accidental.

Note: normally, the required-1 portion of MSR_IA32_VMX_PINBASED_CTLS should
be equal to PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR so no behavioral change
is expected, however, it is (in theory) possible to observe something
different there when e.g. KVM is running as a nested hypervisor. Hope
this doesn't happen in practice.

Reported-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20220830133737.1539624-31-vkuznets@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:02:59 -04:00
Vitaly Kuznetsov
9d78d6fb18 KVM: VMX: Move LOAD_IA32_PERF_GLOBAL_CTRL errata handling out of setup_vmcs_config()
As a preparation to reusing the result of setup_vmcs_config() for setting
up nested VMX control MSRs, move LOAD_IA32_PERF_GLOBAL_CTRL errata handling
to vmx_vmexit_ctrl()/vmx_vmentry_ctrl() and print the warning from
hardware_setup(). While it seems reasonable to not expose
LOAD_IA32_PERF_GLOBAL_CTRL controls to the L1 hypervisor on buggy CPUs,
such a change would inevitably break live migration from older KVMs
where the controls are exposed. Keep the status quo for now; the L1
hypervisor itself is supposed to take care of the errata.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20220830133737.1539624-30-vkuznets@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:02:58 -04:00
Jim Mattson
aef46a6476 KVM: x86: VMX: Replace some Intel model numbers with mnemonics
Intel processor code names are more familiar to many readers than
their decimal model numbers.

Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20220830133737.1539624-29-vkuznets@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:02:57 -04:00
Sean Christopherson
64f80ea73b KVM: VMX: Adjust CR3/INVLPG interception for EPT=y at runtime, not setup
Clear the CR3 and INVLPG interception controls at runtime based on
whether or not EPT is being _used_, as opposed to clearing the bits at
setup if EPT is _supported_ in hardware, and then restoring them when EPT
is not used.  Not mucking with the base config will allow using the base
config as the starting point for emulating the VMX capability MSRs.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20220830133737.1539624-28-vkuznets@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:02:57 -04:00
Vitaly Kuznetsov
a83bea73fa KVM: VMX: Add missing CPU based VM execution controls to vmcs_config
As a preparation to reusing the result of setup_vmcs_config() in
nested VMX MSR setup, add the CPU based VM execution controls which KVM
doesn't use but supports for nVMX to KVM_OPT_VMX_CPU_BASED_VM_EXEC_CONTROL
and filter them out in vmx_exec_control().

No functional change intended.

Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20220830133737.1539624-27-vkuznets@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:02:56 -04:00
Vitaly Kuznetsov
f16e47429e KVM: VMX: Add missing VMEXIT controls to vmcs_config
As a preparation to reusing the result of setup_vmcs_config() in
nested VMX MSR setup, add the VMEXIT controls which KVM doesn't
use but supports for nVMX to KVM_OPT_VMX_VM_EXIT_CONTROLS and
filter them out in vmx_vmexit_ctrl().

No functional change intended.

Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20220830133737.1539624-26-vkuznets@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:02:55 -04:00
Vitaly Kuznetsov
e89e1e2302 KVM: VMX: Move CPU_BASED_CR8_{LOAD,STORE}_EXITING filtering out of setup_vmcs_config()
As a preparation to reusing the result of setup_vmcs_config() in
nested VMX MSR setup, move CPU_BASED_CR8_{LOAD,STORE}_EXITING filtering
to vmx_exec_control().

No functional change intended.

Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20220830133737.1539624-25-vkuznets@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:02:55 -04:00
Vitaly Kuznetsov
ee087b4da0 KVM: VMX: Extend VMX controls macro shenanigans
When VMX controls macros are used to set or clear a control bit, make
sure that this bit was checked in setup_vmcs_config() and thus is properly
reflected in vmcs_config.

Opportunistically drop pointless "< 0" check for adjust_vmx_controls()'s
return value.

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20220830133737.1539624-24-vkuznets@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:02:54 -04:00
Sean Christopherson
ebb3c8d409 KVM: VMX: Don't toggle VM_ENTRY_IA32E_MODE for 32-bit kernels/KVM
Don't toggle VM_ENTRY_IA32E_MODE in 32-bit kernels/KVM and instead bug
the VM if KVM attempts to run the guest with EFER.LMA=1. KVM doesn't
support running 64-bit guests with 32-bit hosts.
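
A sketch of the resulting vmx_set_efer() logic (assuming the existing
vm_entry_controls_{set,clear}bit() helpers and KVM_BUG_ON()):

  #ifdef CONFIG_X86_64
          if (efer & EFER_LMA)
                  vm_entry_controls_setbit(vmx, VM_ENTRY_IA32E_MODE);
          else
                  vm_entry_controls_clearbit(vmx, VM_ENTRY_IA32E_MODE);
  #else
          /* 64-bit guests on 32-bit hosts are unsupported; bug the VM. */
          if (KVM_BUG_ON(efer & EFER_LMA, vcpu->kvm))
                  return;
  #endif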

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20220830133737.1539624-23-vkuznets@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:02:53 -04:00
Vitaly Kuznetsov
1dae276569 KVM: VMX: Tweak the special handling of SECONDARY_EXEC_ENCLS_EXITING in setup_vmcs_config()
SECONDARY_EXEC_ENCLS_EXITING is the only control which is conditionally
added to the 'optional' checklist in setup_vmcs_config() but the special
case can be avoided by always checking for its presence first and filtering
out the result later.

Note: the situation when SECONDARY_EXEC_ENCLS_EXITING is present but
cpu_has_sgx() is false is possible when SGX is "soft-disabled", e.g. if
software writes MCE control MSRs or there's an uncorrectable #MC.

Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20220830133737.1539624-22-vkuznets@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:02:52 -04:00
Vitaly Kuznetsov
378c4c1850 KVM: VMX: Check CPU_BASED_{INTR,NMI}_WINDOW_EXITING in setup_vmcs_config()
CPU_BASED_{INTR,NMI}_WINDOW_EXITING controls are toggled dynamically by
vmx_enable_{irq,nmi}_window(), handle_interrupt_window() and
handle_nmi_window(), but setup_vmcs_config() doesn't check their
existence. Add the check and
filter the controls out in vmx_exec_control().

Note: KVM explicitly supports CPUs without VIRTUAL_NMIS and all these CPUs
are supposedly lacking NMI_WINDOW_EXITING too. Adjust cpu_has_virtual_nmis()
accordingly.

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20220830133737.1539624-21-vkuznets@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:02:51 -04:00
Vitaly Kuznetsov
ffaaf5913f KVM: VMX: Check VM_ENTRY_IA32E_MODE in setup_vmcs_config()
VM_ENTRY_IA32E_MODE control is toggled dynamically by vmx_set_efer()
and setup_vmcs_config() doesn't check its existence. On the contrary,
nested_vmx_setup_ctls_msrs() does set it on x86_64. Add the missing
check and filter the bit out in vmx_vmentry_ctrl().

No (real) functional change intended as all existing CPUs supporting
long mode and VMX are supposed to have it.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20220830133737.1539624-20-vkuznets@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:02:51 -04:00
Sean Christopherson
f4c93d1a0e KVM: nVMX: Always emulate PERF_GLOBAL_CTRL VM-Entry/VM-Exit controls
Advertise VM_{ENTRY,EXIT}_LOAD_IA32_PERF_GLOBAL_CTRL as being supported
for nested VMs irrespective of hardware support.  KVM fully emulates
the controls, i.e. manually emulates MSR writes on entry/exit, and never
propagates the guest settings directly to vmcs02.

In addition to allowing L1 VMMs to use the controls on older hardware,
unconditionally advertising the controls will also allow KVM to use its
vmcs01 configuration as the basis for the nested VMX configuration
without causing a regression (due to the errata which causes KVM to "hide"
the control from vmcs01 but not vmcs12).

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20220830133737.1539624-19-vkuznets@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:02:50 -04:00
Sean Christopherson
def9d705c0 KVM: nVMX: Don't propagate vmcs12's PERF_GLOBAL_CTRL settings to vmcs02
Don't propagate vmcs12's VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL to vmcs02.
KVM doesn't disallow L1 from using VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL
even when KVM itself doesn't use the control, e.g. due to the various
CPU errata where the MSR can be corrupted on VM-Exit.

Preserve KVM's (vmcs01) setting to hopefully avoid having to toggle the
bit in vmcs02 at a later point.  E.g. if KVM is loading PERF_GLOBAL_CTRL
when running L1, then odds are good KVM will also load the MSR when
running L2.

Fixes: 8bf00a5299 ("KVM: VMX: add support for switching of PERF_GLOBAL_CTRL")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20220830133737.1539624-18-vkuznets@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:02:49 -04:00
Vitaly Kuznetsov
9bcb90650e KVM: VMX: Get rid of eVMCS specific VMX controls sanitization
With the updated eVMCSv1 definition, there are no known 'problematic'
controls which are exposed in VMX control MSRs but are not present in
eVMCSv1: all known Hyper-V versions either don't expose the new fields
by not setting bits in the VMX feature controls or support the new
eVMCS revision.

Get rid of VMX control MSRs filtering for KVM on Hyper-V.

Note: VMX control MSRs filtering for Hyper-V on KVM
(nested_evmcs_filter_control_msr()) stays as even the updated eVMCSv1
definition doesn't have all the features implemented by KVM and some
fields are still missing. Moreover, nested_evmcs_filter_control_msr()
has to support the original eVMCSv1 version if the VMM so wishes.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20220830133737.1539624-17-vkuznets@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:02:48 -04:00
Vitaly Kuznetsov
4da77090b0 KVM: nVMX: Support PERF_GLOBAL_CTRL with enlightened VMCS
Enlightened VMCS v1 got updated and now includes the required fields
for loading PERF_GLOBAL_CTRL upon VMENTER/VMEXIT features. For KVM on
Hyper-V enablement, KVM can just observe VMX control MSRs and use the
features (with or without eVMCS) when possible.

Hyper-V on KVM is messier as Windows 11 guests fail to boot if the
controls are advertised and a new PV feature flag, CPUID.0x4000000A.EBX
BIT(0), is not set.  Honor the Hyper-V CPUID feature flag to play nice
with Windows guests.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20220830133737.1539624-16-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:02:47 -04:00
Sean Christopherson
3ff8a13d41 KVM: nVMX: WARN once and fail VM-Enter if eVMCS sees VMFUNC[63:32] != 0
WARN and reject nested VM-Enter if KVM is using eVMCS and manages to
allow a non-zero value in the upper 32 bits of VM-function controls.  The
eVMCS code assumes all inputs are 32-bit values and subtly drops the
upper bits.  WARN instead of adding proper "support"; it's unlikely the
upper bits will be defined/used in the next decade.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20220830133737.1539624-15-vkuznets@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:02:47 -04:00
Vitaly Kuznetsov
dea6e140d9 KVM: x86: hyper-v: Cache HYPERV_CPUID_NESTED_FEATURES CPUID leaf
KVM has to check guest visible HYPERV_CPUID_NESTED_FEATURES.EBX CPUID
leaf to know which Enlightened VMCS definition to use (original or 2022
update). Cache the leaf along with other Hyper-V CPUID feature leaves
to make the check quick.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20220830133737.1539624-12-vkuznets@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:02:44 -04:00
Vitaly Kuznetsov
c9d31986e8 KVM: nVMX: Support several new fields in eVMCSv1
Enlightened VMCS v1 definition was updated with new fields, add
support for them for Hyper-V on KVM.

Note: SSP, CET and Guest LBR features are not supported by KVM yet
and 'struct vmcs12' has no corresponding fields.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20220830133737.1539624-11-vkuznets@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:02:43 -04:00
Vitaly Kuznetsov
b19e4ff5e5 KVM: VMX: Define VMCS-to-EVMCS conversion for the new fields
Enlightened VMCS v1 definition was updated with new fields, support
them in KVM by defining VMCS-to-EVMCS conversion.

Note: SSP, CET and Guest LBR features are not supported by KVM yet and
the corresponding fields are not defined in 'enum vmcs_field', leave
them commented out for now.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20220830133737.1539624-10-vkuznets@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:02:42 -04:00
Sean Christopherson
6cce93de28 KVM: nVMX: Use CC() macro to handle eVMCS unsupported controls checks
Locally #define and use the nested virtualization Consistency Check (CC)
macro to handle eVMCS unsupported controls checks.  Using the macro loses
the existing printing of the unsupported controls, but that's a feature
and not a bug.  The existing approach is flawed because the @err param to
trace_kvm_nested_vmenter_failed() is the error code, not the error value.

The eVMCS trickery mostly works as __print_symbolic() falls back to
printing the raw hex value, but that subtly relies on not having a match
between the unsupported value and VMX_VMENTER_INSTRUCTION_ERRORS.

If it's really truly necessary to snapshot the bad value, then the
tracepoint can be extended in the future.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20220830133737.1539624-9-vkuznets@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:02:42 -04:00
Vitaly Kuznetsov
f4d361b4c2 KVM: nVMX: Refactor unsupported eVMCS controls logic to use 2-d array
Refactor the handling of unsupported eVMCS to use a 2-d array to store
the set of unsupported controls.  KVM's handling of eVMCS is completely
broken as there is no way for userspace to query which features are
unsupported, nor does KVM prevent userspace from attempting to enable
unsupported features.  A future commit will remedy that by filtering and
enforcing unsupported features when eVMCS is enabled, but that needs to be opt-in
from userspace to avoid breakage, i.e. KVM needs to maintain its legacy
behavior by snapshotting the exact set of controls that are currently
(un)supported by eVMCS.

No functional change intended.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
[sean: split to standalone patch, write changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20220830133737.1539624-8-vkuznets@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:02:41 -04:00
Sean Christopherson
85ab071af8 KVM: nVMX: Treat eVMCS as enabled for guest iff Hyper-V is also enabled
When querying whether or not eVMCS is enabled on behalf of the guest,
treat eVMCS as enabled if and only if Hyper-V is enabled/exposed to the
guest.

Note, flows that come from the host, e.g. KVM_SET_NESTED_STATE, must NOT
check for Hyper-V being enabled as KVM doesn't require guest CPUID to be
set before most ioctls().

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20220830133737.1539624-7-vkuznets@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:02:40 -04:00
Sean Christopherson
3be29eb7b5 KVM: x86: Report error when setting CPUID if Hyper-V allocation fails
Return -ENOMEM back to userspace if allocating the Hyper-V vCPU struct
fails when enabling Hyper-V in guest CPUID.  Silently ignoring failure
means that KVM will not have an up-to-date CPUID cache if allocating the
struct succeeds later on, e.g. when activating SynIC.

Rejecting the CPUID operation also guarantees that vcpu->arch.hyperv is
non-NULL if hyperv_enabled is true, which will allow for additional
cleanup, e.g. in the eVMCS code.

Note, the initialization needs to be done before CPUID is set, and more
subtly before kvm_check_cpuid(), which potentially enables dynamic
XFEATURES.  Sadly, there's no easy way to avoid exposing Hyper-V details
to CPUID or vice versa.  Expose kvm_hv_vcpu_init() and the Hyper-V CPUID
signature to CPUID instead of exposing cpuid_entry2_find() outside of
CPUID code.  It's hard to envision kvm_hv_vcpu_init() being misused,
whereas cpuid_entry2_find() absolutely shouldn't be used outside of core
CPUID code.

Fixes: 10d7bf1e46 ("KVM: x86: hyper-v: Cache guest CPUID leaves determining features availability")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20220830133737.1539624-6-vkuznets@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:02:39 -04:00
Sean Christopherson
1cac8d9f6b KVM: x86: Check for existing Hyper-V vCPU in kvm_hv_vcpu_init()
When potentially allocating/initializing the Hyper-V vCPU struct, check
for an existing instance in kvm_hv_vcpu_init() instead of requiring
callers to perform the check.  Relying on callers to do the check is
risky as it's all too easy for KVM to overwrite vcpu->arch.hyperv and
leak memory, and it adds additional burden on callers without much
benefit.
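
A sketch of the guard (allocation/init details elided):

  static int kvm_hv_vcpu_init(struct kvm_vcpu *vcpu)
  {
          struct kvm_vcpu_hv *hv_vcpu;

          /* Nothing to do if the Hyper-V vCPU already exists. */
          if (to_hv_vcpu(vcpu))
                  return 0;

          hv_vcpu = kzalloc(sizeof(*hv_vcpu), GFP_KERNEL_ACCOUNT);
          if (!hv_vcpu)
                  return -ENOMEM;

          /* ... remaining initialization ... */
          return 0;
  }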

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Wei Liu <wei.liu@kernel.org>
Link: https://lore.kernel.org/r/20220830133737.1539624-5-vkuznets@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:02:38 -04:00
Vitaly Kuznetsov
ce2196b831 KVM: x86: Zero out entire Hyper-V CPUID cache before processing entries
Wipe the whole 'hv_vcpu->cpuid_cache' with memset() instead of having to
zero each particular member when the corresponding CPUID entry was not
found.
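
I.e. (illustrative):

  memset(&hv_vcpu->cpuid_cache, 0, sizeof(hv_vcpu->cpuid_cache));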

No functional change intended.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
[sean: split to separate patch]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Wei Liu <wei.liu@kernel.org>
Link: https://lore.kernel.org/r/20220830133737.1539624-4-vkuznets@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:02:38 -04:00
Uros Bizjak
57abfa11ba KVM: VMX: Do not declare vmread_error() asmlinkage
There is no need to declare vmread_error() asmlinkage, its arguments
can be passed via registers for both 32-bit and 64-bit targets.
Function argument registers are considered call-clobbered registers,
they are saved in the trampoline just before the function call and
restored afterwards.

Dropping "asmlinkage" patch unifies trampoline function argument handling
between 32-bit and 64-bit targets and improves generated code for 32-bit
targets.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Link: https://lore.kernel.org/r/20220817144045.3206-1-ubizjak@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:02:35 -04:00
Liam Ni
e390f4d69d KVM: x86: Clean up ModR/M "reg" initialization in reg op decoding
Refactor decode_register_operand() to get the ModR/M register if and
only if the instruction uses a ModR/M encoding to make it more obvious
how the register operand is retrieved.

Signed-off-by: Liam Ni <zhiguangni01@gmail.com>
Link: https://lore.kernel.org/r/20220908141210.1375828-1-zhiguangni01@zhaoxin.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:02:34 -04:00
Mingwei Zhang
02dfc44f20 KVM: x86: Print guest pgd in kvm_nested_vmenter()
Print guest pgd in kvm_nested_vmenter() to enrich the information for
tracing. When tdp is enabled, print the value of tdp page table (EPT/NPT);
when tdp is disabled, print the value of non-nested CR3.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Link: https://lore.kernel.org/r/20220825225755.907001-4-mizhang@google.com
[sean: print nested_cr3 vs. nested_eptp vs. guest_cr3]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:02:33 -04:00
David Matlack
37ef0be269 KVM: nVMX: Add tracepoint for nested VM-Enter
Call trace_kvm_nested_vmenter() during nested VMLAUNCH/VMRESUME to bring
parity with nSVM's usage of the tracepoint during nested VMRUN.

Attempt to use analogous VMCS fields to the VMCB fields that are
reported in the SVM case:

"int_ctl": 32-bit field of the VMCB that the CPU uses to deliver virtual
interrupts. The analogous VMCS field is the 16-bit "guest interrupt
status".

"event_inj": 32-bit field of VMCB that is used to inject events
(exceptions and interrupts) into the guest. The analogous VMCS field
is the "VM-entry interruption-information field".

"npt_enabled": 1 when the VCPU has enabled nested paging. The analogous
VMCS field is the enable-EPT execution control.

"npt_addr": 64-bit field when the VCPU has enabled nested paging. The
analogous VMCS field is the ept_pointer.

Signed-off-by: David Matlack <dmatlack@google.com>
[move the code into the nested_vmx_enter_non_root_mode().]
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Link: https://lore.kernel.org/r/20220825225755.907001-3-mizhang@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:02:32 -04:00
Mingwei Zhang
89e54ec592 KVM: x86: Update trace function for nested VM entry to support VMX
Update the trace function for nested VM entry to support VMX. The existing
trace function only supports nested SVM and the information printed out is
AMD-specific.

So, rename trace_kvm_nested_vmrun() to trace_kvm_nested_vmenter(), since
'vmenter' is generic. Add a new field 'isa' to distinguish Intel and AMD,
and update the output to print VMX/SVM-related naming respectively, e.g.,
vmcb vs. vmcs; npt vs. ept.

Opportunistically update the call site of trace_kvm_nested_vmenter() to
make one line per parameter.

Signed-off-by: Mingwei Zhang <mizhang@google.com>
Link: https://lore.kernel.org/r/20220825225755.907001-2-mizhang@google.com
[sean: align indentation, s/update/rename in changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:02:32 -04:00
Sean Christopherson
bff0adc40c KVM: x86: Use u64 for address and error code in page fault tracepoint
Track the address and error code as 64-bit values in the page fault
tracepoint.  When TDP is enabled, the address is a GPA and thus can be a
64-bit value even on 32-bit hosts.  And SVM's #NPF generates 64-bit
error codes.

Opportunistically clean up the formatting.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:02:31 -04:00
Wonhyuk Yang
faa03b3972 KVM: Add extra information in kvm_page_fault trace point
Currently, the kvm_page_fault trace point provides the fault_address and
error code. However, that is not enough to determine which vCPU and
instruction caused a kvm_page_fault. So add the vcpu id and instruction
pointer to the kvm_page_fault trace point.

Cc: Baik Song An <bsahn@etri.re.kr>
Cc: Hong Yeon Kim <kimhy@etri.re.kr>
Cc: Taeung Song <taeung@reallinux.co.kr>
Cc: linuxgeek@linuxgeek.io
Signed-off-by: Wonhyuk Yang <vvghjk1234@gmail.com>
Link: https://lore.kernel.org/r/20220510071001.87169-1-vvghjk1234@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:02:30 -04:00
Paolo Bonzini
db25eb87ad KVM: SVM: remove unnecessary check on INIT intercept
Since svm_check_nested_events() is now handling INIT signals, there is
no need to latch it until the VMEXIT is injected.  The only condition
under which INIT signals are latched is GIF=0.

Suggested-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20220819165643.83692-1-pbonzini@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:02:28 -04:00
Uros Bizjak
afe30b59d3 KVM/VMX: Avoid stack engine synchronization uop in __vmx_vcpu_run
Avoid instructions with explicit uses of the stack pointer between
instructions that implicitly refer to it. The sequence of
POP %reg; ADD $x, %RSP; POP %reg forces emission of a synchronization
uop to synchronize the value of the stack pointer between the stack
engine and the out-of-order core.

Using POP with a dummy register instead of ADD $x, %RSP results in
smaller code size and faster code.

The patch also fixes the reference to the wrong register in the
nearby comment.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Link: https://lore.kernel.org/r/20220816211010.25693-1-ubizjak@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 12:02:27 -04:00
Sean Christopherson
50b2d49baf KVM: x86: Inject #UD on emulated XSETBV if XSAVES isn't enabled
Inject #UD when emulating XSETBV if CR4.OSXSAVE is not set.  This also
covers the "XSAVE not supported" check, as setting CR4.OSXSAVE=1 #GPs if
XSAVE is not supported (and userspace gets to keep the pieces if it
forces incoherent vCPU state).

Add a comment to kvm_emulate_xsetbv() to call out that the CPU checks
CR4.OSXSAVE before checking for intercepts.  AMD'S APM implies that #UD
has priority (says that intercepts are checked before #GP exceptions),
while Intel's SDM says nothing about interception priority.  However,
testing on hardware shows that both AMD and Intel CPUs prioritize the #UD
over interception.

Fixes: 02d4160fbd ("x86: KVM: add xsetbv to the emulator")
Cc: stable@vger.kernel.org
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220824033057.3576315-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-22 17:04:20 -04:00
Dr. David Alan Gilbert
a1020a25e6 KVM: x86: Always enable legacy FP/SSE in allowed user XFEATURES
Allow FP and SSE state to be saved and restored via KVM_{G,SET}_XSAVE on
XSAVE-capable hosts even if their bits are not exposed to the guest via
XCR0.

Failing to allow FP+SSE first showed up as a QEMU live migration failure,
where migrating a VM from a pre-XSAVE host, e.g. Nehalem, to an XSAVE
host failed due to KVM rejecting KVM_SET_XSAVE.  However, the bug also
causes problems even when migrating between XSAVE-capable hosts as
KVM_GET_XSAVE won't set any bits in user_xfeatures if XSAVE isn't exposed
to the guest, i.e. KVM will fail to actually migrate FP+SSE.

Because KVM_{G,S}ET_XSAVE are designed to allow migrating between
hosts with and without XSAVE, KVM_GET_XSAVE on a non-XSAVE host (by way
of fpu_copy_guest_fpstate_to_uabi()) always sets the FP+SSE bits in the
header so that KVM_SET_XSAVE will work even if the new host supports
XSAVE.
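A standalone sketch of the resulting policy; the bit positions follow the
architectural XCR0 layout, and the helper name is made up:

#include <stdint.h>
#include <stdio.h>

#define XFEATURE_MASK_FP        (1ull << 0)     /* x87 state */
#define XFEATURE_MASK_SSE       (1ull << 1)     /* SSE/XMM state */
#define XFEATURE_MASK_FPSSE     (XFEATURE_MASK_FP | XFEATURE_MASK_SSE)

/* FP+SSE are always allowed for save/restore, even if the guest's
 * supported XCR0 is zero because XSAVE isn't exposed to it. */
static uint64_t user_xfeatures(uint64_t guest_supported_xcr0)
{
        return guest_supported_xcr0 | XFEATURE_MASK_FPSSE;
}

int main(void)
{
        printf("%#llx\n", (unsigned long long)user_xfeatures(0));       /* 0x3 */
        return 0;
}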

Fixes: ad856280dd ("x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0")
bz: https://bugzilla.redhat.com/show_bug.cgi?id=2079311
Cc: stable@vger.kernel.org
Cc: Leonardo Bras <leobras@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
[sean: add comment, massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220824033057.3576315-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-22 17:04:19 -04:00
Sean Christopherson
ee519b3a2a KVM: x86: Reinstate kvm_vcpu_arch.guest_supported_xcr0
Reinstate the per-vCPU guest_supported_xcr0 by partially reverting
commit 988896bb6182; the implicit assessment that guest_supported_xcr0 is
always the same as guest_fpu.fpstate->user_xfeatures was incorrect.

kvm_vcpu_after_set_cpuid() isn't the only place that sets user_xfeatures,
as user_xfeatures is set to fpu_user_cfg.default_features when guest_fpu
is allocated via fpu_alloc_guest_fpstate() => __fpstate_reset().
guest_supported_xcr0 on the other hand is zero-allocated.  If userspace
never invokes KVM_SET_CPUID2, supported XCR0 will be '0', whereas the
allowed user XFEATURES will be non-zero.

Practically speaking, the edge case likely doesn't matter as no sane
userspace will live migrate a VM without ever doing KVM_SET_CPUID2. The
primary motivation is to prepare for KVM intentionally and explicitly
setting bits in user_xfeatures that are not set in guest_supported_xcr0.

Because KVM_{G,S}ET_XSAVE can be used to save/restore FP+SSE state even
if the host doesn't support XSAVE, KVM needs to set the FP+SSE bits in
user_xfeatures even if they're not allowed in XCR0, e.g. because XCR0
isn't exposed to the guest.  At that point, the simplest fix is to track
the two things separately (allowed save/restore vs. allowed XCR0).

Fixes: 988896bb61 ("x86/kvm/fpu: Remove kvm_vcpu_arch.guest_supported_xcr0")
Cc: stable@vger.kernel.org
Cc: Leonardo Bras <leobras@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220824033057.3576315-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-22 17:04:19 -04:00
Miaohe Lin
604f533262 KVM: x86/mmu: add missing update to max_mmu_rmap_size
The update to the max_mmu_rmap_size statistic was unintentionally removed by
commit 4293ddb788 ("KVM: x86/mmu: Remove redundant spte present check
in mmu_set_spte"). Add the missing update, otherwise max_mmu_rmap_size will
always be a nonsensical 0.

Fixes: 4293ddb788 ("KVM: x86/mmu: Remove redundant spte present check in mmu_set_spte")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Message-Id: <20220907080657.42898-1-linmiaohe@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-22 17:03:20 -04:00
Paolo Bonzini
29250ba51b PCI interpretation compile fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEwGNS88vfc9+v45Yq41TmuOI4ufgFAmMMpCYACgkQ41TmuOI4
 ufirVA/+NpbDVgiGehgV9ngukDFWNNBOvPeKFIjnIWmhcTvRtQ/YWIALKJSinaVl
 jyPOTM0cEcrxhIPM2kHmnkZhwECRwM32Ox2N6ftmOPEJk8xrP6glfCuwJ6Yl/pNU
 EBLWzgZGRZyDGmCNo0jvaycPIh/WkeNKGwe/3FlI8gHx7iALvihE/0IFoPlCEWky
 dbetAAx5W0fRCoKscbMh0q2mDXuz4KBu3yij9wMGlhSzLcrSr9qejCmjEdERruYM
 c9hrLtIcy8cS9Epa5kOnz0PekK4EUlecB66BjnvsXabjTVNlF8MJwEG5EZfGxP+a
 jdthEjXaO5u7z4Y5NKxDwTFslmzX3PPbdBka9VGmhLkkV1/jGZ5RKKVY7+MljV74
 ZhhVkwUB0D3c7Wbpo+S1mVY6EHi/RO4TkaPCORsmCvwYXFZyhNsteKnWdRbcu3Iu
 aIjKamGeEfe8Mqojmzaksql9ScNnW9/INUyObz5UruGXFg2uFd0Vae20ZBpn7DYh
 gBYKZAFArNzHUUpP5zMBPnwpTGt8MuurI7rnVEvVRqsy4SofbYZYuFdxpn2obzui
 zPQY/lg6eXUWZkbfqJNrA1JN3cTSxVQCO96z8yYp/RazYiKMGUgZ7Ps7WlZkxRNE
 OSMkssXz6cIL1QPbu75iu6FMQGcVAywH2EN8jJh6E0ceUcVhsB0=
 =ENc3
 -----END PGP SIGNATURE-----

Merge tag 'kvm-s390-master-6.0-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD

PCI interpretation compile fixes
2022-09-01 19:21:27 -04:00
Paolo Bonzini
22c6a0ef6b KVM: x86: check validity of argument to KVM_SET_MP_STATE
An invalid argument to KVM_SET_MP_STATE has no effect other than making the
vCPU fail to run at the next KVM_RUN.  Since it is extremely unlikely that
any userspace is relying on it, fail with -EINVAL just like for other
architectures.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-01 19:20:59 -04:00
Miaohe Lin
3c0ba05ce9 KVM: x86: fix memoryleak in kvm_arch_vcpu_create()
When allocating memory for mci_ctl2_banks fails, KVM doesn't release
mce_banks, leading to a memory leak. Fix this issue by calling kfree()
for it when kcalloc() fails.
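The fix is the usual "free the sibling allocation on failure" pattern; a
userspace stand-in using calloc()/free() and a made-up struct:

#include <stdint.h>
#include <stdlib.h>

struct mce_state {
        uint64_t *mce_banks;
        uint64_t *mci_ctl2_banks;
};

static int mce_state_init(struct mce_state *s, size_t nr_banks)
{
        s->mce_banks = calloc(nr_banks * 4, sizeof(*s->mce_banks));
        s->mci_ctl2_banks = calloc(nr_banks, sizeof(*s->mci_ctl2_banks));
        if (!s->mce_banks || !s->mci_ctl2_banks) {
                /* Release whichever allocation did succeed. */
                free(s->mce_banks);
                free(s->mci_ctl2_banks);
                s->mce_banks = s->mci_ctl2_banks = NULL;
                return -1;
        }
        return 0;
}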

Fixes: 281b52780b ("KVM: x86: Add emulation for MSR_IA32_MCx_CTL2 MSRs.")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Message-Id: <20220901122300.22298-1-linmiaohe@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-01 19:20:58 -04:00
Jim Mattson
0204750bd4 KVM: x86: Mask off unsupported and unknown bits of IA32_ARCH_CAPABILITIES
KVM should not claim to virtualize unknown IA32_ARCH_CAPABILITIES
bits. When kvm_get_arch_capabilities() was originally written, there
were only a few bits defined in this MSR, and KVM could virtualize all
of them. However, over the years, several bits have been defined that
KVM cannot just blindly pass through to the guest without additional
work (such as virtualizing an MSR promised by the
IA32_ARCH_CAPABILITIES feature bit).

Define a mask of supported IA32_ARCH_CAPABILITIES bits, and mask off
any other bits that are set in the hardware MSR.
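A sketch of the masking approach; the bit names and values below are
illustrative, and the authoritative definitions live in the kernel's
msr-index.h:

#include <stdint.h>
#include <stdio.h>

#define ARCH_CAP_RDCL_NO                (1ull << 0)
#define ARCH_CAP_IBRS_ALL               (1ull << 1)
#define ARCH_CAP_SKIP_L1DFL_VMENTRY     (1ull << 3)
#define ARCH_CAP_SSB_NO                 (1ull << 4)
#define ARCH_CAP_MDS_NO                 (1ull << 5)

/* Only bits KVM knows how to virtualize are advertised to the guest. */
#define KVM_SUPPORTED_ARCH_CAP \
        (ARCH_CAP_RDCL_NO | ARCH_CAP_IBRS_ALL | ARCH_CAP_SKIP_L1DFL_VMENTRY | \
         ARCH_CAP_SSB_NO | ARCH_CAP_MDS_NO)

static uint64_t guest_arch_capabilities(uint64_t host_msr)
{
        return host_msr & KVM_SUPPORTED_ARCH_CAP;
}

int main(void)
{
        /* Pretend hardware sets an unknown bit 24: it gets masked off. */
        printf("%#llx\n", (unsigned long long)guest_arch_capabilities(0x100003bull));
        return 0;
}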

Cc: Paolo Bonzini <pbonzini@redhat.com>
Fixes: 5b76a3cff0 ("KVM: VMX: Tell the nested hypervisor to skip L1D flush on vmentry")
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Vipin Sharma <vipinsh@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-Id: <20220830174947.2182144-1-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-01 19:20:58 -04:00
Yosry Ahmed
43a063cab3 KVM: x86/mmu: count KVM mmu usage in secondary pagetable stats.
Count the pages used by KVM mmu on x86 in memory stats under secondary
pagetable stats (e.g. "SecPageTables" in /proc/meminfo) to give better
visibility into the memory consumption of KVM mmu in a similar way to
how normal user page tables are accounted.

Add the inner helper in common KVM, ARM will also use it to count stats
in a future commit.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Acked-by: Marc Zyngier <maz@kernel.org> # generic KVM changes
Link: https://lore.kernel.org/r/20220823004639.2387269-3-yosryahmed@google.com
Link: https://lore.kernel.org/r/20220823004639.2387269-4-yosryahmed@google.com
[sean: squash x86 usage to workaround modpost issues]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2022-08-30 07:41:12 -07:00
Miaohe Lin
d7c9bfb9ca KVM: x86/mmu: fix memoryleak in kvm_mmu_vendor_module_init()
When register_shrinker() fails, KVM doesn't release the percpu counter
kvm_total_used_mmu_pages, leading to a memory leak. Fix this issue by
calling percpu_counter_destroy() when register_shrinker() fails.

Fixes: ab271bd4df ("x86: kvm: propagate register_shrinker return code")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Link: https://lore.kernel.org/r/20220823063237.47299-1-linmiaohe@huawei.com
[sean: tweak shortlog and changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2022-08-24 13:47:49 -07:00
Michal Luczaj
6aa5c47c35 KVM: x86/emulator: Fix handing of POP SS to correctly set interruptibility
The emulator checks the wrong variable while setting the CPU
interruptibility state, the target segment is embedded in the instruction
opcode, not the ModR/M register.  Fix the condition.

Signed-off-by: Michal Luczaj <mhal@rbox.co>
Fixes: a5457e7bcf ("KVM: emulate: POP SS triggers a MOV SS shadow too")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20220821215900.1419215-1-mhal@rbox.co
Signed-off-by: Sean Christopherson <seanjc@google.com>
2022-08-24 13:45:40 -07:00
Junaid Shahid
b24ede2253 kvm: x86: Do proper cleanup if kvm_x86_ops->vm_init() fails
If vm_init() fails [which can happen, for instance, if a memory
allocation fails during avic_vm_init()], we need to clean up some
state in order to avoid resource leaks.

Signed-off-by: Junaid Shahid <junaids@google.com>
Link: https://lore.kernel.org/r/20220729224329.323378-1-junaids@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2022-08-24 13:41:59 -07:00
Nick Desaulniers
a0a12c3ed0 asm goto: eradicate CC_HAS_ASM_GOTO
GCC has supported asm goto since 4.5, and Clang has since version 9.0.0.
The minimum supported versions of these tools for the build according to
Documentation/process/changes.rst are 5.1 and 11.0.0 respectively.

Remove the feature detection script, Kconfig option, and clean up some
fallback code that is no longer supported.

The removed script was also testing for a GCC specific bug that was
fixed in the 4.7 release.

Also remove workarounds for bpftrace using clang older than 9.0.0, since
other BPF backend fixes are required at this point.

Link: https://lore.kernel.org/lkml/CAK7LNATSr=BXKfkdW8f-H5VT_w=xBpT2ZQcZ7rm6JfkdE+QnmA@mail.gmail.com/
Link: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48637
Acked-by: Borislav Petkov <bp@suse.de>
Suggested-by: Masahiro Yamada <masahiroy@kernel.org>
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-08-21 10:06:28 -07:00
Jim Mattson
020dac4187 KVM: VMX: Heed the 'msr' argument in msr_write_intercepted()
Regardless of the 'msr' argument passed to the VMX version of
msr_write_intercepted(), the function always checks to see if a
specific MSR (IA32_SPEC_CTRL) is intercepted for write.  This behavior
seems unintentional and unexpected.

Modify the function so that it checks to see if the provided 'msr'
index is intercepted for write.
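A toy version of the corrected helper (flat bitmap, made-up layout) showing
the shape of the fix, i.e. testing the bit for the MSR that was actually
passed in:

#include <stdbool.h>
#include <stdint.h>

/* Toy write-intercept bitmap: bit N set => writes to MSR N are intercepted. */
static bool msr_write_intercepted(uint32_t msr, const uint64_t *bitmap)
{
        return (bitmap[msr / 64] >> (msr % 64)) & 1;
}

int main(void)
{
        uint64_t bitmap[16] = { 0 };

        bitmap[0x48 / 64] |= 1ull << (0x48 % 64);       /* intercept MSR 0x48 */
        return msr_write_intercepted(0x48, bitmap) ? 0 : 1;
}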

Fixes: 67f4b9969c ("KVM: nVMX: Handle dynamic MSR intercept toggling")
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220810213050.2655000-1-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-19 07:38:04 -04:00
Junaid Shahid
b64d740ea7 kvm: x86: mmu: Always flush TLBs when enabling dirty logging
When A/D bits are not available, KVM uses a software access tracking
mechanism, which involves making the SPTEs inaccessible. However,
the clear_young() MMU notifier does not flush TLBs. So it is possible
that there may still be stale, potentially writable, TLB entries.
This is usually fine, but can be problematic when enabling dirty
logging, because it currently only does a TLB flush if any SPTEs were
modified. But if all SPTEs are in access-tracked state, then there
won't be a TLB flush, which means that the guest could still possibly
write to memory and not have it reflected in the dirty bitmap.

So just unconditionally flush the TLBs when enabling dirty logging.
As an alternative, KVM could explicitly check the MMU-Writable bit when
write-protecting SPTEs to decide if a flush is needed (instead of
checking the Writable bit), but given that a flush almost always happens
anyway, just making it unconditional seems simpler.

Signed-off-by: Junaid Shahid <junaids@google.com>
Message-Id: <20220810224939.2611160-1-junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-19 07:38:03 -04:00
Junaid Shahid
1441ca1494 kvm: x86: mmu: Drop the need_remote_flush() function
This is only used by kvm_mmu_pte_write(), which no longer actually
creates the new SPTE and instead just clears the old SPTE. So we
just need to check if the old SPTE was shadow-present instead of
calling need_remote_flush(). Hence we can drop this function. It was
incomplete anyway as it didn't take access-tracking into account.

This patch should not result in any functional change.

Signed-off-by: Junaid Shahid <junaids@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220723024316.2725328-1-junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-19 07:38:02 -04:00
Josh Poimboeuf
3d9606b0e0 x86/kvm: Fix "missing ENDBR" BUG for fastop functions
The following BUG was reported:

  traps: Missing ENDBR: andw_ax_dx+0x0/0x10 [kvm]
  ------------[ cut here ]------------
  kernel BUG at arch/x86/kernel/traps.c:253!
  invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
   <TASK>
   asm_exc_control_protection+0x2b/0x30
  RIP: 0010:andw_ax_dx+0x0/0x10 [kvm]
  Code: c3 cc cc cc cc 0f 1f 44 00 00 66 0f 1f 00 48 19 d0 c3 cc cc cc
        cc 0f 1f 40 00 f3 0f 1e fa 20 d0 c3 cc cc cc cc 0f 1f 44 00 00
        <66> 0f 1f 00 66 21 d0 c3 cc cc cc cc 0f 1f 40 00 66 0f 1f 00 21
        d0

   ? andb_al_dl+0x10/0x10 [kvm]
   ? fastop+0x5d/0xa0 [kvm]
   x86_emulate_insn+0x822/0x1060 [kvm]
   x86_emulate_instruction+0x46f/0x750 [kvm]
   complete_emulated_mmio+0x216/0x2c0 [kvm]
   kvm_arch_vcpu_ioctl_run+0x604/0x650 [kvm]
   kvm_vcpu_ioctl+0x2f4/0x6b0 [kvm]
   ? wake_up_q+0xa0/0xa0

The BUG occurred because the ENDBR in the andw_ax_dx() fastop function
had been incorrectly "sealed" (converted to a NOP) by apply_ibt_endbr().

Objtool marked it to be sealed because KVM has no compile-time
references to the function.  Instead KVM calculates its address at
runtime.

Prevent objtool from annotating fastop functions as sealable by creating
throwaway dummy compile-time references to the functions.

Fixes: 6649fa876d ("x86/ibt,kvm: Add ENDBR to fastops")
Reported-by: Pengfei Xu <pengfei.xu@intel.com>
Debugged-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Message-Id: <0d4116f90e9d0c1b754bb90c585e6f0415a1c508.1660837839.git.jpoimboe@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-19 04:05:42 -04:00
Josh Poimboeuf
22472d1260 x86/kvm: Simplify FOP_SETCC()
SETCC_ALIGN and FOP_ALIGN are both 16.  Remove the special casing for
FOP_SETCC() and just make it a normal fastop.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Message-Id: <7c13d94d1a775156f7e36eed30509b274a229140.1660837839.git.jpoimboe@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-19 04:05:42 -04:00
Chao Peng
20ec3ebd70 KVM: Rename mmu_notifier_* to mmu_invalidate_*
The motivation for this renaming is to make these variables and related
helper functions less mmu_notifier-bound, so that they can also be used for
non-mmu_notifier-based page invalidation. mmu_invalidate_* was chosen to
better describe the purpose of 'invalidating' a page that those
variables are used for.

  - mmu_notifier_seq/range_start/range_end are renamed to
    mmu_invalidate_seq/range_start/range_end.

  - mmu_notifier_retry{_hva} helper functions are renamed to
    mmu_invalidate_retry{_hva}.

  - mmu_notifier_count is renamed to mmu_invalidate_in_progress to
    avoid confusion with mn_active_invalidate_count.

  - While here, also update kvm_inc/dec_notifier_count() to
    kvm_mmu_invalidate_begin/end() to match the change for
    mmu_notifier_count.

No functional change intended.

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Message-Id: <20220816125322.1110439-3-chao.p.peng@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-19 04:05:41 -04:00
Jason Wang
3163600cab x86: Fix various duplicate-word comment typos
[ mingo: Consolidated 4 very similar patches into one, it's silly to spread this out. ]

Signed-off-by: Jason Wang <wangborong@cdjrlc.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220715044809.20572-1-wangborong@cdjrlc.com
2022-08-15 19:17:52 +02:00
Linus Torvalds
e18a90427c * Xen timer fixes
* Documentation formatting fixes
 
 * Make rseq selftest compatible with glibc-2.35
 
 * Fix handling of illegal LEA reg, reg
 
 * Cleanup creation of debugfs entries
 
 * Fix steal time cache handling bug
 
 * Fixes for MMIO caching
 
 * Optimize computation of number of LBRs
 
 * Fix uninitialized field in guest_maxphyaddr < host_maxphyaddr path
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmL0qwIUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroML1gf/SK6by+Gi0r7WSkrDjU94PKZ8D6Y3
 fErMhratccc9IfL3p90IjCVhEngfdQf5UVHExA5TswgHHAJTpECzuHya9TweQZc5
 2rrTvufup0MNALfzkSijrcI80CBvrJc6JyOCkv0BLp7yqXUrnrm0OOMV2XniS7y0
 YNn2ZCy44tLqkNiQrLhJQg3EsXu9l7okGpHSVO6iZwC7KKHvYkbscVFa/AOlaAwK
 WOZBB+1Ee+/pWhxsngM1GwwM3ZNU/jXOSVjew5plnrD4U7NYXIDATszbZAuNyxqV
 5gi+wvTF1x9dC6Tgd3qF7ouAqtT51BdRYaI9aYHOYgvzqdNFHWJu3XauDQ==
 =vI6Q
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull more kvm updates from Paolo Bonzini:

 - Xen timer fixes

 - Documentation formatting fixes

 - Make rseq selftest compatible with glibc-2.35

 - Fix handling of illegal LEA reg, reg

 - Cleanup creation of debugfs entries

 - Fix steal time cache handling bug

 - Fixes for MMIO caching

 - Optimize computation of number of LBRs

 - Fix uninitialized field in guest_maxphyaddr < host_maxphyaddr path

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (26 commits)
  KVM: x86/MMU: properly format KVM_CAP_VM_DISABLE_NX_HUGE_PAGES capability table
  Documentation: KVM: extend KVM_CAP_VM_DISABLE_NX_HUGE_PAGES heading underline
  KVM: VMX: Adjust number of LBR records for PERF_CAPABILITIES at refresh
  KVM: VMX: Use proper type-safe functions for vCPU => LBRs helpers
  KVM: x86: Refresh PMU after writes to MSR_IA32_PERF_CAPABILITIES
  KVM: selftests: Test all possible "invalid" PERF_CAPABILITIES.LBR_FMT vals
  KVM: selftests: Use getcpu() instead of sched_getcpu() in rseq_test
  KVM: selftests: Make rseq compatible with glibc-2.35
  KVM: Actually create debugfs in kvm_create_vm()
  KVM: Pass the name of the VM fd to kvm_create_vm_debugfs()
  KVM: Get an fd before creating the VM
  KVM: Shove vcpu stats_id init into kvm_vcpu_init()
  KVM: Shove vm stats_id init into kvm_create_vm()
  KVM: x86/mmu: Add sanity check that MMIO SPTE mask doesn't overlap gen
  KVM: x86/mmu: rename trace function name for asynchronous page fault
  KVM: x86/xen: Stop Xen timer before changing IRQ
  KVM: x86/xen: Initialize Xen timer only once
  KVM: SVM: Disable SEV-ES support if MMIO caching is disable
  KVM: x86/mmu: Fully re-evaluate MMIO caching when SPTE masks change
  KVM: x86: Tag kvm_mmu_x86_module_init() with __init
  ...
2022-08-11 12:10:08 -07:00
Sean Christopherson
6348aafa8d KVM: VMX: Adjust number of LBR records for PERF_CAPABILITIES at refresh
Now that the PMU is refreshed when MSR_IA32_PERF_CAPABILITIES is written
by host userspace, zero out the number of LBR records for a vCPU during
PMU refresh if PMU_CAP_LBR_FMT is not set in PERF_CAPABILITIES instead of
handling the check at run-time.

guest_cpuid_has() is expensive due to the linear search of guest CPUID
entries, intel_pmu_lbr_is_enabled() is checked on every VM-Enter, _and_
simply enumerating the same "Model" as the host causes KVM to set the
number of LBR records to a non-zero value.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220727233424.2968356-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-10 15:08:30 -04:00
Sean Christopherson
7de8e5b6b1 KVM: VMX: Use proper type-safe functions for vCPU => LBRs helpers
Turn vcpu_to_lbr_desc() and vcpu_to_lbr_records() into functions in order
to provide type safety, to document exactly what they return, and to
allow consuming the helpers in vmx.h.  Move the definitions as necessary
(the macros "reference" to_vmx() before its definition).

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220727233424.2968356-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-10 15:08:30 -04:00
Sean Christopherson
17a024a8b9 KVM: x86: Refresh PMU after writes to MSR_IA32_PERF_CAPABILITIES
Refresh the PMU if userspace modifies MSR_IA32_PERF_CAPABILITIES.  KVM
consumes the vCPU's PERF_CAPABILITIES when enumerating PEBS support, but
relies on CPUID updates to refresh the PMU.  I.e. KVM will do the wrong
thing if userspace stuffs PERF_CAPABILITIES _after_ setting guest CPUID.

Opportunistically fix a curly-brace indentation.

Fixes: c59a1f106f ("KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS")
Cc: Like Xu <like.xu.linux@gmail.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220727233424.2968356-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-10 15:08:30 -04:00
Sean Christopherson
8bad4606ac KVM: x86/mmu: Add sanity check that MMIO SPTE mask doesn't overlap gen
Add compile-time and init-time sanity checks to ensure that the MMIO SPTE
mask doesn't overlap the MMIO SPTE generation or the MMU-present bit.
The generation currently avoids using bit 63, but that's as much
coincidence as it is strictly necessary.  That will change in the future,
as TDX support will require setting bit 63 (SUPPRESS_VE) in the mask.

Explicitly carve out the bits that are allowed in the mask so that any
future shuffling of SPTE bits doesn't silently break MMIO caching (KVM
has broken MMIO caching more than once due to overlapping the generation
with other things).
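The check itself boils down to "the masks must not intersect"; a
compile-time sketch with made-up bit positions:

#include <assert.h>

/* Hypothetical layout: generation spread over bits 3:4 and 52:61,
 * MMU-present in bit 11, MMIO value allowed only in bits 62:63. */
#define MMU_PRESENT_MASK        (1ull << 11)
#define MMIO_GEN_MASK           ((0x3ull << 3) | (0x3ffull << 52))
#define MMIO_VALUE_MASK         (3ull << 62)

static_assert((MMIO_VALUE_MASK & (MMIO_GEN_MASK | MMU_PRESENT_MASK)) == 0,
              "MMIO SPTE value overlaps the generation or MMU-present bit");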

Suggested-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Message-Id: <20220805194133.86299-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-10 15:08:26 -04:00
Mingwei Zhang
1685c0f325 KVM: x86/mmu: rename trace function name for asynchronous page fault
Rename the tracepoint function from trace_kvm_async_pf_doublefault() to
trace_kvm_async_pf_repeated_fault() to make it clear, since double fault
has nothing to do with this trace function.

Asynchronous Page Fault (APF) is an artifact generated by KVM when it
cannot find a physical page to satisfy an EPT violation. KVM uses APF to
tell the guest OS to do something else such as scheduling other guest
processes to make forward progress. However, when another guest process
also touches a previously APFed page, KVM halts the vCPU instead of
generating a repeated APF to avoid wasting cycles.

Double fault (#DF) clearly has a different meaning and a different
consequence when triggered. #DF requires two nested contributory exceptions
instead of two page faults faulting at the same address. A previous bug in
APF indicates that it may trigger a double fault in the guest [1], and
clearly this trace function has nothing to do with it. So renaming this
function is a valid choice.

No functional change intended.

[1] https://www.spinics.net/lists/kvm/msg214957.html

Signed-off-by: Mingwei Zhang <mizhang@google.com>
Message-Id: <20220807052141.69186-1-mizhang@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-10 15:08:26 -04:00
Coleman Dietsch
c036899136 KVM: x86/xen: Stop Xen timer before changing IRQ
Stop Xen timer (if it's running) prior to changing the IRQ vector and
potentially (re)starting the timer. Changing the IRQ vector while the
timer is still running can result in KVM injecting a garbage event, e.g.
kvm_xen_inject_timer_irqs() could see a non-zero xen.timer_pending from
a previous timer but inject the new xen.timer_virq.

Fixes: 5363952605 ("KVM: x86/xen: handle PV timers oneshot mode")
Cc: stable@vger.kernel.org
Link: https://syzkaller.appspot.com/bug?id=8234a9dfd3aafbf092cc5a7cd9842e3ebc45fc42
Reported-by: syzbot+e54f930ed78eb0f85281@syzkaller.appspotmail.com
Signed-off-by: Coleman Dietsch <dietschc@csp.edu>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Acked-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20220808190607.323899-3-dietschc@csp.edu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-10 15:08:25 -04:00
Coleman Dietsch
af735db312 KVM: x86/xen: Initialize Xen timer only once
Add a check for existing xen timers before initializing a new one.

Currently kvm_xen_init_timer() is called on every
KVM_XEN_VCPU_ATTR_TYPE_TIMER, which is causing the following ODEBUG
crash when vcpu->arch.xen.timer is already set.

ODEBUG: init active (active state 0)
object type: hrtimer hint: xen_timer_callbac0
RIP: 0010:debug_print_object+0x16e/0x250 lib/debugobjects.c:502
Call Trace:
__debug_object_init
debug_hrtimer_init
debug_init
hrtimer_init
kvm_xen_init_timer
kvm_xen_vcpu_set_attr
kvm_arch_vcpu_ioctl
kvm_vcpu_ioctl
vfs_ioctl
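The fix boils down to an "initialize only once" guard; a tiny sketch where a
hypothetical flag stands in for the real check of existing timer state:

#include <stdbool.h>

struct xen_vcpu_timer {
        bool initialized;
};

static void xen_timer_set_attr(struct xen_vcpu_timer *t)
{
        if (t->initialized)     /* repeated ATTR_TYPE_TIMER: skip re-init */
                return;

        /* the hrtimer_init() equivalent runs exactly once here */
        t->initialized = true;
}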

Fixes: 5363952605 ("KVM: x86/xen: handle PV timers oneshot mode")
Cc: stable@vger.kernel.org
Link: https://syzkaller.appspot.com/bug?id=8234a9dfd3aafbf092cc5a7cd9842e3ebc45fc42
Reported-by: syzbot+e54f930ed78eb0f85281@syzkaller.appspotmail.com
Signed-off-by: Coleman Dietsch <dietschc@csp.edu>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220808190607.323899-2-dietschc@csp.edu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-10 15:08:25 -04:00
Sean Christopherson
0c29397ac1 KVM: SVM: Disable SEV-ES support if MMIO caching is disabled
Disable SEV-ES if MMIO caching is disabled as SEV-ES relies on MMIO SPTEs
generating #NPF(RSVD), which are reflected by the CPU into the guest as
a #VC.  With SEV-ES, the untrusted host, a.k.a. KVM, doesn't have access
to the guest instruction stream or register state and so can't directly
emulate in response to a #NPF on an emulated MMIO GPA.  Disabling MMIO
caching means guest accesses to emulated MMIO ranges cause #NPF(!PRESENT),
and those flavors of #NPF cause automatic VM-Exits, not #VC.

Adjust KVM's MMIO masks to account for the C-bit location prior to doing
SEV(-ES) setup, and document that dependency between adjusting the MMIO
SPTE mask and SEV(-ES) setup.

Fixes: b09763da4d ("KVM: x86/mmu: Add module param to disable MMIO caching (for testing)")
Reported-by: Michael Roth <michael.roth@amd.com>
Tested-by: Michael Roth <michael.roth@amd.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220803224957.1285926-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-10 15:08:25 -04:00
Sean Christopherson
c3e0c8c2e8 KVM: x86/mmu: Fully re-evaluate MMIO caching when SPTE masks change
Fully re-evaluate whether or not MMIO caching can be enabled when SPTE
masks change; simply clearing enable_mmio_caching when a configuration
isn't compatible with caching fails to handle the scenario where the
masks are updated, e.g. by VMX for EPT or by SVM to account for the C-bit
location, and compatibility toggles from false=>true.

Snapshot the original module param so that re-evaluating MMIO caching
preserves userspace's desire to allow caching.  Use a snapshot approach
so that enable_mmio_caching still reflects KVM's actual behavior.
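In plain C the snapshot approach looks roughly like this (variable and
function names are illustrative, not the actual KVM code):

#include <stdbool.h>
#include <stdint.h>

static bool enable_mmio_caching = true;         /* what KVM actually does */
static bool mmio_caching_requested = true;      /* snapshot of the module param */

static void set_mmio_spte_mask(uint64_t mmio_value)
{
        /*
         * Re-derive the effective setting from the snapshot each time the
         * masks change, so a temporary "incompatible" configuration cannot
         * permanently disable caching once compatibility is restored.
         */
        enable_mmio_caching = mmio_caching_requested && mmio_value != 0;
}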

Fixes: 8b9e74bfbf ("KVM: x86/mmu: Use enable_mmio_caching to track if MMIO caching is enabled")
Reported-by: Michael Roth <michael.roth@amd.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: stable@vger.kernel.org
Tested-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Message-Id: <20220803224957.1285926-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-10 15:08:24 -04:00
Sean Christopherson
982bae43f1 KVM: x86: Tag kvm_mmu_x86_module_init() with __init
Mark kvm_mmu_x86_module_init() with __init, the entire reason it exists
is to initialize variables when kvm.ko is loaded, i.e. it must never be
called after module initialization.

Fixes: 1d0e848060 ("KVM: x86/mmu: Resolve nx_huge_pages when kvm.ko is loaded")
Cc: stable@vger.kernel.org
Reviewed-by: Kai Huang <kai.huang@intel.com>
Tested-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220803224957.1285926-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-10 15:08:24 -04:00
Michal Luczaj
4ac5b42377 KVM: x86: emulator: Fix illegal LEA handling
The emulator mishandles LEA with a register source operand. Even though such
an LEA is illegal, it can be encoded and fed to the CPU, in which case real
hardware throws #UD. The emulator, instead, returns the address of
x86_emulate_ctxt._regs. This info leak hurts the host's kASLR.

Tell the decoder that illegal LEA is not to be emulated.

Signed-off-by: Michal Luczaj <mhal@rbox.co>
Message-Id: <20220729134801.1120-1-mhal@rbox.co>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-10 15:08:23 -04:00
Yu Zhang
2bc685e633 KVM: X86: avoid uninitialized 'fault.async_page_fault' from fixed-up #PF
kvm_fixup_and_inject_pf_error() was introduced to fix up the error code
(e.g., to add the RSVD flag) and inject the #PF into the guest, when the
guest MAXPHYADDR is smaller than the host's.

When it comes to nested, L0 is expected to intercept and fix up the #PF
and then inject to L2 directly if
- L2.MAXPHYADDR < L0.MAXPHYADDR and
- L1 has no intention to intercept L2's #PF (e.g., L2 and L1 have the
  same MAXPHYADDR value && L1 is using EPT for L2),
instead of constructing a #PF VM Exit to L1. Currently, with PFEC_MASK
and PFEC_MATCH both set to 0 in vmcs02, the interception and injection
may happen on all L2 #PFs.

However, failing to initialize 'fault' in kvm_fixup_and_inject_pf_error()
may leave fault.async_page_fault non-zero, causing the #PF to later be
treated as a nested async page fault and then be injected to L1. Instead of
zeroing 'fault' at the beginning of this function, manually set the value of
'fault.async_page_fault', because false is the value we really expect.

Fixes: 897861479c ("KVM: x86: Add helper functions for illegal GPA checking and page fault injection")
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=216178
Reported-by: Yang Lixiao <lixiao.yang@intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220718074756.53788-1-yu.c.zhang@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-10 15:08:23 -04:00
Sean Christopherson
70c8327c11 KVM: x86: Bug the VM if an accelerated x2APIC trap occurs on a "bad" reg
Bug the VM if retrieving the x2APIC MSR/register while processing an
accelerated vAPIC trap VM-Exit fails.  In theory it's impossible for the
lookup to fail as hardware has already validated the register, but bugs
happen, and not checking the result of kvm_lapic_msr_read() would result
in consuming the uninitialized "val" if a KVM or hardware bug occurs.

Fixes: 1bd9dfec9f ("KVM: x86: Do not block APIC write for non ICR registers")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220804235028.1766253-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-10 15:08:23 -04:00
Paolo Bonzini
c3c28d24d9 KVM: x86: do not report preemption if the steal time cache is stale
Commit 7e2175ebd6 ("KVM: x86: Fix recording of guest steal time
/ preempted status", 2021-11-11) open coded the previous call to
kvm_map_gfn, but in doing so it dropped the comparison between the cached
guest physical address and the one in the MSR.  This causes an incorrect
cache hit if the guest modifies the steal time address while the memslots
remain the same.  This can happen with kexec, in which case the preempted
bit is written at the address used by the old kernel instead of
the new one.

Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: stable@vger.kernel.org
Fixes: 7e2175ebd6 ("KVM: x86: Fix recording of guest steal time / preempted status")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-10 15:08:22 -04:00
Paolo Bonzini
901d3765fa KVM: x86: revalidate steal time cache if MSR value changes
Commit 7e2175ebd6 ("KVM: x86: Fix recording of guest steal time
/ preempted status", 2021-11-11) open coded the previous call to
kvm_map_gfn, but in doing so it dropped the comparison between the cached
guest physical address and the one in the MSR.  This causes an incorrect
cache hit if the guest modifies the steal time address while the memslots
remain the same.  This can happen with kexec, in which case the steal
time data is written at the address used by the old kernel instead of
the new one.

While at it, rename the variable from gfn to gpa since it is a plain
physical address and not a right-shifted one.
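The restored check is essentially "only a cache hit if the cached gpa still
matches the MSR"; a small sketch with made-up types:

#include <stdbool.h>
#include <stdint.h>

struct steal_time_cache {
        bool     active;
        uint64_t gpa;   /* plain guest physical address, not a gfn */
};

static bool steal_time_cache_valid(const struct steal_time_cache *c,
                                   uint64_t msr_gpa)
{
        /* A stale cache (e.g. after kexec changed the MSR) is not a hit. */
        return c->active && c->gpa == msr_gpa;
}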

Reported-by: Dave Young <ruyang@redhat.com>
Reported-by: Xiaoying Yan  <yiyan@redhat.com>
Analyzed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: stable@vger.kernel.org
Fixes: 7e2175ebd6 ("KVM: x86: Fix recording of guest steal time / preempted status")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-10 15:08:22 -04:00
Linus Torvalds
5318b987fe More from the CPU vulnerability nightmares front:
Intel eIBRS machines do not sufficiently mitigate against RET
 mispredictions when doing a VM Exit therefore an additional RSB,
 one-entry stuffing is needed.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmLqsGsACgkQEsHwGGHe
 VUpXGg//ZEkxhf3Ri7X9PknAWNG6eIEqigKqWcdnOw+Oq/GMVb6q7JQsqowK7KBZ
 AKcY5c/KkljTJNohditnfSOePyCG5nDTPgfkjzIawnaVdyJWMRCz/L4X2cv6ykDl
 2l2EvQm4Ro8XAogYhE7GzDg/osaVfx93OkLCQj278VrEMWgM/dN2RZLpn+qiIkNt
 DyFlQ7cr5UASh/svtKLko268oT4JwhQSbDHVFLMJ52VaLXX36yx4rValZHUKFdox
 ZDyj+kiszFHYGsI94KAD0dYx76p6mHnwRc4y/HkVcO8vTacQ2b9yFYBGTiQatITf
 0Nk1RIm9m3rzoJ82r/U0xSIDwbIhZlOVNm2QtCPkXqJZZFhopYsZUnq2TXhSWk4x
 GQg/2dDY6gb/5MSdyLJmvrTUtzResVyb/hYL6SevOsIRnkwe35P6vDDyp15F3TYK
 YvidZSfEyjtdLISBknqYRQD964dgNZu9ewrj+WuJNJr+A2fUvBzUebXjxHREsugN
 jWp5GyuagEKTtneVCvjwnii+ptCm6yfzgZYLbHmmV+zhinyE9H1xiwVDvo5T7DDS
 ZJCBgoioqMhp5qR59pkWz/S5SNGui2rzEHbAh4grANy8R/X5ASRv7UHT9uAo6ve1
 xpw6qnE37CLzuLhj8IOdrnzWwLiq7qZ/lYN7m+mCMVlwRWobbOo=
 =a8em
 -----END PGP SIGNATURE-----

Merge tag 'x86_bugs_pbrsb' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 eIBRS fixes from Borislav Petkov:
 "More from the CPU vulnerability nightmares front:

  Intel eIBRS machines do not sufficiently mitigate against RET
  mispredictions when doing a VM Exit therefore an additional RSB,
  one-entry stuffing is needed"

* tag 'x86_bugs_pbrsb' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/speculation: Add LFENCE to RSB fill sequence
  x86/speculation: Add RSB VM Exit protections
2022-08-09 09:29:07 -07:00
Linus Torvalds
1d239c1eb8 IOMMU Updates for Linux v5.20/v6.0:
Including:
 
 	- Most intrusive patch is small and changes the default
 	  allocation policy for DMA addresses. Before the change the
 	  allocator tried its best to find an address in the first 4GB.
 	  But that led to performance problems when that space gets
 	  exhausted, and since most devices are capable of 64-bit DMA
 	  these days, we changed it to search in the full DMA-mask
 	  range from the beginning.  This change has the potential to
 	  uncover bugs elsewhere, in the kernel or the hardware. There
 	  is a Kconfig option and a command line option to restore the
 	  old behavior, but none of them is enabled by default.
 
 	- Add Robin Murphy as reviewer of IOMMU code and maintainer for
 	  the dma-iommu and iova code
 
 	- Changing IOVA magazine size from 1032 to 1024 bytes to save
 	  memory
 
 	- Some core code cleanups and dead-code removal
 
 	- Support for ACPI IORT RMR node
 
 	- Support for multiple PCI domains in the AMD-Vi driver
 
 	- ARM SMMU changes from Will Deacon:
 
 	  - Add even more Qualcomm device-tree compatible strings
 
 	  - Support dumping of IMP DEF Qualcomm registers on TLB sync
 	    timeout
 
 	  - Fix reference count leak on device tree node in Qualcomm
 	    driver
 
 	- Intel VT-d driver updates from Lu Baolu:
 
 	  - Make intel-iommu.h private
 
 	  - Optimize the use of two locks
 
 	  - Extend the driver to support large-scale platforms
 
 	  - Cleanup some dead code
 
 	- MediaTek IOMMU refactoring and support for TTBR up to 35bit
 
 	- Basic support for Exynos SysMMU v7
 
 	- VirtIO IOMMU driver gets a map/unmap_pages() implementation
 
 	- Other smaller cleanups and fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEr9jSbILcajRFYWYyK/BELZcBGuMFAmLs3DIACgkQK/BELZcB
 GuMizhAAguAnLLOkOLlR9/MhrTZfNXCUX+bfrEIevjFXMw4iPNfCCr4ydQ7EdVK6
 ZA/3Z89huYl0d0x/FELolnQi+HOeqYrfTDe4rB7TgNgwZnWa+fdHcyYkgBGyfPaV
 ilgjNcx8o//9o4NasyB6kU395jVmFxb735gMTTb+tcO9fr+/qIB6hxrHuCklxrNr
 C7wK6kkoDPi5n0QuXCSjXEx2Hk245pAWKPLwqxsUYzHGlLfl7ULOxw65BUBGvn/H
 uCsTfJFu7u+ErwQYf0qPuOwRBnRdsx9g5EAnfab8p074SoKWvbNnftIxgIRp8ZEM
 YgCbhYa1GOFI4r+XzqRzEbc0/vPSttims4Jqz0KxYs7pr5EoVifrWLJFjJdCdc2h
 Tio1gTvOq8HbH63kwYNKJhg4iSC6zVd37ihEhvfFO6LcgFl4iCfd2o9zK7oY40J4
 XoOxofVnJ2e3tzdhZ/n5quCXiudHixm6WuVa7QYKscF7Ud0tY1wWKuibdlMQTeNM
 68MvtlteKcfs1BrWzZyrFMrFeAfIY8LI82y6jdJuoNMU5LE9+5yelXBdJhnVygZ+
 Jglv1TIt6W/z1H5JgXtNVZ1wWgBm7rurOqNyfN8XCd8eP1z321CLfX8ujkhKrIWP
 ApG15cwvpnh1JX630+UFiEikTGU0fb2orMdPwYmwuu8DAsoLVHE=
 =hI2K
 -----END PGP SIGNATURE-----

Merge tag 'iommu-updates-v5.20-or-v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu

Pull iommu updates from Joerg Roedel:

 - The most intrusive patch is small and changes the default allocation
   policy for DMA addresses.

   Before the change the allocator tried its best to find an address in
   the first 4GB. But that led to performance problems when that space
   gets exhausted, and since most devices are capable of 64-bit DMA these
   days, we changed it to search in the full DMA-mask range from the
   beginning.

   This change has the potential to uncover bugs elsewhere, in the
   kernel or the hardware. There is a Kconfig option and a command line
   option to restore the old behavior, but none of them is enabled by
   default.

 - Add Robin Murphy as reviewer of IOMMU code and maintainer for the
   dma-iommu and iova code

 - Changing IOVA magazine size from 1032 to 1024 bytes to save memory

 - Some core code cleanups and dead-code removal

 - Support for ACPI IORT RMR node

 - Support for multiple PCI domains in the AMD-Vi driver

 - ARM SMMU changes from Will Deacon:
      - Add even more Qualcomm device-tree compatible strings
      - Support dumping of IMP DEF Qualcomm registers on TLB sync
        timeout
      - Fix reference count leak on device tree node in Qualcomm driver

 - Intel VT-d driver updates from Lu Baolu:
      - Make intel-iommu.h private
      - Optimize the use of two locks
      - Extend the driver to support large-scale platforms
      - Cleanup some dead code

 - MediaTek IOMMU refactoring and support for TTBR up to 35bit

 - Basic support for Exynos SysMMU v7

 - VirtIO IOMMU driver gets a map/unmap_pages() implementation

 - Other smaller cleanups and fixes

* tag 'iommu-updates-v5.20-or-v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (116 commits)
  iommu/amd: Fix compile warning in init code
  iommu/amd: Add support for AVIC when SNP is enabled
  iommu/amd: Simplify and Consolidate Virtual APIC (AVIC) Enablement
  ACPI/IORT: Fix build error implicit-function-declaration
  drivers: iommu: fix clang -wformat warning
  iommu/arm-smmu: qcom_iommu: Add of_node_put() when breaking out of loop
  iommu/arm-smmu-qcom: Add SM6375 SMMU compatible
  dt-bindings: arm-smmu: Add compatible for Qualcomm SM6375
  MAINTAINERS: Add Robin Murphy as IOMMU SUBSYTEM reviewer
  iommu/amd: Do not support IOMMUv2 APIs when SNP is enabled
  iommu/amd: Do not support IOMMU_DOMAIN_IDENTITY after SNP is enabled
  iommu/amd: Set translation valid bit only when IO page tables are in use
  iommu/amd: Introduce function to check and enable SNP
  iommu/amd: Globally detect SNP support
  iommu/amd: Process all IVHDs before enabling IOMMU features
  iommu/amd: Introduce global variable for storing common EFR and EFR2
  iommu/amd: Introduce Support for Extended Feature 2 Register
  iommu/amd: Change macro for IOMMU control register bit shift to decimal value
  iommu/exynos: Enable default VM instance on SysMMU v7
  iommu/exynos: Add SysMMU v7 register set
  ...
2022-08-06 10:42:38 -07:00
Linus Torvalds
6614a3c316 - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe
Lin, Yang Shi, Anshuman Khandual and Mike Rapoport
 
 - Some kmemleak fixes from Patrick Wang and Waiman Long
 
 - DAMON updates from SeongJae Park
 
 - memcg debug/visibility work from Roman Gushchin
 
 - vmalloc speedup from Uladzislau Rezki
 
 - more folio conversion work from Matthew Wilcox
 
 - enhancements for coherent device memory mapping from Alex Sierra
 
 - addition of shared pages tracking and CoW support for fsdax, from
   Shiyang Ruan
 
 - hugetlb optimizations from Mike Kravetz
 
 - Mel Gorman has contributed some pagealloc changes to improve latency
   and realtime behaviour.
 
 - mprotect soft-dirty checking has been improved by Peter Xu
 
 - Many other singleton patches all over the place
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYuravgAKCRDdBJ7gKXxA
 jpqSAQDrXSdII+ht9kSHlaCVYjqRFQz/rRvURQrWQV74f6aeiAD+NHHeDPwZn11/
 SPktqEUrF1pxnGQxqLh1kUFUhsVZQgE=
 =w/UH
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "Most of the MM queue. A few things are still pending.

  Liam's maple tree rework didn't make it. This has resulted in a few
  other minor patch series being held over for next time.

  Multi-gen LRU still isn't merged as we were waiting for mapletree to
  stabilize. The current plan is to merge MGLRU into -mm soon and to
  later reintroduce mapletree, with a view to hopefully getting both
  into 6.1-rc1.

  Summary:

   - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe
     Lin, Yang Shi, Anshuman Khandual and Mike Rapoport

   - Some kmemleak fixes from Patrick Wang and Waiman Long

   - DAMON updates from SeongJae Park

   - memcg debug/visibility work from Roman Gushchin

   - vmalloc speedup from Uladzislau Rezki

   - more folio conversion work from Matthew Wilcox

   - enhancements for coherent device memory mapping from Alex Sierra

   - addition of shared pages tracking and CoW support for fsdax, from
     Shiyang Ruan

   - hugetlb optimizations from Mike Kravetz

   - Mel Gorman has contributed some pagealloc changes to improve
     latency and realtime behaviour.

   - mprotect soft-dirty checking has been improved by Peter Xu

   - Many other singleton patches all over the place"

 [ XFS merge from hell as per Darrick Wong in

   https://lore.kernel.org/all/YshKnxb4VwXycPO8@magnolia/ ]

* tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (282 commits)
  tools/testing/selftests/vm/hmm-tests.c: fix build
  mm: Kconfig: fix typo
  mm: memory-failure: convert to pr_fmt()
  mm: use is_zone_movable_page() helper
  hugetlbfs: fix inaccurate comment in hugetlbfs_statfs()
  hugetlbfs: cleanup some comments in inode.c
  hugetlbfs: remove unneeded header file
  hugetlbfs: remove unneeded hugetlbfs_ops forward declaration
  hugetlbfs: use helper macro SZ_1{K,M}
  mm: cleanup is_highmem()
  mm/hmm: add a test for cross device private faults
  selftests: add soft-dirty into run_vmtests.sh
  selftests: soft-dirty: add test for mprotect
  mm/mprotect: fix soft-dirty check in can_change_pte_writable()
  mm: memcontrol: fix potential oom_lock recursion deadlock
  mm/gup.c: fix formatting in check_and_migrate_movable_page()
  xfs: fail dax mount if reflink is enabled on a partition
  mm/memcontrol.c: remove the redundant updating of stats_flush_threshold
  userfaultfd: don't fail on unrecognized features
  hugetlb_cgroup: fix wrong hugetlb cgroup numa stat
  ...
2022-08-05 16:32:45 -07:00
Daniel Sneddon
2b12993220 x86/speculation: Add RSB VM Exit protections
tl;dr: The Enhanced IBRS mitigation for Spectre v2 does not work as
documented for RET instructions after VM exits. Mitigate it with a new
one-entry RSB stuffing mechanism and a new LFENCE.

== Background ==

Indirect Branch Restricted Speculation (IBRS) was designed to help
mitigate Branch Target Injection and Speculative Store Bypass, i.e.
Spectre, attacks. IBRS prevents software run in less privileged modes
from affecting branch prediction in more privileged modes. IBRS requires
the MSR to be written on every privilege level change.

To overcome some of the performance issues of IBRS, Enhanced IBRS was
introduced.  eIBRS is an "always on" IBRS, in other words, just turn
it on once instead of writing the MSR on every privilege level change.
When eIBRS is enabled, more privileged modes should be protected from
less privileged modes, including protecting VMMs from guests.

== Problem ==

Here's a simplification of how guests are run on Linux' KVM:

void run_kvm_guest(void)
{
	// Prepare to run guest
	VMRESUME();
	// Clean up after guest runs
}

The execution flow for that would look something like this to the
processor:

1. Host-side: call run_kvm_guest()
2. Host-side: VMRESUME
3. Guest runs, does "CALL guest_function"
4. VM exit, host runs again
5. Host might make some "cleanup" function calls
6. Host-side: RET from run_kvm_guest()

Now, when back on the host, there are a couple of possible scenarios of
post-guest activity the host needs to do before executing host code:

* on pre-eIBRS hardware (legacy IBRS, or nothing at all), the RSB is not
touched and Linux has to do a 32-entry stuffing.

* on eIBRS hardware, VM exit with IBRS enabled, or restoring the host
IBRS=1 shortly after VM exit, has a documented side effect of flushing
the RSB except in this PBRSB situation where the software needs to stuff
the last RSB entry "by hand".

IOW, with eIBRS supported, host RET instructions should no longer be
influenced by guest behavior after the host retires a single CALL
instruction.

However, if the RET instructions are "unbalanced" with CALLs after a VM
exit as is the RET in #6, it might speculatively use the address for the
instruction after the CALL in #3 as an RSB prediction. This is a problem
since the (untrusted) guest controls this address.

Balanced CALL/RET instruction pairs such as in step #5 are not affected.

== Solution ==

The PBRSB issue affects a wide variety of Intel processors which
support eIBRS. But not all of them need mitigation. Today,
X86_FEATURE_RSB_VMEXIT triggers an RSB filling sequence that mitigates
PBRSB. Systems setting RSB_VMEXIT need no further mitigation - i.e.,
eIBRS systems which enable legacy IBRS explicitly.

However, such systems (X86_FEATURE_IBRS_ENHANCED) do not set RSB_VMEXIT
and most of them need a new mitigation.

Therefore, introduce a new feature flag X86_FEATURE_RSB_VMEXIT_LITE
which triggers a lighter-weight PBRSB mitigation versus RSB_VMEXIT.

The lighter-weight mitigation performs a CALL instruction which is
immediately followed by a speculative execution barrier (INT3). This
steers speculative execution to the barrier -- just like a retpoline
-- which ensures that speculation can never reach an unbalanced RET.
Then, ensure this CALL is retired before continuing execution with an
LFENCE.
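A hand-written sketch of that sequence as x86-64 inline assembly
(illustrative only; the kernel implements it as an assembler macro, and is
built with -mno-red-zone, which makes the transient CALL push safe):

static inline void pbrsb_stuff_one_entry(void)
{
	asm volatile("call 1f\n\t"	/* plant one RSB entry pointing at the INT3 */
		     "int3\n"		/* dead end for any speculative RET using it */
		     "1:\n\t"
		     "add $8, %%rsp\n\t"	/* drop the return address pushed by CALL */
		     "lfence"		/* ensure the CALL retires before going on */
		     : : : "memory");
}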

In other words, the window of exposure is opened at VM exit where RET
behavior is troublesome. While the window is open, force RSB predictions
sampling for RET targets to a dead end at the INT3. Close the window
with the LFENCE.

There is a subset of eIBRS systems which are not vulnerable to PBRSB.
Add these systems to the cpu_vuln_whitelist[] as NO_EIBRS_PBRSB.
Future systems that aren't vulnerable will set ARCH_CAP_PBRSB_NO.

  [ bp: Massage, incorporate review comments from Andy Cooper. ]

Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Co-developed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
2022-08-03 11:23:52 +02:00
Paolo Bonzini
31f6e3832a KVM: x86/mmu: remove unused variable
The last use of 'pfn' went away with the same-named argument to
host_pfn_mapping_level; now that the hugepage level is obtained
exclusively from the host page tables, kvm_mmu_zap_collapsible_spte
does not need to know host pfns at all.

Fixes: a8ac499bb6 ("KVM: x86/mmu: Don't require refcounted "struct page" to create huge SPTEs")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-01 07:30:00 -04:00
Paolo Bonzini
63f4b21041 Merge remote-tracking branch 'kvm/next' into kvm-next-5.20
KVM/s390, KVM/x86 and common infrastructure changes for 5.20

x86:

* Permit guests to ignore single-bit ECC errors

* Fix races in gfn->pfn cache refresh; do not pin pages tracked by the cache

* Intel IPI virtualization

* Allow getting/setting pending triple fault with KVM_GET/SET_VCPU_EVENTS

* PEBS virtualization

* Simplify PMU emulation by just using PERF_TYPE_RAW events

* More accurate event reinjection on SVM (avoid retrying instructions)

* Allow getting/setting the state of the speaker port data bit

* Refuse starting the kvm-intel module if VM-Entry/VM-Exit controls are inconsistent

* "Notify" VM exit (detect microarchitectural hangs) for Intel

* Cleanups for MCE MSR emulation

s390:

* add an interface to provide a hypervisor dump for secure guests

* improve selftests to use TAP interface

* enable interpretive execution of zPCI instructions (for PCI passthrough)

* First part of deferred teardown

* CPU Topology

* PV attestation

* Minor fixes

Generic:

* new selftests API using struct kvm_vcpu instead of a (vm, id) tuple

x86:

* Use try_cmpxchg64 instead of cmpxchg64

* Bugfixes

* Ignore benign host accesses to PMU MSRs when PMU is disabled

* Allow disabling KVM's "MONITOR/MWAIT are NOPs!" behavior

* x86/MMU: Allow NX huge pages to be disabled on a per-vm basis

* Port eager page splitting to shadow MMU as well

* Enable CMCI capability by default and handle injected UCNA errors

* Expose pid of vcpu threads in debugfs

* x2AVIC support for AMD

* cleanup PIO emulation

* Fixes for LLDT/LTR emulation

* Don't require refcounted "struct page" to create huge SPTEs

x86 cleanups:

* Use separate namespaces for guest PTEs and shadow PTEs bitmasks

* PIO emulation

* Reorganize rmap API, mostly around rmap destruction

* Do not workaround very old KVM bugs for L0 that runs with nesting enabled

* new selftests API for CPUID
2022-08-01 03:21:00 -04:00
Joerg Roedel
c10100a416 Merge branches 'arm/exynos', 'arm/mediatek', 'arm/msm', 'arm/smmu', 'virtio', 'x86/vt-d', 'x86/amd' and 'core' into next 2022-07-29 12:06:56 +02:00
Kai Huang
7edc3a6803 KVM, x86/mmu: Fix the comment around kvm_tdp_mmu_zap_leafs()
Now kvm_tdp_mmu_zap_leafs() only zaps leaf SPTEs but not any non-root
pages within that GFN range anymore, so the comment around it isn't
right.

Fix it by shifting the comment from tdp_mmu_zap_leafs() instead of
duplicating it, as tdp_mmu_zap_leafs() is static and is only called by
kvm_tdp_mmu_zap_leafs().

Opportunistically tweak the blurb about SPTEs being cleared to (a) say
"zapped" instead of "cleared" because "cleared" will be wrong if/when
KVM allows a non-zero value for non-present SPTE (i.e. for Intel TDX),
and (b) to clarify that a flush is needed if and only if a SPTE has been
zapped since MMU lock was last acquired.

Fixes: f47e5bbbc9 ("KVM: x86/mmu: Zap only TDP MMU leafs in zap range and mmu_notifier unmap")
Suggested-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Message-Id: <20220728030452.484261-1-kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 14:02:07 -04:00
Jarkko Sakkinen
6fac42f127 KVM: SVM: Dump Virtual Machine Save Area (VMSA) to klog
As Virtual Machine Save Area (VMSA) is essential in troubleshooting
attestation, dump it to the klog with the KERN_DEBUG level of priority.
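
A minimal sketch of such a dump (illustrative only; the "save" pointer and
the dump length are assumptions) could use the kernel's hex-dump helper:

  /* sketch: dump the VMSA page at KERN_DEBUG priority */
  pr_debug("Virtual Machine Save Area (VMSA):\n");
  print_hex_dump(KERN_DEBUG, "", DUMP_PREFIX_NONE, 16, 1,
                 save, PAGE_SIZE, false);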

Cc: Jarkko Sakkinen <jarkko@kernel.org>
Suggested-by: Harald Hoyer <harald@profian.com>
Signed-off-by: Jarkko Sakkinen <jarkko@profian.com>
Message-Id: <20220728050919.24113-1-jarkko@profian.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 14:02:06 -04:00
Sean Christopherson
6c6ab524cf KVM: x86/mmu: Treat NX as a valid SPTE bit for NPT
Treat the NX bit as valid when using NPT, as KVM will set the NX bit when
the NX huge page mitigation is enabled (mindblowing) and trigger the WARN
that fires on reserved SPTE bits being set.

KVM has required NX support for SVM since commit b26a71a1a5 ("KVM: SVM:
Refuse to load kvm_amd if NX support is not available") for exactly this
reason, but apparently it never occurred to anyone to actually test NPT
with the mitigation enabled.

  ------------[ cut here ]------------
  spte = 0x800000018a600ee7, level = 2, rsvd bits = 0x800f0000001fe000
  WARNING: CPU: 152 PID: 15966 at arch/x86/kvm/mmu/spte.c:215 make_spte+0x327/0x340 [kvm]
  Hardware name: Google, Inc. Arcadia_IT_80/Arcadia_IT_80, BIOS 10.48.0 01/27/2022
  RIP: 0010:make_spte+0x327/0x340 [kvm]
  Call Trace:
   <TASK>
   tdp_mmu_map_handle_target_level+0xc3/0x230 [kvm]
   kvm_tdp_mmu_map+0x343/0x3b0 [kvm]
   direct_page_fault+0x1ae/0x2a0 [kvm]
   kvm_tdp_page_fault+0x7d/0x90 [kvm]
   kvm_mmu_page_fault+0xfb/0x2e0 [kvm]
   npf_interception+0x55/0x90 [kvm_amd]
   svm_invoke_exit_handler+0x31/0xf0 [kvm_amd]
   svm_handle_exit+0xf6/0x1d0 [kvm_amd]
   vcpu_enter_guest+0xb6d/0xee0 [kvm]
   ? kvm_pmu_trigger_event+0x6d/0x230 [kvm]
   vcpu_run+0x65/0x2c0 [kvm]
   kvm_arch_vcpu_ioctl_run+0x355/0x610 [kvm]
   kvm_vcpu_ioctl+0x551/0x610 [kvm]
   __se_sys_ioctl+0x77/0xc0
   __x64_sys_ioctl+0x1d/0x20
   do_syscall_64+0x44/0xa0
   entry_SYSCALL_64_after_hwframe+0x46/0xb0
   </TASK>
  ---[ end trace 0000000000000000 ]---

Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220723013029.1753623-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:51:43 -04:00
Suravee Suthikulpanit
1bd9dfec9f KVM: x86: Do not block APIC write for non ICR registers
The commit 5413bcba7e ("KVM: x86: Add support for vICR APIC-write
VM-Exits in x2APIC mode") introduces logic to prevent APIC write
for offset other than ICR in kvm_apic_write_nodecode() function.
This breaks x2AVIC support, which requires KVM to trap and emulate
x2APIC MSR writes.

Therefore, remove the warning and modify the logic to allow the MSR write.

Fixes: 5413bcba7e ("KVM: x86: Add support for vICR APIC-write VM-Exits in x2APIC mode")
Cc: Zeng Guang <guang.zeng@intel.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Message-Id: <20220725053356.4275-1-suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:51:42 -04:00
Suravee Suthikulpanit
0a8735a6ac KVM: SVM: Do not virtualize MSR accesses for APIC LVTT register
AMD does not support APIC TSC-deadline timer mode. AVIC hardware
will generate a #GP fault when the guest kernel writes 1 to bit 18
of the APIC LVTT register (offset 0x32) to set the timer mode.
(Note: bit 18 is reserved on AMD systems.)

Therefore, always intercept and let KVM emulate the MSR accesses.

Fixes: f3d7c8aa6882 ("KVM: SVM: Fix x2APIC MSRs interception")
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Message-Id: <20220725033428.3699-1-suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:51:42 -04:00
Sean Christopherson
a910b5ab6b KVM: nVMX: Set UMIP bit CR4_FIXED1 MSR when emulating UMIP
Make UMIP an "allowed-1" bit CR4_FIXED1 MSR when KVM is emulating UMIP.
KVM emulates UMIP for both L1 and L2, and so should enumerate that L2 is
allowed to have CR4.UMIP=1.  Not setting the bit doesn't immediately
break nVMX, as KVM does set/clear the bit in CR4_FIXED1 in response to a
guest CPUID update, i.e. KVM will correctly (dis)allow nested VM-Entry
based on whether or not UMIP is exposed to L1.  That said, KVM should
enumerate the bit as being allowed from time zero, e.g. userspace will
see the wrong value if the MSR is read before CPUID is written.
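
As an illustration (not the patch itself; the field and helper names are
assumptions about the usual nested VMX MSR plumbing), the change amounts
to marking the bit allowed-1 whenever KVM emulates UMIP:

  /* sketch: CR4.UMIP is allowed-1 for L2 whenever KVM can emulate UMIP */
  if (vmx_umip_emulated())
          msrs->cr4_fixed1 |= X86_CR4_UMIP;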

Fixes: 0367f205a3 ("KVM: vmx: add support for emulating UMIP")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220607213604.3346000-12-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:25:01 -04:00
Paolo Bonzini
9389d5774a Revert "KVM: nVMX: Expose load IA32_PERF_GLOBAL_CTRL VM-{Entry,Exit} control"
This reverts commit 03a8871add.

Since commit 03a8871add ("KVM: nVMX: Expose load IA32_PERF_GLOBAL_CTRL
VM-{Entry,Exit} control"), KVM has taken ownership of the "load
IA32_PERF_GLOBAL_CTRL" VMX entry/exit control bits, trying to set these
bits in the IA32_VMX_TRUE_{ENTRY,EXIT}_CTLS MSRs if the guest's CPUID
supports the architectural PMU (CPUID[EAX=0Ah].EAX[7:0]=1), and clear
otherwise.

This was a misguided attempt at mimicking what commit 5f76f6f5ff
("KVM: nVMX: Do not expose MPX VMX controls when guest MPX disabled",
2018-10-01) did for MPX.  However, that commit was a workaround for
another KVM bug and not something that should be imitated.  Mucking with
the VMX MSRs creates a subtle, difficult to maintain ABI as KVM must
ensure that any internal changes, e.g. to how KVM handles _any_ guest
CPUID changes, yield the same functional result.  Therefore, KVM's policy
is to let userspace have full control of the guest vCPU model so long
as the host kernel is not at risk.

Now that KVM really truly ensures kvm_set_msr() will succeed by loading
PERF_GLOBAL_CTRL if and only if it exists, revert KVM's misguided and
roundabout behavior.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[sean: make it a pure revert]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220722224409.1336532-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:24:41 -04:00
Sean Christopherson
4496a6f9b4 KVM: nVMX: Attempt to load PERF_GLOBAL_CTRL on nVMX xfer iff it exists
Attempt to load PERF_GLOBAL_CTRL during nested VM-Enter/VM-Exit if and
only if the MSR exists (according to the guest vCPU model).  KVM has very
misguided handling of VM_{ENTRY,EXIT}_LOAD_IA32_PERF_GLOBAL_CTRL and
attempts to force the nVMX MSR settings to match the vPMU model, i.e. to
hide/expose the control based on whether or not the MSR exists from the
guest's perspective.

KVM's modifications fail to handle the scenario where the vPMU is hidden
from the guest _after_ being exposed to the guest, e.g. by userspace
doing multiple KVM_SET_CPUID2 calls, which is allowed if done before any
KVM_RUN.  nested_vmx_pmu_refresh() is called if and only if there's a
recognized vPMU, i.e. KVM will leave the bits in the allow state and then
ultimately reject the MSR load and WARN.

KVM should not force the VMX MSRs in the first place.  KVM taking control
of the MSRs was a misguided attempt at mimicking what commit 5f76f6f5ff
("KVM: nVMX: Do not expose MPX VMX controls when guest MPX disabled",
2018-10-01) did for MPX.  However, the MPX commit was a workaround for
another KVM bug and not something that should be imitated (and it should
never have been done in the first place).

In other words, KVM's ABI _should_ be that userspace has full control
over the MSRs, at which point triggering the WARN that loading the MSR
must not fail is trivial.

The intent of the WARN is still valid; KVM has consistency checks to
ensure that vmcs12->{guest,host}_ia32_perf_global_ctrl is valid.  The
problem is that '0' must be considered a valid value at all times, and so
the simple/obvious solution is to just not actually load the MSR when it
does not exist.  It is userspace's responsibility to provide a sane vCPU
model, i.e. KVM is well within its ABI and Intel's VMX architecture to
skip the loads if the MSR does not exist.

Fixes: 03a8871add ("KVM: nVMX: Expose load IA32_PERF_GLOBAL_CTRL VM-{Entry,Exit} control")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220722224409.1336532-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:24:22 -04:00
Sean Christopherson
b663f0b5f3 KVM: VMX: Add helper to check if the guest PMU has PERF_GLOBAL_CTRL
Add a helper to check if the guest PMU has PERF_GLOBAL_CTRL, which is
unintuitive _and_ diverges from Intel's architecturally defined behavior.
Even worse, KVM currently implements the check using two different (but
equivalent) checks, _and_ there has been at least one attempt to add a
_third_ flavor.
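
A sketch of what such a helper boils down to (the name is made up; the
real helper may differ), given that PERF_GLOBAL_CTRL exists from
architectural PMU version 2 onward:

  /* sketch: PERF_GLOBAL_CTRL is architectural for PMU version >= 2 */
  static inline bool guest_pmu_has_perf_global_ctrl(struct kvm_pmu *pmu)
  {
          return pmu->version > 1;
  }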

Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220722224409.1336532-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:23:57 -04:00
Sean Christopherson
93255bf929 KVM: VMX: Mark all PERF_GLOBAL_(OVF)_CTRL bits reserved if there's no vPMU
Mark all MSR_CORE_PERF_GLOBAL_CTRL and MSR_CORE_PERF_GLOBAL_OVF_CTRL bits
as reserved if there is no guest vPMU.  The nVMX VM-Entry consistency
checks do not check for a valid vPMU prior to consuming the masks via
kvm_valid_perf_global_ctrl(), i.e. may incorrectly allow a non-zero mask
to be loaded via VM-Enter or VM-Exit (well, attempted to be loaded, the
actual MSR load will be rejected by intel_is_valid_msr()).

Fixes: f5132b0138 ("KVM: Expose a version 2 architectural PMU to a guests")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220722224409.1336532-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:23:40 -04:00
Paolo Bonzini
8805875aa4 Revert "KVM: nVMX: Do not expose MPX VMX controls when guest MPX disabled"
Since commit 5f76f6f5ff ("KVM: nVMX: Do not expose MPX VMX controls
when guest MPX disabled"), KVM has taken ownership of the "load
IA32_BNDCFGS" and "clear IA32_BNDCFGS" VMX entry/exit controls,
trying to set these bits in the IA32_VMX_TRUE_{ENTRY,EXIT}_CTLS
MSRs if the guest's CPUID supports MPX, and clear otherwise.

The intent of the patch was to apply it to L0 in order to work around
L1 kernels that lack the fix in commit 691bd4340b ("kvm: vmx: allow
host to access guest MSR_IA32_BNDCFGS", 2017-07-04): by hiding the
control bits from L0, L1 hides BNDCFGS from KVM_GET_MSR_INDEX_LIST,
and the L1 bug is neutralized even in the lack of commit 691bd4340b.

This was perhaps a sensible kludge at the time, but a horrible
idea in the long term and in fact it has not been extended to
other CPUID bits like these:

  X86_FEATURE_LM => VM_EXIT_HOST_ADDR_SPACE_SIZE, VM_ENTRY_IA32E_MODE,
                    VMX_MISC_SAVE_EFER_LMA

  X86_FEATURE_TSC => CPU_BASED_RDTSC_EXITING, CPU_BASED_USE_TSC_OFFSETTING,
                     SECONDARY_EXEC_TSC_SCALING

  X86_FEATURE_INVPCID_SINGLE => SECONDARY_EXEC_ENABLE_INVPCID

  X86_FEATURE_MWAIT => CPU_BASED_MONITOR_EXITING, CPU_BASED_MWAIT_EXITING

  X86_FEATURE_INTEL_PT => SECONDARY_EXEC_PT_CONCEAL_VMX, SECONDARY_EXEC_PT_USE_GPA,
                          VM_EXIT_CLEAR_IA32_RTIT_CTL, VM_ENTRY_LOAD_IA32_RTIT_CTL

  X86_FEATURE_XSAVES => SECONDARY_EXEC_XSAVES

These days it's sort of common knowledge that any MSR in
KVM_GET_MSR_INDEX_LIST must allow *at least* setting it with KVM_SET_MSR
to a default value, so it is unlikely that something like commit
5f76f6f5ff will be needed again.  So revert it, at the potential cost
of breaking L1s with a 6 year old kernel.  While in principle the L0 owner
doesn't control what runs on L1, such an old hypervisor would probably
have many other bugs.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:22:28 -04:00
Sean Christopherson
f8ae08f978 KVM: nVMX: Let userspace set nVMX MSR to any _host_ supported value
Restrict the nVMX MSRs based on KVM's config, not based on the guest's
current config.  Using the guest's config to audit the new config
prevents userspace from restoring the original config (KVM's config) if
at any point in the past the guest's config was restricted in any way.

Fixes: 62cc6b9dc6 ("KVM: nVMX: support restore of VMX capability MSRs")
Cc: stable@vger.kernel.org
Cc: David Matlack <dmatlack@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220607213604.3346000-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:22:28 -04:00
Sean Christopherson
a645c2b506 KVM: nVMX: Rename handle_vm{on,off}() to handle_vmx{on,off}()
Rename the exit handlers for VMXON and VMXOFF to match the instruction
names; the terms "vmon" and "vmoff" are not used anywhere in Intel's
documentation, nor are they used elsewhere in KVM.

Sadly, the exit reasons are exposed to userspace and so cannot be renamed
without breaking userspace. :-(

Fixes: ec378aeef9 ("KVM: nVMX: Implement VMXON and VMXOFF")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220607213604.3346000-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:22:27 -04:00
Sean Christopherson
c7d855c2af KVM: nVMX: Inject #UD if VMXON is attempted with incompatible CR0/CR4
Inject a #UD if L1 attempts VMXON with a CR0 or CR4 that is disallowed
per the associated nested VMX MSRs' fixed0/1 settings.  KVM cannot rely
on hardware to perform the checks, even for the few checks that have
higher priority than VM-Exit, as (a) KVM may have forced CR0/CR4 bits in
hardware while running the guest, (b) there may be incompatible CR0/CR4 bits
that have lower priority than VM-Exit, e.g. CR0.NE, and (c) userspace may
have further restricted the allowed CR0/CR4 values by manipulating the
guest's nested VMX MSRs.
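
For reference, a sketch of the standard VMX fixed-bit rule that such
checks implement (the helper name is illustrative): every FIXED0 bit must
be set, and no bit outside FIXED1 may be set.

  /* sketch: CR value validity against the fixed0/fixed1 MSR pair */
  static bool cr_fixed_bits_valid(u64 val, u64 fixed0, u64 fixed1)
  {
          return (val & fixed0) == fixed0 && (val & ~fixed1) == 0;
  }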

Note, despite a very strong desire to throw shade at Jim, commit
70f3aac964 ("kvm: nVMX: Remove superfluous VMX instruction fault checks")
is not to blame for the buggy behavior (though the comment...).  That
commit only removed the CR0.PE, EFLAGS.VM, and COMPATIBILITY mode checks
(though it did erroneously drop the CPL check, but that has already been
remedied).  KVM may force CR0.PE=1, but will do so only when also
forcing EFLAGS.VM=1 to emulate Real Mode, i.e. hardware will still #UD.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=216033
Fixes: ec378aeef9 ("KVM: nVMX: Implement VMXON and VMXOFF")
Reported-by: Eric Li <ercli@ucdavis.edu>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220607213604.3346000-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:22:26 -04:00
Sean Christopherson
ca58f3aa53 KVM: nVMX: Account for KVM reserved CR4 bits in consistency checks
Check that the guest (L2) and host (L1) CR4 values that would be loaded
by nested VM-Enter and VM-Exit respectively are valid with respect to
KVM's (L0 host) allowed CR4 bits.  Failure to check KVM reserved bits
would allow L1 to load an illegal CR4 (or trigger hardware VM-Fail or
failed VM-Entry) by massaging guest CPUID to allow features that are not
supported by KVM.  Amusingly, KVM itself is an accomplice in its doom, as
KVM adjusts L1's MSR_IA32_VMX_CR4_FIXED1 to allow L1 to enable bits for
L2 based on L1's CPUID model.

Note, although nested_{guest,host}_cr4_valid() are _currently_ used if
and only if the vCPU is post-VMXON (nested.vmxon == true), that may not
be true in the future, e.g. emulating VMXON has a bug where it doesn't
check the allowed/required CR0/CR4 bits.

Cc: stable@vger.kernel.org
Fixes: 3899152ccb ("KVM: nVMX: fix checks on CR{0,4} during virtual VMX operation")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220607213604.3346000-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:22:26 -04:00
Sean Christopherson
c33f6f2228 KVM: x86: Split kvm_is_valid_cr4() and export only the non-vendor bits
Split the common x86 parts of kvm_is_valid_cr4(), i.e. the reserved bits
checks, into a separate helper, __kvm_is_valid_cr4(), and export only the
inner helper to vendor code in order to prevent nested VMX from calling
back into vmx_is_valid_cr4() via kvm_is_valid_cr4().

On SVM, this is a nop as SVM doesn't place any additional restrictions on
CR4.

On VMX, this is also currently a nop, but only because nested VMX is
missing checks on reserved CR4 bits for nested VM-Enter.  That bug will
be fixed in a future patch, and could simply use kvm_is_valid_cr4() as-is,
but nVMX has _another_ bug where VMXON emulation doesn't enforce VMX's
restrictions on CR0/CR4.  The cleanest and most intuitive way to fix the
VMXON bug is to use nested_host_cr{0,4}_valid().  If the CR4 variant
routes through kvm_is_valid_cr4(), using nested_host_cr4_valid() won't do
the right thing for the VMXON case as vmx_is_valid_cr4() enforces VMX's
restrictions if and only if the vCPU is post-VMXON.

Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220607213604.3346000-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:22:25 -04:00
Sean Christopherson
85f44f8cc0 KVM: x86/mmu: Don't bottom out on leafs when zapping collapsible SPTEs
When zapping collapsible SPTEs in the TDP MMU, don't bottom out on a leaf
SPTE now that KVM doesn't require a PFN to compute the host mapping level,
i.e. now that there's no need to first find a leaf SPTE and then step
back up.

Drop the now unused tdp_iter_step_up(), as it is not the safest of
helpers (using any of the low level iterators requires some understanding
of the various side effects).

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220715232107.3775620-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:22:24 -04:00
Sean Christopherson
65e3b446bc KVM: x86/mmu: Document the "rules" for using host_pfn_mapping_level()
Add a comment to document how host_pfn_mapping_level() can be used safely,
as the line between safe and dangerous is quite thin.  E.g. if KVM were
to ever support in-place promotion to create huge pages, consuming the
level is safe if the caller holds mmu_lock and checks that there's an
existing _leaf_ SPTE, but unsafe if the caller only checks that there's a
non-leaf SPTE.

Opportunistically tweak the existing comments to explicitly document why
KVM needs to use READ_ONCE().

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220715232107.3775620-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:22:23 -04:00
Sean Christopherson
a8ac499bb6 KVM: x86/mmu: Don't require refcounted "struct page" to create huge SPTEs
Drop the requirement that a pfn be backed by a refcounted, compound, or
ZONE_DEVICE struct page, and instead rely solely on the host page
tables to identify huge pages.  The PageCompound() check is a remnant of
an old implementation that identified (well, attempted to identify) huge
pages without walking the host page tables.  The ZONE_DEVICE check was
added as an exception to the PageCompound() requirement.  In other words,
neither check is actually a hard requirement; if the primary MMU has a pfn
backed with a huge page, then KVM can back the pfn with a huge page
regardless of the backing store.

Dropping the @pfn parameter will also allow KVM to query the max host
mapping level without having to first get the pfn, which is advantageous
for use outside of the page fault path where KVM wants to take action if
and only if a page can be mapped huge, i.e. avoids the pfn lookup for
gfns that can't be backed with a huge page.

Cc: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Message-Id: <20220715232107.3775620-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:22:22 -04:00
Sean Christopherson
d5e90a6998 KVM: x86/mmu: Restrict mapping level based on guest MTRR iff they're used
Restrict the mapping level for SPTEs based on the guest MTRRs if and only
if KVM may actually use the guest MTRRs to compute the "real" memtype.
For all forms of paging, guest MTRRs are purely virtual in the sense that
they are completely ignored by hardware, i.e. they affect the memtype
only if software manually consumes them.  The only scenario where KVM
consumes the guest MTRRs is when shadow_memtype_mask is non-zero and the
guest has non-coherent DMA, in all other cases KVM simply leaves the PAT
field in SPTEs as '0' to encode WB memtype.

Note, KVM may still ultimately ignore guest MTRRs, e.g. if the backing
pfn is host MMIO, but false positives are ok as they only cause a slight
performance blip (unless the guest is doing weird things with its MTRRs,
which is extremely unlikely).

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220715230016.3762909-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:22:22 -04:00
Sean Christopherson
38bf9d7bf2 KVM: x86/mmu: Add shadow mask for effective host MTRR memtype
Add shadow_memtype_mask to capture that EPT needs a non-zero memtype mask
instead of relying on TDP being enabled, as NPT doesn't need a non-zero
mask.  This is a glorified nop as kvm_x86_ops.get_mt_mask() returns zero
for NPT anyways.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220715230016.3762909-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:22:21 -04:00
Sean Christopherson
82ffad2ddf KVM: x86: Drop unnecessary goto+label in kvm_arch_init()
Return directly if kvm_arch_init() detects an error before doing any real
work, jumping through a label obfuscates what's happening and carries the
unnecessary risk of leaving 'r' uninitialized.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220715230016.3762909-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:22:20 -04:00
Sean Christopherson
94bda2f4cd KVM: x86: Reject loading KVM if host.PAT[0] != WB
Reject KVM if entry '0' in the host's IA32_PAT MSR is not programmed to
writeback (WB) memtype.  KVM subtly relies on IA32_PAT entry '0' to be
programmed to WB by leaving the PAT bits in shadow paging and NPT SPTEs
as '0'.  If something other than WB is in PAT[0], at _best_ guests will
suffer very poor performance, and at worst KVM will crash the system by
breaking cache-coherency expectations (e.g. using WC for guest memory).
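
A sketch of the load-time check (illustrative; the error path and message
are assumptions), relying on WB being encoding 6 in the PAT:

  /* sketch: PAT entry 0 lives in bits 2:0 of MSR_IA32_CR_PAT, WB == 6 */
  u64 host_pat;

  rdmsrl(MSR_IA32_CR_PAT, host_pat);
  if ((host_pat & 0x7) != 0x6) {
          pr_err("kvm: host PAT[0] is not WB\n");
          return -EIO;
  }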

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220715230016.3762909-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:22:20 -04:00
Suravee Suthikulpanit
01e69cef63 KVM: SVM: Fix x2APIC MSRs interception
The index for svm_direct_access_msrs was incorrectly initialized with
the APIC MMIO register macros. Fix by introducing a macro for calculating
x2APIC MSRs.
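
The underlying relationship is that each 16-byte xAPIC MMIO register maps
to one x2APIC MSR starting at 0x800, e.g. (macro name illustrative):

  /* sketch: x2APIC MSR index derived from the xAPIC MMIO offset */
  #define X2APIC_MSR(mmio_offset)  (0x800 + ((mmio_offset) >> 4))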

Fixes: 5c127c8547 ("KVM: SVM: Adding support for configuring x2APIC MSRs interception")
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Message-Id: <20220718083833.222117-1-suravee.suthikulpanit@amd.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:22:19 -04:00
Sean Christopherson
3c2e10373e KVM: x86/mmu: Remove underscores from __pte_list_remove()
Remove the underscores from __pte_list_remove(), the function formerly
known as pte_list_remove() is now named kvm_zap_one_rmap_spte() to show
that it zaps rmaps/PTEs, i.e. doesn't just remove an entry from a list.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220715224226.3749507-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:22:18 -04:00
Sean Christopherson
9202aee816 KVM: x86/mmu: Rename pte_list_{destroy,remove}() to show they zap SPTEs
Rename pte_list_remove() and pte_list_destroy() to kvm_zap_one_rmap_spte()
and kvm_zap_all_rmap_sptes() respectively to document that (a) they zap
SPTEs and (b) to better document how they differ (remove vs. destroy does
not exactly scream "one vs. all").

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220715224226.3749507-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:22:18 -04:00
Sean Christopherson
f8480721a7 KVM: x86/mmu: Rename rmap zap helpers to eliminate "unmap" wrapper
Rename kvm_unmap_rmap() and kvm_zap_rmap() to kvm_zap_rmap() and
__kvm_zap_rmap() respectively to show that what was the "unmap" helper is
just a wrapper for the "zap" helper, i.e. that they do the exact same
thing, one just exists to deal with its caller passing in more params.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220715224226.3749507-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:22:17 -04:00
Sean Christopherson
2833eda0e2 KVM: x86/mmu: Rename __kvm_zap_rmaps() to align with other nomenclature
Rename __kvm_zap_rmaps() to kvm_rmap_zap_gfn_range() to avoid future
confusion with a soon-to-be-introduced __kvm_zap_rmap().  Using a plural
"rmaps" is somewhat ambiguous without additional context, as it's not
obvious whether it's referring to multiple rmap lists, versus multiple
rmap entries within a single list.

Use kvm_rmap_zap_gfn_range() to align with the pattern established by
kvm_rmap_zap_collapsible_sptes(), without losing the information that it
zaps only rmap-based MMUs, i.e. don't rename it to __kvm_zap_gfn_range().

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220715224226.3749507-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:22:16 -04:00
Sean Christopherson
aed02fe3ca KVM: x86/mmu: Drop the "p is for pointer" from rmap helpers
Drop the trailing "p" from rmap helpers, i.e. rename functions to simply
be kvm_<action>_rmap().  Declaring that a function takes a pointer is
completely unnecessary and goes against kernel style.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220715224226.3749507-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:22:16 -04:00
Sean Christopherson
a42989e7fb KVM: x86/mmu: Directly "destroy" PTE list when recycling rmaps
Use pte_list_destroy() directly when recycling rmaps instead of bouncing
through kvm_unmap_rmapp() and kvm_zap_rmapp().  Calling kvm_unmap_rmapp()
is unnecessary and odd as it requires passing dummy parameters; passing
NULL for @slot when __rmap_add() already has a valid slot is especially
weird and confusing.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220715224226.3749507-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:22:15 -04:00
Sean Christopherson
35d539c3e4 KVM: x86/mmu: Return a u64 (the old SPTE) from mmu_spte_clear_track_bits()
Return a u64, not an int, from mmu_spte_clear_track_bits().  The return
value is the old SPTE value, which is very much a 64-bit value.  The sole
caller that consumes the return value, drop_spte(), already uses a u64.
The only reason that truncating the SPTE value is not problematic is
because drop_spte() only queries the shadow-present bit, which is in the
lower 32 bits.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220715224226.3749507-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:22:15 -04:00
Maciej S. Szmigiero
da0b93d65e KVM: nSVM: Pull CS.Base from actual VMCB12 for soft int/ex re-injection
enter_svm_guest_mode() first calls nested_vmcb02_prepare_control() to copy
control fields from VMCB12 to the current VMCB, then
nested_vmcb02_prepare_save() to perform a similar copy of the save area.

This means that nested_vmcb02_prepare_control() still runs with the
previous save area values in the current VMCB so it shouldn't take the L2
guest CS.Base from this area.

Explicitly pull CS.Base from the actual VMCB12 instead in
enter_svm_guest_mode().

Granted, having a non-zero CS.Base is a very rare thing (and even
impossible in 64-bit mode), having it change between nested VMRUNs is
probably even rarer, but if it happens it would create a really subtle bug
so it's better to fix it upfront.

Fixes: 6ef88d6e36 ("KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction")
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <4caa0f67589ae3c22c311ee0e6139496902f2edc.1658159083.git.maciej.szmigiero@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:22:14 -04:00
Aaron Lewis
cf5029d5dd KVM: x86: Protect the unused bits in MSR exiting flags
The flags for KVM_CAP_X86_USER_SPACE_MSR and KVM_X86_SET_MSR_FILTER
have no protection for their unused bits.  Without protection, future
development for these features will be difficult.  Add the protection
needed to make it possible to extend these features in the future.
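
The protection is essentially a reserved-bits check on the flags word; a
sketch (the mask name and the location of the flags are assumptions):

  /* sketch: reject any flag bit that KVM does not know about yet */
  if (cap->args[0] & ~KVM_MSR_EXIT_REASON_VALID_MASK)
          return -EINVAL;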

Signed-off-by: Aaron Lewis <aaronlewis@google.com>
Message-Id: <20220714161314.1715227-1-aaronlewis@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-19 14:04:18 -04:00
Paolo Bonzini
7962918160 KVM: emulate: do not adjust size of fastop and setcc subroutines
Instead of doing complicated calculations to find the size of the subroutines
(which are even more complicated because they need to be stringified into
an asm statement), just hardcode it to 16.

It is less dense for a few combinations of IBT/SLS/retbleed, but it has
the advantage of being really simple.
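
In effect the calculation collapses to a single constant (sketch):

  /* sketch: one fixed slot size, large enough for any fastop variant */
  #define FASTOP_SIZE 16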

Cc: stable@vger.kernel.org # 5.15.x: 84e7051c0b: x86/kvm: fix FASTOP_SIZE when return thunks are enabled
Cc: stable@vger.kernel.org
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-15 07:49:40 -04:00
Lu Baolu
bfd39a7387 KVM: x86: Remove unnecessary include
intel-iommu.h is not needed in kvm/x86 anymore. Remove its include.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Link: https://lore.kernel.org/r/20220514014322.2927339-6-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2022-07-15 10:21:30 +02:00
Sean Christopherson
8031d87aa9 KVM: x86: Check target, not vCPU's x2APIC ID, when applying hotplug hack
When applying the hotplug hack to match x2APIC IDs for vCPUs in xAPIC
mode, check the target APID ID for being unaddressable in xAPIC mode
instead of checking the vCPU's x2APIC ID, and in that case proceed as
if apic_x2apic_mode(vcpu) were true.

Functionally, it does not matter whether you compare kvm_x2apic_id(apic)
or mda with 0xff, since the two values are then checked for equality.
But in isolation, checking the x2APIC ID takes an unnecessary dependency
on the x2APIC ID being read-only (which isn't strictly true on AMD CPUs,
and is difficult to document as well); it also requires KVM to fallthrough
and check the xAPIC ID as well to deal with a writable xAPIC ID, whereas
the xAPIC ID _can't_ match a target ID greater than 0xff.

Opportunistically reword the comment to call out the various subtleties,
and to fix a typo reported by Zhang Jiaming.

No functional change intended.

Cc: Zhang Jiaming <jiaming@nfschina.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-14 12:26:30 -04:00
Vitaly Kuznetsov
8a414f943f KVM: x86: Fully initialize 'struct kvm_lapic_irq' in kvm_pv_kick_cpu_op()
'vector' and 'trig_mode' fields of 'struct kvm_lapic_irq' are left
uninitialized in kvm_pv_kick_cpu_op(). While these fields are normally
not needed for APIC_DM_REMRD, they're still referenced by
__apic_accept_irq() for trace_kvm_apic_accept_irq(). Fully initialize
the structure to avoid consuming random stack memory.
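
A sketch of the fix (field names and values are assumptions): use a
designated initializer so every unnamed field, including 'vector' and
'trig_mode', starts out as zero.

  /* sketch: unnamed fields are implicitly zeroed */
  struct kvm_lapic_irq lapic_irq = {
          .delivery_mode = APIC_DM_REMRD,
          .dest_mode     = APIC_DEST_PHYSICAL,
          .dest_id       = apicid,
  };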

Fixes: a183b638b6 ("KVM: x86: make apic_accept_irq tracepoint more generic")
Reported-by: syzbot+d6caa905917d353f0d07@syzkaller.appspotmail.com
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220708125147.593975-1-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-14 12:09:43 -04:00
Sean Christopherson
ba28401bb9 KVM: x86: Restrict get_mt_mask() to a u8, use KVM_X86_OP_OPTIONAL_RET0
Restrict get_mt_mask() to a u8 and reintroduce using a RET0 static_call
for the SVM implementation.  EPT stores the memtype information in the
lower 8 bits (bits 6:3 to be precise), and even returns a shifted u8
without an explicit cast to a larger type; there's no need to return a
full u64.

Note, RET0 doesn't play nice with a u64 return on 32-bit kernels, see
commit bf07be36cd ("KVM: x86: do not use KVM_X86_OP_OPTIONAL_RET0 for
get_mt_mask").

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220714153707.3239119-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-14 11:43:12 -04:00
Sean Christopherson
277ad7d586 KVM: x86: Add dedicated helper to get CPUID entry with significant index
Add a second CPUID helper, kvm_find_cpuid_entry_index(), to handle KVM
queries for CPUID leaves whose index _may_ be significant, and drop the
index param from the existing kvm_find_cpuid_entry().  Add a WARN in the
inner helper, cpuid_entry2_find(), to detect attempts to retrieve a CPUID
entry whose index is significant without explicitly providing an index.

Using an explicit magic number and letting callers omit the index avoids
confusion by eliminating the myriad cases where KVM specifies '0' as a
dummy value.
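
Illustrative usage (assuming the helpers take (vcpu, function) and
(vcpu, function, index) respectively):

  /* sketch: pass an index only where it is architecturally significant */
  entry = kvm_find_cpuid_entry_index(vcpu, 0x7, 0);   /* CPUID.7, subleaf 0 */
  entry = kvm_find_cpuid_entry(vcpu, 0x80000008);     /* index irrelevant */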

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-14 11:38:32 -04:00
Maxim Levitsky
bdc2d7ad10 KVM: SVM: fix task switch emulation on INTn instruction.
Recently KVM's SVM code switched to re-injecting software interrupt events,
if something prevented their delivery.

Task switch due to a task gate in the IDT, however, is an exception
to this rule, because in this case the INTn instruction causes
a task switch intercept, and its emulation completes the INTn
emulation as well.

Add a missing case to task_switch_interception for that.

This fixes the 32-bit kvm-unit-test taskswitch2.

Fixes: 7e5b5ef8dc ("KVM: SVM: Re-inject INTn instead of retrying the insn on "failure"")

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <20220714124453.188655-1-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-14 11:32:00 -04:00
Sean Christopherson
dfd4eb444e KVM: x86/mmu: Fix typo and tweak comment for split_desc_cache capacity
Remove a spurious closing parenthesis and tweak the comment about the
cache capacity for PTE descriptors (rmaps) eager page splitting to tone
down the assertion slightly, and to call out that topup requires dropping
mmu_lock, which is the real motivation for avoiding topup (as opposed to
memory usage).

Cc: David Matlack <dmatlack@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220712020724.1262121-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-14 11:31:25 -04:00
Sean Christopherson
39944ab99c KVM: x86/mmu: Expand quadrant comment for PG_LEVEL_4K shadow pages
Tweak the comment above the computation of the quadrant for PG_LEVEL_4K
shadow pages to explicitly call out how and why KVM uses role.quadrant to
consume gPTE bits.

Opportunistically wrap an unnecessarily long line.

No functional change intended.

Link: https://lore.kernel.org/all/YqvWvBv27fYzOFdE@google.com
Reviewed-by: David Matlack <dmatlack@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220712020724.1262121-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-14 11:31:24 -04:00
Sean Christopherson
79e48cec6c KVM: x86/mmu: Add optimized helper to retrieve an SPTE's index
Add spte_index() to dedup all the code that calculates a SPTE's index
into its parent's page table and/or spt array.  Opportunistically tweak
the calculation to avoid pointer arithmetic, which is subtle (subtract in
8-byte chunks) and less performant (requires the compiler to generate the
subtraction).
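
Roughly, the helper masks the SPTE's byte address down to its slot within
a 4KiB page of 512 eight-byte entries (sketch, under that assumption):

  /* sketch: index of a SPTE within its parent page table page */
  static inline int spte_index(u64 *sptep)
  {
          return ((unsigned long)sptep / sizeof(*sptep)) & (SPTE_ENT_PER_PAGE - 1);
  }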

Suggested-by: David Matlack <dmatlack@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220712020724.1262121-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-14 11:31:23 -04:00
Paolo Bonzini
cca3f3381b Merge commit 'kvm-vmx-nested-tsc-fix' into kvm-master
Merge bugfix needed in both 5.19 (because it's bad) and 5.20 (because
it is a prerequisite to test new features).
2022-07-14 10:04:44 -04:00
Paolo Bonzini
1b870fa557 kvm: stats: tell userspace which values are boolean
Some of the statistics values exported by KVM are always only 0 or 1.
It can be useful to export this fact to userspace so that it can track
them specially (for example by polling the value every now and then to
compute a % of time spent in a specific state).

Therefore, add "boolean value" as a new "unit".  While it is not exactly
a unit, it walks and quacks like one.  In particular, using the type
would be wrong because boolean values could be instantaneous or peak
values (e.g. "is the rmap allocated?") or even two-bucket histograms
(e.g. "number of posted vs. non-posted interrupt injections").

Suggested-by: Amneesh Singh <natto@weirdnatto.in>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-14 08:01:59 -04:00
Thadeu Lima de Souza Cascardo
84e7051c0b x86/kvm: fix FASTOP_SIZE when return thunks are enabled
The return thunk call makes the fastop functions larger, just like IBT
does. Consider a 16-byte FASTOP_SIZE when CONFIG_RETHUNK is enabled.

Otherwise, functions will be incorrectly aligned and, when computing their
position for differently sized operators, they will be executed in the middle
or end of a function, which may as well be an int3, leading to a crash
like:

[   36.091116] int3: 0000 [#1] SMP NOPTI
[   36.091119] CPU: 3 PID: 1371 Comm: qemu-system-x86 Not tainted 5.15.0-41-generic #44
[   36.091120] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
[   36.091121] RIP: 0010:xaddw_ax_dx+0x9/0x10 [kvm]
[   36.091185] Code: 00 0f bb d0 c3 cc cc cc cc 48 0f bb d0 c3 cc cc cc cc 0f 1f 80 00 00 00 00 0f c0 d0 c3 cc cc cc cc 66 0f c1 d0 c3 cc cc cc cc <0f> 1f 80 00 00 00 00 0f c1 d0 c3 cc cc cc cc 48 0f c1 d0 c3 cc cc
[   36.091186] RSP: 0018:ffffb1f541143c98 EFLAGS: 00000202
[   36.091188] RAX: 0000000089abcdef RBX: 0000000000000001 RCX: 0000000000000000
[   36.091188] RDX: 0000000076543210 RSI: ffffffffc073c6d0 RDI: 0000000000000200
[   36.091189] RBP: ffffb1f541143ca0 R08: ffff9f1803350a70 R09: 0000000000000002
[   36.091190] R10: ffff9f1803350a70 R11: 0000000000000000 R12: ffff9f1803350a70
[   36.091190] R13: ffffffffc077fee0 R14: 0000000000000000 R15: 0000000000000000
[   36.091191] FS:  00007efdfce8d640(0000) GS:ffff9f187dd80000(0000) knlGS:0000000000000000
[   36.091192] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   36.091192] CR2: 0000000000000000 CR3: 0000000009b62002 CR4: 0000000000772ee0
[   36.091195] PKRU: 55555554
[   36.091195] Call Trace:
[   36.091197]  <TASK>
[   36.091198]  ? fastop+0x5a/0xa0 [kvm]
[   36.091222]  x86_emulate_insn+0x7b8/0xe90 [kvm]
[   36.091244]  x86_emulate_instruction+0x2f4/0x630 [kvm]
[   36.091263]  ? kvm_arch_vcpu_load+0x7c/0x230 [kvm]
[   36.091283]  ? vmx_prepare_switch_to_host+0xf7/0x190 [kvm_intel]
[   36.091290]  complete_emulated_mmio+0x297/0x320 [kvm]
[   36.091310]  kvm_arch_vcpu_ioctl_run+0x32f/0x550 [kvm]
[   36.091330]  kvm_vcpu_ioctl+0x29e/0x6d0 [kvm]
[   36.091344]  ? kvm_vcpu_ioctl+0x120/0x6d0 [kvm]
[   36.091357]  ? __fget_files+0x86/0xc0
[   36.091362]  ? __fget_files+0x86/0xc0
[   36.091363]  __x64_sys_ioctl+0x92/0xd0
[   36.091366]  do_syscall_64+0x59/0xc0
[   36.091369]  ? syscall_exit_to_user_mode+0x27/0x50
[   36.091370]  ? do_syscall_64+0x69/0xc0
[   36.091371]  ? syscall_exit_to_user_mode+0x27/0x50
[   36.091372]  ? __x64_sys_writev+0x1c/0x30
[   36.091374]  ? do_syscall_64+0x69/0xc0
[   36.091374]  ? exit_to_user_mode_prepare+0x37/0xb0
[   36.091378]  ? syscall_exit_to_user_mode+0x27/0x50
[   36.091379]  ? do_syscall_64+0x69/0xc0
[   36.091379]  ? do_syscall_64+0x69/0xc0
[   36.091380]  ? do_syscall_64+0x69/0xc0
[   36.091381]  ? do_syscall_64+0x69/0xc0
[   36.091381]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[   36.091384] RIP: 0033:0x7efdfe6d1aff
[   36.091390] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <41> 89 c0 3d 00 f0 ff ff 77 1f 48 8b 44 24 18 64 48 2b 04 25 28 00
[   36.091391] RSP: 002b:00007efdfce8c460 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[   36.091393] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 00007efdfe6d1aff
[   36.091393] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 000000000000000c
[   36.091394] RBP: 0000558f1609e220 R08: 0000558f13fb8190 R09: 00000000ffffffff
[   36.091394] R10: 0000558f16b5e950 R11: 0000000000000246 R12: 0000000000000000
[   36.091394] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
[   36.091396]  </TASK>
[   36.091397] Modules linked in: isofs nls_iso8859_1 kvm_intel joydev kvm input_leds serio_raw sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ipmi_devintf ipmi_msghandler drm msr ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel virtio_net net_failover crypto_simd ahci xhci_pci cryptd psmouse virtio_blk libahci xhci_pci_renesas failover
[   36.123271] ---[ end trace db3c0ab5a48fabcc ]---
[   36.123272] RIP: 0010:xaddw_ax_dx+0x9/0x10 [kvm]
[   36.123319] Code: 00 0f bb d0 c3 cc cc cc cc 48 0f bb d0 c3 cc cc cc cc 0f 1f 80 00 00 00 00 0f c0 d0 c3 cc cc cc cc 66 0f c1 d0 c3 cc cc cc cc <0f> 1f 80 00 00 00 00 0f c1 d0 c3 cc cc cc cc 48 0f c1 d0 c3 cc cc
[   36.123320] RSP: 0018:ffffb1f541143c98 EFLAGS: 00000202
[   36.123321] RAX: 0000000089abcdef RBX: 0000000000000001 RCX: 0000000000000000
[   36.123321] RDX: 0000000076543210 RSI: ffffffffc073c6d0 RDI: 0000000000000200
[   36.123322] RBP: ffffb1f541143ca0 R08: ffff9f1803350a70 R09: 0000000000000002
[   36.123322] R10: ffff9f1803350a70 R11: 0000000000000000 R12: ffff9f1803350a70
[   36.123323] R13: ffffffffc077fee0 R14: 0000000000000000 R15: 0000000000000000
[   36.123323] FS:  00007efdfce8d640(0000) GS:ffff9f187dd80000(0000) knlGS:0000000000000000
[   36.123324] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   36.123325] CR2: 0000000000000000 CR3: 0000000009b62002 CR4: 0000000000772ee0
[   36.123327] PKRU: 55555554
[   36.123328] Kernel panic - not syncing: Fatal exception in interrupt
[   36.123410] Kernel Offset: 0x1400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[   36.135305] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---

Fixes: aa3d480315 ("x86: Use return-thunk in asm code")
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
Message-Id: <20220713171241.184026-1-cascardo@canonical.com>
Tested-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-14 07:44:38 -04:00
Vitaly Kuznetsov
9948272645 KVM: nVMX: Always enable TSC scaling for L2 when it was enabled for L1
Windows 10/11 guests with Hyper-V role (WSL2) enabled are observed to
hang upon boot or shortly after when a non-default TSC frequency was
set for L1. The issue is observed on a host where TSC scaling is
supported. The problem appears to be that Windows doesn't use TSC
scaling for its guests, even when the feature is advertised, and KVM
filters SECONDARY_EXEC_TSC_SCALING out when creating L2 controls from
L1's. This leads to L2 running with the default frequency (matching
host's) while L1 is running with an altered one.

Keep SECONDARY_EXEC_TSC_SCALING in secondary exec controls for L2 when
it was set for L1. TSC_MULTIPLIER is already correctly computed and
written by prepare_vmcs02().

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220712135009.952805-1-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-14 07:44:05 -04:00
Sean Christopherson
b184b35d06 KVM: VMX: Update PT MSR intercepts during filter change iff PT in host+guest
Update the Processor Trace (PT) MSR intercepts during a filter change if
and only if PT may be exposed to the guest, i.e. only if KVM is operating
in the so called "host+guest" mode where PT can be used simultaneously by
both the host and guest.  If PT is in system mode, the host is the sole
owner of PT and the MSRs should never be passed through to the guest.

Luckily the missed check only results in unnecessary work, as select RTIT
MSRs are passed through only when RTIT tracing is enabled "in" the guest,
and tracing can't be enabled in the guest when KVM is in system mode
(writes to guest.MSR_IA32_RTIT_CTL are disallowed).

Cc: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20220712015838.1253995-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2022-07-13 18:14:25 -07:00
Sean Christopherson
0bc2732661 KVM: x86: WARN only once if KVM leaves a dangling userspace I/O request
Change a WARN_ON() to separate WARN_ON_ONCE() if KVM has an outstanding
PIO or MMIO request without an associated callback, i.e. if KVM queued a
userspace I/O exit but didn't actually exit to userspace before moving
on to something else.  Warning on every KVM_RUN risks spamming the kernel
if KVM gets into a bad state.  Opportunistically split the WARNs so that
it's easier to triage failures when a WARN fires.

Deliberately do not use KVM_BUG_ON(), i.e. don't kill the VM.  While the
WARN is all but guaranteed to fire if and only if there's a KVM bug, a
dangling I/O request does not present a danger to KVM (that flag is truly
truly consumed only in a single emulator path), and any such bug is
unlikely to be fatal to the VM (KVM essentially failed to do something it
shouldn't have tried to do in the first place).  In other words, note the
bug, but let the VM keep running.
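
A sketch of the split (the exact fields are assumptions), so that a fired
warning immediately identifies which request was left dangling:

  /* sketch: one WARN per flavor of dangling userspace I/O request */
  WARN_ON_ONCE(vcpu->arch.pio.count);
  WARN_ON_ONCE(vcpu->mmio_needed);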

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20220711232750.1092012-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2022-07-13 18:14:06 -07:00
Sean Christopherson
2626206963 KVM: x86: Set error code to segment selector on LLDT/LTR non-canonical #GP
When injecting a #GP on LLDT/LTR due to a non-canonical LDT/TSS base, set
the error code to the selector.  Intel SDM's says nothing about the #GP,
but AMD's APM explicitly states that both LLDT and LTR set the error code
to the selector, not zero.

Note, a non-canonical memory operand on LLDT/LTR does generate a #GP(0),
but the KVM code in question is specific to the base from the descriptor.

Fixes: e37a75a13c ("KVM: x86: Emulator ignores LDTR/TR extended base on LLDT/LTR")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20220711232750.1092012-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2022-07-13 18:14:06 -07:00
Sean Christopherson
ec6e4d8632 KVM: x86: Mark TSS busy during LTR emulation _after_ all fault checks
Wait to mark the TSS as busy during LTR emulation until after all fault
checks for the LTR have passed.  Specifically, don't mark the TSS busy if
the new TSS base is non-canonical.

Opportunistically drop the one-off !seg_desc.PRESENT check for TR as the
only reason for the early check was to avoid marking a !PRESENT TSS as
busy, i.e. the common !PRESENT check is now done before setting the busy bit.

Fixes: e37a75a13c ("KVM: x86: Emulator ignores LDTR/TR extended base on LLDT/LTR")
Reported-by: syzbot+760a73552f47a8cd0fd9@syzkaller.appspotmail.com
Cc: stable@vger.kernel.org
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Hou Wenlong <houwenlong.hwl@antgroup.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20220711232750.1092012-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2022-07-13 18:14:05 -07:00
Sean Christopherson
43bb9e000e KVM: x86: Tweak name of MONITOR/MWAIT #UD quirk to make it #UD specific
Add a "UD" clause to KVM_X86_QUIRK_MWAIT_NEVER_FAULTS to make it clear
that the quirk only controls the #UD behavior of MONITOR/MWAIT.  KVM
doesn't currently enforce fault checks when MONITOR/MWAIT are supported,
but that could change in the future.  SVM also has a virtualization hole
in that it checks all faults before intercepts, and so "never faults" is
already a lie when running on SVM.

Fixes: bfbcc81bb8 ("KVM: x86: Add a quirk for KVM's "MONITOR/MWAIT are NOPs!" behavior")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20220711225753.1073989-4-seanjc@google.com
2022-07-13 18:14:05 -07:00
Sean Christopherson
79f772b9e8 KVM: x86: Query vcpu->vcpu_idx directly and drop its accessor, again
Read vcpu->vcpu_idx directly instead of bouncing through the one-line
wrapper, kvm_vcpu_get_idx(), and drop the wrapper.  The wrapper is a
remnant of the original implementation and serves no purpose; remove it
(again) before it gains more users.

kvm_vcpu_get_idx() was removed in the not-too-distant past by commit
4eeef24241 ("KVM: x86: Query vcpu->vcpu_idx directly and drop its
accessor"), but was unintentionally re-introduced by commit a54d806688
("KVM: Keep memslots in tree-based structures instead of array-based ones"),
likely due to a rebase goof.  The wrapper then managed to gain users in
KVM's Xen code.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20220614225615.3843835-1-seanjc@google.com
2022-07-12 22:31:13 +00:00
Hou Wenlong
6e1d2a3f25 KVM: x86/mmu: Replace UNMAPPED_GVA with INVALID_GPA for gva_to_gpa()
The result of gva_to_gpa() is a physical address, not a virtual address,
so it is odd that the UNMAPPED_GVA macro is used as the result for a
physical address. Replace UNMAPPED_GVA with INVALID_GPA and drop the
UNMAPPED_GVA macro.

No functional change intended.

Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/6104978956449467d3c68f1ad7f2c2f6d771d0ee.1656667239.git.houwenlong.hwl@antgroup.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2022-07-12 22:31:12 +00:00
Vitaly Kuznetsov
156b9d76e8 KVM: nVMX: Always enable TSC scaling for L2 when it was enabled for L1
Windows 10/11 guests with Hyper-V role (WSL2) enabled are observed to
hang upon boot or shortly after when a non-default TSC frequency was
set for L1. The issue is observed on a host where TSC scaling is
supported. The problem appears to be that Windows doesn't use TSC
scaling for its guests, even when the feature is advertised, and KVM
filters SECONDARY_EXEC_TSC_SCALING out when creating L2 controls from
L1's VMCS. This leads to L2 running with the default frequency (matching
host's) while L1 is running with an altered one.

Keep SECONDARY_EXEC_TSC_SCALING in secondary exec controls for L2 when
it was set for L1. TSC_MULTIPLIER is already correctly computed and
written by prepare_vmcs02().

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Fixes: d041b5ea93 ("KVM: nVMX: Enable nested TSC scaling")
Cc: stable@vger.kernel.org
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20220712135009.952805-1-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2022-07-12 12:06:20 -07:00
Vitaly Kuznetsov
159e037d2e KVM: x86: Fully initialize 'struct kvm_lapic_irq' in kvm_pv_kick_cpu_op()
'vector' and 'trig_mode' fields of 'struct kvm_lapic_irq' are left
uninitialized in kvm_pv_kick_cpu_op(). While these fields are normally
not needed for APIC_DM_REMRD, they're still referenced by
__apic_accept_irq() for trace_kvm_apic_accept_irq(). Fully initialize
the structure to avoid consuming random stack memory.

Fixes: a183b638b6 ("KVM: x86: make apic_accept_irq tracepoint more generic")
Reported-by: syzbot+d6caa905917d353f0d07@syzkaller.appspotmail.com
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20220708125147.593975-1-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2022-07-08 15:59:28 -07:00
Sean Christopherson
f83894b24c KVM: x86: Fix handling of APIC LVT updates when userspace changes MCG_CAP
Add a helper to update KVM's in-kernel local APIC in response to MCG_CAP
being changed by userspace to fix multiple bugs.  First and foremost,
KVM needs to check that there's an in-kernel APIC prior to dereferencing
vcpu->arch.apic.  Beyond that, any "new" LVT entries need to be masked,
and the APIC version register needs to be updated as it reports out the
number of LVT entries.

Fixes: 4b903561ec ("KVM: x86: Add Corrected Machine Check Interrupt (CMCI) emulation to lapic.")
Reported-by: syzbot+8cdad6430c24f396f158@syzkaller.appspotmail.com
Cc: Siddh Raman Pant <code@siddh.me>
Cc: Jue Wang <juew@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
2022-07-08 15:58:16 -07:00
Sean Christopherson
03d84f9689 KVM: x86: Initialize number of APIC LVT entries during APIC creation
Initialize the number of LVT entries during APIC creation, else the field
will be incorrectly left '0' if userspace never invokes KVM_X86_SETUP_MCE.

Add and use a helper to calculate the number of entries even though
MCG_CMCI_P is not set by default in vcpu->arch.mcg_cap.  Relying on that
to always be true is unnecessarily risky, and subtle/confusing as well.
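
A sketch of such a helper (name and constants are assumptions): only the
CMCI LVT entry is conditional on MCG_CMCI_P.

  /* sketch: number of LVT entries given the vCPU's MCG_CAP */
  static inline int kvm_apic_calc_nr_lvt_entries(struct kvm_vcpu *vcpu)
  {
          return KVM_APIC_MAX_NR_LVT_ENTRIES - !(vcpu->arch.mcg_cap & MCG_CMCI_P);
  }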

Fixes: 4b903561ec ("KVM: x86: Add Corrected Machine Check Interrupt (CMCI) emulation to lapic.")
Reported-by: Xiaoyao Li <xiaoyao.li@intel.com>
Cc: Jue Wang <juew@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
2022-07-08 15:55:28 -07:00
Sean Christopherson
4a627b0b16 Merge branch 'kvm-5.20-msr-eperm'
Merge a bug fix and cleanups for {g,s}et_msr_mce() using a base that
predates commit 281b52780b ("KVM: x86: Add emulation for
MSR_IA32_MCx_CTL2 MSRs."), which was written with the intention that it
be applied _after_ the bug fix and cleanups.  The bug fix in particular
needs to be sent to stable trees; give them a stable hash to use.
2022-07-08 15:02:41 -07:00
Sean Christopherson
54ad60ba9d KVM: x86: Add helpers to identify CTL and STATUS MCi MSRs
Add helpers to identify CTL (control) and STATUS MCi MSR types instead of
open coding the checks using the offset.  Using the offset is perfectly
safe, but unintuitive, as understanding what the code does requires
knowing that the offset calculation will not affect the lower three bits.

Opportunistically comment the STATUS logic to save readers a trip to
Intel's SDM or AMD's APM to understand the "data != 0" check.
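
A hedged, standalone sketch of the check being made explicit here (not the actual KVM helpers): legacy MCA banks expose four MSRs per bank starting at IA32_MC0_CTL (0x400), so the low two bits of the offset pick the register within a bank:

  #include <stdbool.h>
  #include <stdio.h>

  #define MSR_IA32_MC0_CTL 0x400u

  /* Within a bank: 0 = CTL, 1 = STATUS, 2 = ADDR, 3 = MISC */
  static bool is_mci_ctl(unsigned int msr)
  {
          return ((msr - MSR_IA32_MC0_CTL) & 3) == 0;
  }

  static bool is_mci_status(unsigned int msr)
  {
          return ((msr - MSR_IA32_MC0_CTL) & 3) == 1;
  }

  int main(void)
  {
          /* 0x401 is IA32_MC0_STATUS */
          printf("0x401: ctl=%d status=%d\n", is_mci_ctl(0x401), is_mci_status(0x401));
          return 0;
  }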

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20220512222716.4112548-4-seanjc@google.com
2022-07-08 14:57:20 -07:00
Sean Christopherson
f5223a332f KVM: x86: Use explicit case-statements for MCx banks in {g,s}et_msr_mce()
Use an explicit case statement to grab the full range of MCx bank MSRs
in {g,s}et_msr_mce(), and manually check only the "end" (the number of
banks configured by userspace may be less than the max).  The "default"
trick works, but is a bit odd now, and will be quite odd if/when support
for accessing MCx_CTL2 MSRs is added, which has near identical logic.

Hoist "offset" to function scope so as to avoid curly braces for the case
statement, and because MCx_CTL2 support will need the same variables.

Opportunistically clean up the comment about allowing bit 10 to be cleared
from bank 4.

No functional change intended.

Cc: Jue Wang <juew@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20220512222716.4112548-3-seanjc@google.com
2022-07-08 14:57:12 -07:00
Sean Christopherson
2368048bf5 KVM: x86: Signal #GP, not -EPERM, on bad WRMSR(MCi_CTL/STATUS)
Return '1', not '-1', when handling an illegal WRMSR to a MCi_CTL or
MCi_STATUS MSR.  The behavior of "all zeros" or "all ones" for CTL MSRs
is architectural, as is the "only zeros" behavior for STATUS MSRs.  I.e.
the intent is to inject a #GP, not exit to userspace due to an unhandled
emulation case.  Returning '-1' gets interpreted as -EPERM up the stack
and effectively kills the guest.
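
A small sketch of the architectural rule described above (illustrative only; it ignores KVM's bank-4/bit-10 quirk): CTL writes must be all zeros or all ones, STATUS writes must be zero, and anything else warrants a #GP:

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  static bool mci_ctl_write_ok(uint64_t data)    { return data == 0 || data == ~0ull; }
  static bool mci_status_write_ok(uint64_t data) { return data == 0; }

  int main(void)
  {
          printf("ctl(~0)=%d ctl(1)=%d status(0)=%d status(1)=%d\n",
                 mci_ctl_write_ok(~0ull), mci_ctl_write_ok(1),
                 mci_status_write_ok(0), mci_status_write_ok(1));
          return 0;
  }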

Fixes: 890ca9aefa ("KVM: Add MCE support")
Fixes: 9ffd986c6e ("KVM: X86: #GP when guest attempts to write MCi_STATUS register w/o 0")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20220512222716.4112548-2-seanjc@google.com
2022-07-08 14:52:59 -07:00
Roman Gushchin
e33c267ab7 mm: shrinkers: provide shrinkers with names
Currently shrinkers are anonymous objects.  For debugging purposes they
can be identified by count/scan function names, but it's not always
useful: e.g. for a superblock's shrinkers it's nice to have at least an
idea of which superblock the shrinker belongs to.

This commit adds names to shrinkers.  register_shrinker() and
prealloc_shrinker() functions are extended to take a format string and
arguments used to generate a name.

In some cases it's not possible to determine a good name at the time when
a shrinker is allocated.  For such cases shrinker_debugfs_rename() is
provided.

The expected format is:
    <subsystem>-<shrinker_type>[:<instance>]-<id>
For some shrinkers an instance can be encoded as (MAJOR:MINOR) pair.

After this change the shrinker debugfs directory looks like:
  $ cd /sys/kernel/debug/shrinker/
  $ ls
    dquota-cache-16     sb-devpts-28     sb-proc-47       sb-tmpfs-42
    mm-shadow-18        sb-devtmpfs-5    sb-proc-48       sb-tmpfs-43
    mm-zspool:zram0-34  sb-hugetlbfs-17  sb-pstore-31     sb-tmpfs-44
    rcu-kfree-0         sb-hugetlbfs-33  sb-rootfs-2      sb-tmpfs-49
    sb-aio-20           sb-iomem-12      sb-securityfs-6  sb-tracefs-13
    sb-anon_inodefs-15  sb-mqueue-21     sb-selinuxfs-22  sb-xfs:vda1-36
    sb-bdev-3           sb-nsfs-4        sb-sockfs-8      sb-zsmalloc-19
    sb-bpf-32           sb-pipefs-14     sb-sysfs-26      thp-deferred_split-10
    sb-btrfs:vda2-24    sb-proc-25       sb-tmpfs-1       thp-zero-9
    sb-cgroup2-30       sb-proc-39       sb-tmpfs-27      xfs-buf:vda1-37
    sb-configfs-23      sb-proc-41       sb-tmpfs-29      xfs-inodegc:vda1-38
    sb-dax-11           sb-proc-45       sb-tmpfs-35
    sb-debugfs-7        sb-proc-46       sb-tmpfs-40
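
A tiny userspace illustration of the naming scheme only (not the shrinker API itself); the values are taken from the "sb-xfs:vda1-36" entry above:

  #include <stdio.h>

  int main(void)
  {
          char name[64];

          /* <subsystem>-<shrinker_type>[:<instance>]-<id> */
          snprintf(name, sizeof(name), "%s-%s:%s-%d", "sb", "xfs", "vda1", 36);
          printf("%s\n", name);       /* prints: sb-xfs:vda1-36 */
          return 0;
  }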

[roman.gushchin@linux.dev: fix build warnings]
  Link: https://lkml.kernel.org/r/Yr+ZTnLb9lJk6fJO@castle
  Reported-by: kernel test robot <lkp@intel.com>
Link: https://lkml.kernel.org/r/20220601032227.4076670-4-roman.gushchin@linux.dev
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-03 18:08:40 -07:00
Peter Zijlstra
f43b9876e8 x86/retbleed: Add fine grained Kconfig knobs
Do fine-grained Kconfig for all the various retbleed parts.

NOTE: if your compiler doesn't support return thunks, this will
silently 'upgrade' your mitigation to IBPB; you might not like this.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
2022-06-29 17:43:41 +02:00
Josh Poimboeuf
07853adc29 KVM: VMX: Prevent RSB underflow before vmenter
On VMX, there are some balanced returns between the time the guest's
SPEC_CTRL value is written, and the vmenter.

Balanced returns (matched by a preceding call) are usually ok, but it's
at least theoretically possible an NMI with a deep call stack could
empty the RSB before one of the returns.

For maximum paranoia, don't allow *any* returns (balanced or otherwise)
between the SPEC_CTRL write and the vmenter.

  [ bp: Fix 32-bit build. ]

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
2022-06-27 10:34:00 +02:00
Josh Poimboeuf
9756bba284 x86/speculation: Fill RSB on vmexit for IBRS
Prevent RSB underflow/poisoning attacks with RSB.  While at it, add a
bunch of comments to attempt to document the current state of tribal
knowledge about RSB attacks and what exactly is being mitigated.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
2022-06-27 10:34:00 +02:00
Josh Poimboeuf
bea7e31a5c KVM: VMX: Fix IBRS handling after vmexit
For legacy IBRS to work, the IBRS bit needs to be always re-written
after vmexit, even if it's already on.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
2022-06-27 10:34:00 +02:00
Josh Poimboeuf
fc02735b14 KVM: VMX: Prevent guest RSB poisoning attacks with eIBRS
On eIBRS systems, the returns in the vmexit return path from
__vmx_vcpu_run() to vmx_vcpu_run() are exposed to RSB poisoning attacks.

Fix that by moving the post-vmexit spec_ctrl handling to immediately
after the vmexit.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
2022-06-27 10:34:00 +02:00
Josh Poimboeuf
bb06650634 KVM: VMX: Convert launched argument to flags
Convert __vmx_vcpu_run()'s 'launched' argument to 'flags', in
preparation for doing SPEC_CTRL handling immediately after vmexit, which
will need another flag.

This is much easier than adding a fourth argument, because this code
supports both 32-bit and 64-bit, and the fourth argument on 32-bit would
have to be pushed on the stack.

Note that __vmx_vcpu_run_flags() is called outside of the noinstr
critical section because it will soon start calling potentially
traceable functions.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
2022-06-27 10:34:00 +02:00
Josh Poimboeuf
8bd200d23e KVM: VMX: Flatten __vmx_vcpu_run()
Move the vmx_vm{enter,exit}() functionality into __vmx_vcpu_run().  This
will make it easier to do the spec_ctrl handling before the first RET.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
2022-06-27 10:34:00 +02:00
Peter Zijlstra
a149180fbc x86: Add magic AMD return-thunk
Note: needs to be in a section distinct from Retpolines such that the
Retpoline RET substitution cannot possibly use immediate jumps.

ORC unwinding for zen_untrain_ret() and __x86_return_thunk() is a
little tricky but works due to the fact that zen_untrain_ret() doesn't
have any stack ops and as such will emit a single ORC entry at the
start (+0x3f).

Meanwhile, unwinding an IP, including the __x86_return_thunk() one
(+0x40) will search for the largest ORC entry smaller or equal to the
IP, these will find the one ORC entry (+0x3f) and all works.

  [ Alexandre: SVM part. ]
  [ bp: Build fix, massages. ]

Suggested-by: Andrew Cooper <Andrew.Cooper3@citrix.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
2022-06-27 10:33:59 +02:00
Peter Zijlstra
af2e140f34 x86/kvm: Fix SETcc emulation for return thunks
Prepare the SETcc fastop stuff for when RET can be larger still.

The tricky bit here is that the expressions should not only be
constant C expressions, but also absolute GAS expressions. This means
no ?: and 'true' is ~0.

Also ensure em_setcc() has the same alignment as the actual FOP_SETCC()
ops, this ensures there cannot be an alignment hole between em_setcc()
and the first op.

Additionally, add a .skip directive to the FOP_SETCC() macro to fill
any remaining space with INT3 traps; however the primary purpose of
this directive is to generate AS warnings when the remaining space
goes negative. Which is a very good indication the alignment magic
went side-ways.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
2022-06-27 10:33:58 +02:00
Peter Zijlstra
742ab6df97 x86/kvm/vmx: Make noinstr clean
The recent mmio_stale_data fixes broke the noinstr constraints:

  vmlinux.o: warning: objtool: vmx_vcpu_enter_exit+0x15b: call to wrmsrl.constprop.0() leaves .noinstr.text section
  vmlinux.o: warning: objtool: vmx_vcpu_enter_exit+0x1bf: call to kvm_arch_has_assigned_device() leaves .noinstr.text section

make it all happy again.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
2022-06-27 10:33:58 +02:00
Sean Christopherson
b9b71f4368 KVM: x86/mmu: Buffer nested MMU split_desc_cache only by default capacity
Buffer split_desc_cache, the cache used to allocate rmap list entries,
only by the default cache capacity (currently 40), not by doubling the
minimum (513).  Aliasing L2 GPAs to L1 GPAs is uncommon, thus eager page
splitting is unlikely to need 500+ entries.  And because each object is a
non-trivial 128 bytes (see struct pte_list_desc), those extra ~500
entries mean KVM is in all likelihood wasting ~64KiB of memory per VM.

Link: https://lore.kernel.org/all/YrTDcrsn0%2F+alpzf@google.com
Cc: David Matlack <dmatlack@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220624171808.2845941-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-25 04:54:51 -04:00
Sean Christopherson
72ae5822b8 KVM: x86/mmu: Use "unsigned int", not "u32", for SPTEs' @access info
Use an "unsigned int" for @access parameters instead of a "u32", mostly
to be consistent throughout KVM, but also because "u32" is misleading.
@access can actually squeeze into a u8, i.e. doesn't need 32 bits, but is
passed as an "unsigned int" because sp->role.access is an unsigned int.

No functional change intended.

Link: https://lore.kernel.org/all/YqyZxEfxXLsHGoZ%2F@google.com
Cc: David Matlack <dmatlack@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220624171808.2845941-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-25 04:54:40 -04:00
Paolo Bonzini
db209369d4 KVM: SEV-ES: reuse advance_sev_es_emulated_ins for OUT too
complete_emulator_pio_in() only has to be called by
complete_sev_es_emulated_ins() now; therefore, all that the function does
now is adjust sev_pio_count and sev_pio_data, which is the same for
both IN and OUT.

No functional change intended.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 13:05:35 -04:00
Paolo Bonzini
f35cee4adb KVM: x86: de-underscorify __emulator_pio_in
Now all callers except emulator_pio_in_emulated are using
__emulator_pio_in/complete_emulator_pio_in explicitly.
Move the "either copy the result or attempt PIO" logic in
emulator_pio_in_emulated, and rename __emulator_pio_in to
just emulator_pio_in.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 12:54:33 -04:00
Paolo Bonzini
dc7a4bfde5 KVM: x86: wean fast IN from emulator_pio_in
Use __emulator_pio_in() directly for fast PIO instead of bouncing through
emulator_pio_in() now that __emulator_pio_in() fills "val" when handling
in-kernel PIO.  vcpu->arch.pio.count is guaranteed to be '0', so this is a
pure nop.

emulator_pio_in_emulated is now the last caller of emulator_pio_in.

No functional change intended.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 12:54:20 -04:00
Paolo Bonzini
0c05e10bce KVM: x86: wean in-kernel PIO from vcpu->arch.pio*
Make emulator_pio_in_out operate directly on the provided buffer
as long as PIO is handled inside KVM.

For input operations, this means that, in the case of in-kernel
PIO, __emulator_pio_in() does not have to be always followed
by complete_emulator_pio_in().  This affects emulator_pio_in() and
kvm_sev_es_ins(); for the latter, that is why the call moves from
advance_sev_es_emulated_ins() to complete_sev_es_emulated_ins().

For output, it means that vcpu->pio.count is never set unnecessarily
and there is no need to clear it; but also vcpu->pio.size must not
be used in kvm_sev_es_outs(), because it will not be updated for
in-kernel OUT.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 12:54:04 -04:00
Paolo Bonzini
30d583fd4e KVM: x86: move all vcpu->arch.pio* setup in emulator_pio_in_out()
For now, this is basically an excuse to add back the void* argument to
the function, while removing some knowledge of vcpu->arch.pio* from
its callers.  The WARN that vcpu->arch.pio.count is zero is also
extended to OUT operations.

The vcpu->arch.pio* fields still need to be filled even when the PIO is
handled in-kernel as __emulator_pio_in() is always followed by
complete_emulator_pio_in().  But after fixing that, it will be possible to
only populate the vcpu->arch.pio* fields on userspace exits.

No functional change intended.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 12:53:50 -04:00
Paolo Bonzini
35ab3b77a0 KVM: x86: drop PIO from unregistered devices
KVM protects the device list with SRCU, and therefore different calls
to kvm_io_bus_read()/kvm_io_bus_write() can very well see different
incarnations of kvm->buses.  If userspace unregisters a device while
vCPUs are running there is no well-defined result.  This patch applies
a safe fallback by returning early from emulator_pio_in_out().  This
corresponds to returning zeroes from IN, and dropping the writes on
the floor for OUT.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 12:53:37 -04:00
Paolo Bonzini
0f87ac234d KVM: x86: inline kernel_pio into its sole caller
The caller of kernel_pio already has arguments for most of what kernel_pio
fishes out of vcpu->arch.pio.  This is the first step towards ensuring that
vcpu->arch.pio.* is only used when exiting to userspace.

No functional change intended.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 12:53:23 -04:00
Paolo Bonzini
7a6177d6f3 KVM: x86: complete fast IN directly with complete_emulator_pio_in()
Use complete_emulator_pio_in() directly when completing fast PIO, there's
no need to bounce through emulator_pio_in(): the comment about ECX
changing doesn't apply to fast PIO, which isn't used for string I/O.

No functional change intended.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 12:53:07 -04:00
Maxim Levitsky
091abbf578 KVM: x86: nSVM: optimize svm_set_x2apic_msr_interception
- Avoid toggling the x2apic msr interception if it is already up to date.

- Avoid touching L0 msr bitmap when AVIC is inhibited on entry to
  the guest mode, because in this case the guest usually uses its
  own msr bitmap.

  Later on VM exit, the 1st optimization will allow KVM to skip
  touching the L0 msr bitmap as well.

Reviewed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Tested-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220519102709.24125-18-suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 12:52:59 -04:00
Suravee Suthikulpanit
39b6b8c35c KVM: SVM: Add AVIC doorbell tracepoint
Add a tracepoint to track the number of doorbells sent to signal a
running vCPU to process an IRQ after it has been injected.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Message-Id: <20220519102709.24125-17-suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 12:52:45 -04:00
Suravee Suthikulpanit
8c9e639da4 KVM: SVM: Use target APIC ID to complete x2AVIC IRQs when possible
For x2AVIC, the index from incomplete IPI #vmexit info is invalid
for logical cluster mode. Only ICRH/ICRL values can be used
to determine the IPI destination APIC ID.

Since QEMU defines guest physical APIC ID to be the same as
vCPU ID, it can be used to quickly identify the target vCPU of the IPI
and avoid the overhead of searching through all vCPUs to match the target
vCPU.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Message-Id: <20220519102709.24125-16-suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 12:52:18 -04:00
Suravee Suthikulpanit
f8d8ac2159 KVM: x86: Warning APICv inconsistency only when vcpu APIC mode is valid
When launching a VM with x2APIC and more than 255 vCPUs, the guest
kernel can disable x2APIC (e.g. via the nox2apic kernel option).
The VM then falls back to xAPIC mode and disables vCPUs with APIC ID
255 and greater.

In this case, APICv is deactivated for the disabled vCPUs.
However, the current APICv consistency warning does not account for
this case, which results in a warning.

Therefore, modify the warning logic to report only when the vCPU APIC
mode is valid.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Message-Id: <20220519102709.24125-15-suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 12:51:15 -04:00
Suravee Suthikulpanit
0e311d33bf KVM: SVM: Introduce hybrid-AVIC mode
Currently, AVIC is inhibited when booting a VM w/ x2APIC support,
because AVIC cannot virtualize x2APIC MSR accesses.
However, the AVIC doorbell can be used to accelerate interrupt
injection into a running vCPU, while all guest accesses to x2APIC MSRs
will be intercepted and emulated by KVM.

With hybrid-AVIC support, the APICV_INHIBIT_REASON_X2APIC is
no longer enforced.

Suggested-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Message-Id: <20220519102709.24125-14-suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 12:51:00 -04:00
Suravee Suthikulpanit
c0caeee65a KVM: SVM: Do not throw warning when calling avic_vcpu_load on a running vcpu
Originally, this WARN_ON was designed to detect calling avic_vcpu_load()
on an already running vCPU in AVIC mode (i.e. with the AVIC is_running
bit set).

However, for x2AVIC, the vCPU can switch from xAPIC to x2APIC mode while
running, in which case avic_vcpu_load() will be called from
svm_refresh_apicv_exec_ctrl().

Therefore, remove this warning since it is no longer appropriate.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Message-Id: <20220519102709.24125-13-suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 12:50:49 -04:00
Suravee Suthikulpanit
4d1d7942e3 KVM: SVM: Introduce logic to (de)activate x2AVIC mode
Introduce logic to (de)activate AVIC, which also allows
switching between AVIC and x2AVIC modes at runtime.

When an AVIC-enabled guest switches from APIC to x2APIC mode,
the SVM driver needs to perform the following steps:

1. Set the x2APIC mode bit for AVIC in VMCB along with the maximum
APIC ID supported for each mode accordingly.

2. Disable x2APIC MSRs interception in order to allow the hardware
to virtualize x2APIC MSRs accesses.

Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Message-Id: <20220519102709.24125-12-suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 12:50:35 -04:00
Maxim Levitsky
7a8f7c1f34 KVM: x86: nSVM: always intercept x2apic msrs
As a preparation for x2avic, this patch ensures that x2apic msrs
are always intercepted for the nested guest.

Reviewed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Tested-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220519102709.24125-11-suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 12:50:26 -04:00
Suravee Suthikulpanit
05c4fe8c1b KVM: SVM: Refresh AVIC configuration when changing APIC mode
AMD AVIC can support xAPIC and x2APIC virtualization,
which requires changing the x2APIC mode bit in the VMCB and the MSR
interception for x2APIC MSRs. Therefore, call avic_refresh_apicv_exec_ctrl()
to refresh the configuration accordingly.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Message-Id: <20220519102709.24125-10-suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 12:50:17 -04:00
Suravee Suthikulpanit
8fc9c7a307 KVM: x86: Deactivate APICv on vCPU with APIC disabled
APICv should be deactivated on a vCPU that has its APIC disabled.
Therefore, call kvm_vcpu_update_apicv() when changing
APIC mode, and add an additional check for APIC-disabled mode
when determining APICv activation.

Suggested-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Message-Id: <20220519102709.24125-9-suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 12:45:51 -04:00
Suravee Suthikulpanit
5c127c8547 KVM: SVM: Adding support for configuring x2APIC MSRs interception
When enabling x2APIC virtualization (x2AVIC), the interception of
x2APIC MSRs must be disabled to let the hardware virtualize guest
MSR accesses.

The current implementation keeps track of MSR interception state
in the svm_direct_access_msrs array. Therefore, extend the array to
include x2APIC MSRs.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Message-Id: <20220519102709.24125-8-suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 12:45:41 -04:00
Suravee Suthikulpanit
ab1b1dc131 KVM: SVM: Do not support updating APIC ID when in x2APIC mode
In X2APIC mode, the Logical Destination Register is read-only,
which provides a fixed mapping between the logical and physical
APIC IDs. Therefore, there is no Logical APIC ID table in X2AVIC
and the processor uses the X2APIC ID in the backing page to create
a vCPU’s logical ID.

In addition, KVM does not support updating APIC ID in x2APIC mode,
which means AVIC does not need to handle this case.

Therefore, check x2APIC mode when handling physical and logical
APIC ID update, and when invalidating logical APIC ID table.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Suggested-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Message-Id: <20220519102709.24125-7-suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 12:45:29 -04:00
Suravee Suthikulpanit
c514d3a348 KVM: SVM: Update avic_kick_target_vcpus to support 32-bit APIC ID
In x2APIC mode, ICRH contains a 32-bit destination APIC ID.
So, update avic_kick_target_vcpus() accordingly.
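
An illustrative sketch of the difference (assumed from the APIC architecture, not the KVM code): in xAPIC mode the destination is the top byte of ICRH, while in x2APIC mode ICRH carries a full 32-bit APIC ID:

  #include <stdint.h>
  #include <stdio.h>

  static uint32_t icrh_dest(uint32_t icrh, int x2apic)
  {
          return x2apic ? icrh : (icrh >> 24) & 0xff;
  }

  int main(void)
  {
          printf("xAPIC dest: %u, x2APIC dest: %u\n",
                 icrh_dest(5u << 24, 0), icrh_dest(300, 1));
          return 0;
  }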

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Message-Id: <20220519102709.24125-6-suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 12:45:21 -04:00
Suravee Suthikulpanit
d2fe6bf5b8 KVM: SVM: Update max number of vCPUs supported for x2AVIC mode
xAVIC and x2AVIC modes can support different numbers of vCPUs.
Update the existing logic to support each mode accordingly.

Also, modify the maximum physical APIC ID for AVIC to 255 to reflect
the actual value supported by the architecture.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Message-Id: <20220519102709.24125-5-suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 12:45:08 -04:00
Suravee Suthikulpanit
4bdec12aa8 KVM: SVM: Detect X2APIC virtualization (x2AVIC) support
Add CPUID check for the x2APIC virtualization (x2AVIC) feature.
If available, the SVM driver can support both AVIC and x2AVIC modes
when the kvm_amd driver is loaded with avic=1. The operating mode will be
determined at runtime depending on the guest APIC mode.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Message-Id: <20220519102709.24125-4-suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 12:44:54 -04:00
Suravee Suthikulpanit
bf348f667e KVM: x86: lapic: Rename [GET/SET]_APIC_DEST_FIELD to [GET/SET]_XAPIC_DEST_FIELD
To signify that the macros only support 8-bit xAPIC destination ID.

Suggested-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220519102709.24125-3-suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 12:44:34 -04:00
Paolo Bonzini
4de5c54f8c KVM: nVMX: clean up posted interrupt descriptor try_cmpxchg
Rely on try_cmpxchg64 for re-reading the PID on failure, using READ_ONCE
only right before the first iteration.
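
A userspace sketch of the pattern (not the KVM code; 'pid_val' is a hypothetical stand-in for the posted-interrupt descriptor word): read the value once up front, then let the compare-exchange itself refresh the expected value on failure instead of re-reading inside the loop:

  #include <stdint.h>
  #include <stdio.h>

  static uint64_t pid_val = 0x1;

  int main(void)
  {
          uint64_t old = __atomic_load_n(&pid_val, __ATOMIC_RELAXED);
          uint64_t new;

          do {
                  new = old | (1ull << 8);    /* e.g. set a pending bit */
                  /* on failure, 'old' is updated with the current value */
          } while (!__atomic_compare_exchange_n(&pid_val, &old, new, 0,
                                                __ATOMIC_ACQ_REL, __ATOMIC_RELAXED));

          printf("pid=0x%llx\n", (unsigned long long)pid_val);
          return 0;
  }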

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 11:45:45 -04:00
Jue Wang
aebc3ca190 KVM: x86: Enable CMCI capability by default and handle injected UCNA errors
This patch enables MCG_CMCI_P by default in kvm_mce_cap_supported. It
reuses ioctl KVM_X86_SET_MCE to implement injection of UnCorrectable
No Action required (UCNA) errors, signaled via Corrected Machine
Check Interrupt (CMCI).

Neither the CMCI nor the UCNA emulation depends on hardware.

Signed-off-by: Jue Wang <juew@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220610171134.772566-8-juew@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 04:52:03 -04:00
Jue Wang
281b52780b KVM: x86: Add emulation for MSR_IA32_MCx_CTL2 MSRs.
This patch adds the emulation of IA32_MCi_CTL2 registers to KVM. A
separate mci_ctl2_banks array is used to keep the existing mce_banks
register layout intact.

In Machine Check Architecture, in addition to MCG_CMCI_P, bit 30 of
the per-bank register IA32_MCi_CTL2 controls whether Corrected Machine
Check error reporting is enabled.

Signed-off-by: Jue Wang <juew@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220610171134.772566-7-juew@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 04:52:03 -04:00
Jue Wang
087acc4e18 KVM: x86: Use kcalloc to allocate the mce_banks array.
This patch updates the allocation of mce_banks with the array allocation
API (kcalloc) as a precedent for the later mci_ctl2_banks to implement
per-bank control of Corrected Machine Check Interrupt (CMCI).

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Jue Wang <juew@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220610171134.772566-6-juew@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 04:52:02 -04:00
Jue Wang
4b903561ec KVM: x86: Add Corrected Machine Check Interrupt (CMCI) emulation to lapic.
This patch calculates the number of LVT entries as part of
KVM_X86_SETUP_MCE, conditioned on the presence of the MCG_CMCI_P bit in
MCG_CAP, and stores the result in kvm_lapic. It translates from the
APIC_LVTx register to an index in the lapic_lvt_entry enum. It extends
the APIC_LVTx macro as well as other lapic write/reset handling etc. to
support the Corrected Machine Check Interrupt.

Signed-off-by: Jue Wang <juew@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220610171134.772566-5-juew@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 04:52:02 -04:00
Jue Wang
987f625e07 KVM: x86: Add APIC_LVTx() macro.
An APIC_LVTx macro is introduced to calculate the APIC_LVTx register
offset based on the index in the lapic_lvt_entry enum. Later patches
will extend the APIC_LVTx macro to support the APIC_LVTCMCI register
in order to implement Corrected Machine Check Interrupt signaling.
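
A hedged sketch of the offset calculation (the exact KVM macro may differ): the classic LVT registers sit at 0x10-byte strides starting at the timer LVT (0x320), and LVT CMCI lives separately at 0x2F0, so it gets special-cased:

  #include <stdio.h>

  enum lvt_entry { LVT_TIMER, LVT_THERMAL, LVT_PERFCNT,
                   LVT_LINT0, LVT_LINT1, LVT_ERROR, LVT_CMCI };

  #define APIC_LVTT     0x320u
  #define APIC_LVTCMCI  0x2F0u
  #define APIC_LVTx(i)  ((i) == LVT_CMCI ? APIC_LVTCMCI : APIC_LVTT + 0x10u * (i))

  int main(void)
  {
          printf("LINT0 offset: 0x%x, CMCI offset: 0x%x\n",
                 APIC_LVTx(LVT_LINT0), APIC_LVTx(LVT_CMCI));
          return 0;
  }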

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Jue Wang <juew@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220610171134.772566-4-juew@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 04:52:02 -04:00
Paolo Bonzini
0378739401 KVM: x86/mmu: Avoid unnecessary flush on eager page split
The TLB flush before installing the newly-populated lower level
page table is unnecessary if the lower-level page table maps
the huge page identically.  KVM knows it does if it did not reuse
an existing shadow page table; tell drop_large_spte() to skip
the flush in that case.

Extracted from a patch by David Matlack.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 04:52:01 -04:00
Jue Wang
1d8c681fb6 KVM: x86: Fill apic_lvt_mask with enums / explicit entries.
This patch defines a lapic_lvt_entry enum used as explicit indices to
the apic_lvt_mask array. In later patches a LVT_CMCI will be added to
implement the Corrected Machine Check Interrupt signaling.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Jue Wang <juew@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220610171134.772566-3-juew@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 04:52:01 -04:00
Jue Wang
951ceb94ed KVM: x86: Make APIC_VERSION capture only the magic 0x14UL.
Refactor APIC_VERSION so that the maximum number of LVT entries is
inserted at runtime rather than compile time. This will be used in a
subsequent commit to expose the LVT CMCI Register to VMs that support
Corrected Machine Check error counting/signaling
(IA32_MCG_CAP.MCG_CMCI_P=1).
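
A sketch of the register layout being parameterized (layout per the SDM: bits 7:0 hold the version, bits 23:16 hold "Max LVT Entry", i.e. the number of LVT entries minus one); the helper is illustrative, not KVM's:

  #include <stdio.h>

  static unsigned int apic_version(unsigned int nr_lvt_entries)
  {
          return 0x14u | ((nr_lvt_entries - 1) << 16);
  }

  int main(void)
  {
          printf("6 LVTs: 0x%06x, 7 LVTs (with CMCI): 0x%06x\n",
                 apic_version(6), apic_version(7));
          return 0;
  }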

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Jue Wang <juew@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220610171134.772566-2-juew@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 04:52:01 -04:00
David Matlack
ada51a9de7 KVM: x86/mmu: Extend Eager Page Splitting to nested MMUs
Add support for Eager Page Splitting pages that are mapped by nested
MMUs. Walk through the rmap first splitting all 1GiB pages to 2MiB
pages, and then splitting all 2MiB pages to 4KiB pages.

Note, Eager Page Splitting is limited to nested MMUs as a policy rather
than due to any technical reason (the sp->role.guest_mode check could
just be deleted and Eager Page Splitting would work correctly for all
shadow MMU pages). There is really no reason to support Eager Page
Splitting for tdp_mmu=N, since such support will eventually be phased
out, and there is no current use case supporting Eager Page Splitting on
hosts where TDP is either disabled or unavailable in hardware.
Furthermore, future improvements to nested MMU scalability may diverge
the code from the legacy shadow paging implementation. These
improvements will be simpler to make if Eager Page Splitting does not
have to worry about legacy shadow paging.

Splitting huge pages mapped by nested MMUs requires dealing with some
extra complexity beyond that of the TDP MMU:

(1) The shadow MMU has a limit on the number of shadow pages that are
    allowed to be allocated. So, as a policy, Eager Page Splitting
    refuses to split if there are KVM_MIN_FREE_MMU_PAGES or fewer
    pages available.

(2) Splitting a huge page may end up re-using an existing lower level
    shadow page tables. This is unlike the TDP MMU which always allocates
    new shadow page tables when splitting.

(3) When installing the lower level SPTEs, they must be added to the
    rmap which may require allocating additional pte_list_desc structs.

Case (2) is especially interesting since it may require a TLB flush,
unlike the TDP MMU which can fully split huge pages without any TLB
flushes. Specifically, an existing lower level page table may point to
even lower level page tables that are not fully populated, effectively
unmapping a portion of the huge page, which requires a flush.  As of
this commit, a flush is always done after dropping the huge page
and before installing the lower level page table.

This TLB flush could instead be delayed until the MMU lock is about to be
dropped, which would batch flushes for multiple splits.  However these
flushes should be rare in practice (a huge page must be aliased in
multiple SPTEs and have been split for NX Huge Pages in only some of
them). Flushing immediately is simpler to plumb and also reduces the
chances of tripping over a CPU bug (e.g. see iTLB multihit).

[ This commit is based off of the original implementation of Eager Page
  Splitting from Peter in Google's kernel from 2016. ]

Suggested-by: Peter Feiner <pfeiner@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-23-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 04:52:00 -04:00
Paolo Bonzini
0cd8dc7398 KVM: x86/mmu: pull call to drop_large_spte() into __link_shadow_page()
Before allocating a child shadow page table, all callers check
whether the parent already points to a huge page and, if so, they
drop that SPTE.  This is done by drop_large_spte().

However, dropping the large SPTE is really only necessary before the
sp is installed.  While the sp is returned by kvm_mmu_get_child_sp(),
installing it happens later in __link_shadow_page().  Move the call
there instead of having it in each and every caller.

To ensure that the shadow page is not linked twice if it was present,
do _not_ opportunistically make kvm_mmu_get_child_sp() idempotent:
instead, return an error value if the shadow page already existed.
This is a bit more verbose, but clearer than NULL.

Finally, now that the drop_large_spte() name is not taken anymore,
remove the two underscores in front of __drop_large_spte().

Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 04:51:59 -04:00
David Matlack
20d49186c0 KVM: x86/mmu: Zap collapsible SPTEs in shadow MMU at all possible levels
Currently KVM only zaps collapsible 4KiB SPTEs in the shadow MMU. This
is fine for now since KVM never creates intermediate huge pages during
dirty logging. In other words, KVM always replaces 1GiB pages directly
with 4KiB pages, so there is no reason to look for collapsible 2MiB
pages.

However, this will stop being true once the shadow MMU participates in
eager page splitting. During eager page splitting, each 1GiB is first
split into 2MiB pages and then those are split into 4KiB pages. The
intermediate 2MiB pages may be left behind if an error condition causes
eager page splitting to bail early.

No functional change intended.

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-20-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 04:51:59 -04:00
David Matlack
47855da055 KVM: x86/mmu: Extend make_huge_page_split_spte() for the shadow MMU
Currently make_huge_page_split_spte() assumes execute permissions can be
granted to any 4K SPTE when splitting huge pages. This is true for the
TDP MMU but is not necessarily true for the shadow MMU, since KVM may be
shadowing a non-executable huge page.

To fix this, pass in the role of the child shadow page where the huge
page will be split and derive the execution permission from that.  This
is correct because huge pages are always split with a direct shadow page
and thus the shadow page role contains the correct access permissions.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-19-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 04:51:59 -04:00
David Matlack
6a97575d5c KVM: x86/mmu: Cache the access bits of shadowed translations
Splitting huge pages requires allocating/finding shadow pages to replace
the huge page. Shadow pages are keyed, in part, off the guest access
permissions they are shadowing. For fully direct MMUs, there is no
shadowing so the access bits in the shadow page role are always ACC_ALL.
But during shadow paging, the guest can enforce whatever access
permissions it wants.

In particular, eager page splitting needs to know the permissions to use
for the subpages, but KVM cannot retrieve them from the guest page
tables because eager page splitting does not have a vCPU.  Fortunately,
the guest access permissions are easy to cache whenever page faults or
FNAME(sync_page) update the shadow page tables; this is an extension of
the existing cache of the shadowed GFNs in the gfns array of the shadow
page.  The access bits only take up 3 bits, which leaves 61 bits left
over for gfns, which is more than enough.
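
A minimal sketch of the packing described above (the exact bit layout is an assumption for illustration, not necessarily KVM's encoding): the gfn takes the upper 61 bits and the three ACC_* bits the low 3:

  #include <stdint.h>
  #include <stdio.h>

  static uint64_t pack_translation(uint64_t gfn, unsigned int access)
  {
          return (gfn << 3) | (access & 0x7);
  }

  int main(void)
  {
          uint64_t e = pack_translation(0x12345, 0x7 /* e.g. ACC_ALL */);
          printf("gfn=0x%llx access=0x%llx\n",
                 (unsigned long long)(e >> 3), (unsigned long long)(e & 0x7));
          return 0;
  }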

Now that the gfns array caches more information than just GFNs, rename
it to shadowed_translation.

While here, preemptively fix up the WARN_ON() that detects gfn
mismatches in direct SPs. The WARN_ON() was paired with a
pr_err_ratelimited(), which means that users could sometimes see the
WARN without the accompanying error message. Fix this by outputting the
error message as part of the WARN splat, and opportunistically make
them WARN_ONCE() because if these ever fire, they are all but guaranteed
to fire a lot and will bring down the kernel.

Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-18-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 04:51:58 -04:00
David Matlack
81cb4657e9 KVM: x86/mmu: Update page stats in __rmap_add()
Update the page stats in __rmap_add() rather than at the call site. This
will avoid having to manually update page stats when splitting huge
pages in a subsequent commit.

No functional change intended.

Reviewed-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-17-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 04:51:58 -04:00
David Matlack
2ff9039a75 KVM: x86/mmu: Decouple rmap_add() and link_shadow_page() from kvm_vcpu
Allow adding new entries to the rmap and linking shadow pages without a
struct kvm_vcpu pointer by moving the implementation of rmap_add() and
link_shadow_page() into inner helper functions.

No functional change intended.

Reviewed-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-16-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 04:51:57 -04:00
David Matlack
6ec6509eea KVM: x86/mmu: Pass const memslot to rmap_add()
Constify rmap_add()'s @slot parameter; it is simply passed on to
gfn_to_rmap(), which takes a const memslot.

No functional change intended.

Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-15-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 04:51:57 -04:00
David Matlack
cbd858b17e KVM: x86/mmu: Allow NULL @vcpu in kvm_mmu_find_shadow_page()
Allow @vcpu to be NULL in kvm_mmu_find_shadow_page() (and its only
caller __kvm_mmu_get_shadow_page()). @vcpu is only required to sync
indirect shadow pages, so it's safe to pass in NULL when looking up
direct shadow pages.

This will be used for doing eager page splitting, which allocates direct
shadow pages from the context of a VM ioctl without access to a vCPU
pointer.

Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-14-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 04:51:57 -04:00
David Matlack
3cc736b357 KVM: x86/mmu: Pass kvm pointer separately from vcpu to kvm_mmu_find_shadow_page()
Get the kvm pointer from the caller, rather than deriving it from
vcpu->kvm, and plumb the kvm pointer all the way from
kvm_mmu_get_shadow_page(). With this change in place, the vcpu pointer
is only needed to sync indirect shadow pages. In other words,
__kvm_mmu_get_shadow_page() can now be used to get *direct* shadow pages
without a vcpu pointer. This enables eager page splitting, which needs
to allocate direct shadow pages during VM ioctls.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-13-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 04:51:56 -04:00
David Matlack
336081fb3f KVM: x86/mmu: Replace vcpu with kvm in kvm_mmu_alloc_shadow_page()
The vcpu pointer in kvm_mmu_alloc_shadow_page() is only used to get the
kvm pointer. So drop the vcpu pointer and just pass in the kvm pointer.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-12-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 04:51:56 -04:00
David Matlack
2f8b1b539b KVM: x86/mmu: Pass memory caches to allocate SPs separately
Refactor kvm_mmu_alloc_shadow_page() to receive the caches from which it
will allocate the various pieces of memory for shadow pages as a
parameter, rather than deriving them from the vcpu pointer. This will be
useful in a future commit where shadow pages are allocated during VM
ioctls for eager page splitting, and thus will use a different set of
caches.

Preemptively pull the caches out all the way to
kvm_mmu_get_shadow_page() since eager page splitting will not be calling
kvm_mmu_alloc_shadow_page() directly.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-11-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 04:51:56 -04:00
David Matlack
be91177133 KVM: x86/mmu: Move guest PT write-protection to account_shadowed()
Move the code that write-protects newly-shadowed guest page tables into
account_shadowed(). This avoids an extra gfn-to-memslot lookup and is a
more logical place for this code to live. But most importantly, this
reduces kvm_mmu_alloc_shadow_page()'s reliance on having a struct
kvm_vcpu pointer, which will be necessary when creating new shadow pages
during VM ioctls for eager page splitting.

Note, it is safe to drop the role.level == PG_LEVEL_4K check since
account_shadowed() returns early if role.level > PG_LEVEL_4K.

No functional change intended.

Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-10-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 04:51:55 -04:00
David Matlack
876546436d KVM: x86/mmu: Rename shadow MMU functions that deal with shadow pages
Rename 2 functions:

  kvm_mmu_get_page() -> kvm_mmu_get_shadow_page()
  kvm_mmu_free_page() -> kvm_mmu_free_shadow_page()

This change makes it clear that these functions deal with shadow pages
rather than struct pages. It also aligns these functions with the naming
scheme for kvm_mmu_find_shadow_page() and kvm_mmu_alloc_shadow_page().

Prefer "shadow_page" over the shorter "sp" since these are core
functions and the line lengths aren't terrible.

No functional change intended.

Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-9-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 04:51:55 -04:00
David Matlack
c306aec81a KVM: x86/mmu: Consolidate shadow page allocation and initialization
Consolidate kvm_mmu_alloc_page() and kvm_mmu_alloc_shadow_page() under
the latter so that all shadow page allocation and initialization happens
in one place.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-8-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 04:51:54 -04:00
David Matlack
94c8136448 KVM: x86/mmu: Decompose kvm_mmu_get_page() into separate functions
Decompose kvm_mmu_get_page() into separate helper functions to increase
readability and prepare for allocating shadow pages without a vcpu
pointer.

Specifically, pull the guts of kvm_mmu_get_page() into 2 helper
functions:

kvm_mmu_find_shadow_page() -
  Walks the page hash checking for any existing mmu pages that match the
  given gfn and role.

kvm_mmu_alloc_shadow_page() -
  Allocates and initializes an entirely new kvm_mmu_page. This currently
  requires a vcpu pointer for allocation and looking up the memslot but
  that will be removed in a future commit.

No functional change intended.

Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-7-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 04:51:54 -04:00
David Matlack
7f49777550 KVM: x86/mmu: Always pass 0 for @quadrant when gptes are 8 bytes
The quadrant is only used when gptes are 4 bytes, but
mmu_alloc_{direct,shadow}_roots() pass in a non-zero quadrant for PAE
page directories regardless. Make this less confusing by only passing in
a non-zero quadrant when it is actually necessary.

Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-6-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 04:51:54 -04:00
David Matlack
2e65e842c5 KVM: x86/mmu: Derive shadow MMU page role from parent
Instead of computing the shadow page role from scratch for every new
page, derive most of the information from the parent shadow page.  This
eliminates the dependency on the vCPU root role to allocate shadow page
tables, and reduces the number of parameters to kvm_mmu_get_page().

Preemptively split out the role calculation to a separate function for
use in a following commit.

Note that when calculating the MMU root role, we can take
@role.passthrough, @role.direct, and @role.access directly from
@vcpu->arch.mmu->root_role. Only @role.level and @role.quadrant still
must be overridden for PAE page directories, when shadowing 32-bit
guest page tables with PAE page tables.

No functional change intended.

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-5-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 04:51:53 -04:00
David Matlack
86938ab692 KVM: x86/mmu: Stop passing "direct" to mmu_alloc_root()
The "direct" argument is vcpu->arch.mmu->root_role.direct,
because unlike non-root page tables, it's impossible to have
a direct root in an indirect MMU.  So just use that.

Suggested-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-4-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 04:51:53 -04:00
David Matlack
27a59d57f0 KVM: x86/mmu: Use a bool for direct
The parameter "direct" can either be true or false, and all of the
callers pass in a bool variable or true/false literal, so just use the
type bool.

No functional change intended.

Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-3-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 04:51:53 -04:00
David Matlack
bb924ca69f KVM: x86/mmu: Optimize MMU page cache lookup for all direct SPs
Commit fb58a9c345 ("KVM: x86/mmu: Optimize MMU page cache lookup for
fully direct MMUs") skipped the unsync checks and write flood clearing
for full direct MMUs. We can extend this further to skip the checks for
all direct shadow pages. Direct shadow pages in indirect MMUs (i.e.
shadow paging) are used when shadowing a guest huge page with smaller
pages. Such direct shadow pages, like their counterparts in fully direct
MMUs, are never marked unsynced or have a non-zero write-flooding count.

Checking sp->role.direct also generates better code than checking
direct_map because, due to register pressure, direct_map has to get
shoved onto the stack and then pulled back off.

No functional change intended.

Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-2-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 04:51:52 -04:00
Ben Gardon
084cc29f8b KVM: x86/MMU: Allow NX huge pages to be disabled on a per-vm basis
In some cases, the NX hugepage mitigation for iTLB multihit is not
needed for all guests on a host. Allow disabling the mitigation on a
per-VM basis to avoid the performance hit of NX hugepages on trusted
workloads.

In order to disable NX hugepages on a VM, ensure that the userspace
actor has permission to reboot the system. Since disabling NX hugepages
would allow a guest to crash the system, it is similar to reboot
permissions.

Ideally, KVM would require userspace to prove it has access to KVM's
nx_huge_pages module param, e.g. so that userspace can opt out without
needing full reboot permissions.  But getting access to the module param
file info is difficult because it is buried in layers of sysfs and module
glue. Requiring CAP_SYS_BOOT is sufficient for all known use cases.

Suggested-by: Jim Mattson <jmattson@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220613212523.3436117-9-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 04:51:49 -04:00
Ben Gardon
1c4dc57328 KVM: x86: Fix errant brace in KVM capability handling
The braces around the KVM_CAP_XSAVE2 block also surround the
KVM_CAP_PMU_CAPABILITY block, likely the result of a merge issue. Simply
move the curly brace back to where it belongs.

Fixes: ba7bb663f5 ("KVM: x86: Provide per VM capability for disabling PMU virtualization")

Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220613212523.3436117-8-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 04:51:48 -04:00
Peter Gonda
6defa24d3b KVM: SEV: Init target VMCBs in sev_migrate_from
The target VMCBs during an intra-host migration need to be correctly set up
for running SEV and SEV-ES guests. Add a sev_init_vmcb() function and make
sev_es_init_vmcb() static. sev_init_vmcb() uses the now-private function
to init SEV-ES guests' VMCBs when needed.

Fixes: 0b020f5af0 ("KVM: SEV: Add support for SEV-ES intra host migration")
Fixes: b56639318b ("KVM: SEV: Add support for SEV intra host migration")
Signed-off-by: Peter Gonda <pgonda@google.com>
Cc: Marc Orr <marcorr@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Message-Id: <20220623173406.744645-1-pgonda@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 04:10:18 -04:00
Mingwei Zhang
ebdec859fa KVM: x86/svm: add __GFP_ACCOUNT to __sev_dbg_{en,de}crypt_user()
Add the accounting flag when allocating pages within the SEV functions,
since these memory pages should belong to the individual VM.

No functional change intended.

Signed-off-by: Mingwei Zhang <mizhang@google.com>
Message-Id: <20220623171858.2083637-1-mizhang@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 03:58:06 -04:00
Sean Christopherson
bfbcc81bb8 KVM: x86: Add a quirk for KVM's "MONITOR/MWAIT are NOPs!" behavior
Add a quirk for KVM's behavior of emulating intercepted MONITOR/MWAIT
instructions as NOPs regardless of whether or not they are supported in
guest CPUID.  KVM's current behavior was likely motivated by a certain
fruity operating system that expects MONITOR/MWAIT to be supported
unconditionally and blindly executes MONITOR/MWAIT without first checking
CPUID.  And because KVM does NOT advertise MONITOR/MWAIT to userspace,
that's effectively the default setup for any VMM that regurgitates
KVM_GET_SUPPORTED_CPUID to KVM_SET_CPUID2.

Note, this quirk interacts with KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT.  The
behavior is actually desirable, as userspace VMMs that want to
unconditionally hide MONITOR/MWAIT from the guest can leave the
MISC_ENABLE quirk enabled.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220608224516.3788274-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-20 11:50:42 -04:00
Sean Christopherson
ff81a90f45 KVM: x86: Ignore benign host writes to "unsupported" F15H_PERF_CTL MSRs
Ignore host userspace writes of '0' to F15H_PERF_CTL MSRs that KVM reports
in the MSR-to-save list but ultimately does not support.  All
MSRs in said list must be writable by userspace, e.g. if userspace sends
the list back at KVM without filtering out the MSRs it doesn't need.

Note, reads of said MSRs already have the desired behavior.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220611005755.753273-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-20 11:50:33 -04:00