Commit Graph

13 Commits

Author SHA1 Message Date
Sean Christopherson
f6ab0107a4 KVM: x86/mmu: Fix struct guest_walker arrays for 5-level paging
Define PT_MAX_FULL_LEVELS as PT64_ROOT_MAX_LEVEL, i.e. 5, to fix shadow
paging for 5-level guest page tables.  PT_MAX_FULL_LEVELS is used to
size the arrays that track guest pages table information, i.e. using a
"max levels" of 4 causes KVM to access garbage beyond the end of an
array when querying state for level 5 entries.  E.g. FNAME(gpte_changed)
will read garbage and most likely return %true for a level 5 entry,
soft-hanging the guest because FNAME(fetch) will restart the guest
instead of creating SPTEs because it thinks the guest PTE has changed.

Note, KVM doesn't yet support 5-level nested EPT, so PT_MAX_FULL_LEVELS
gets to stay "4" for the PTTYPE_EPT case.

Fixes: 855feb6736 ("KVM: MMU: Add 5 level EPT & Shadow page table support.")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-02-12 20:09:44 +01:00
Sean Christopherson
293e306e7f KVM: x86/mmu: Fold max_mapping_level() into kvm_mmu_hugepage_adjust()
Fold max_mapping_level() into kvm_mmu_hugepage_adjust() now that HugeTLB
mappings are handled in kvm_mmu_hugepage_adjust(), i.e. there isn't a
need to pre-calculate the max mapping level.  Co-locating all hugepage
checks eliminates a memslot lookup, at the cost of performing the
__mmu_gfn_lpage_is_disallowed() checks while holding mmu_lock.

The latency of lpage_is_disallowed() is likely negligible relative to
the rest of the code run while holding mmu_lock, and can be offset to
some extent by eliminating the mmu_gfn_lpage_is_disallowed() check in
set_spte() in a future patch.  Eliminating the check in set_spte() is
made possible by performing the initial lpage_is_disallowed() checks
while holding mmu_lock.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-27 20:00:08 +01:00
Sean Christopherson
09c4453ee8 KVM: x86/mmu: Remove obsolete gfn restoration in FNAME(fetch)
Remove logic to retrieve the original gfn now that HugeTLB mappings are
are identified in FNAME(fetch), i.e. FNAME(page_fault) no longer adjusts
the level or gfn.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-27 20:00:07 +01:00
Sean Christopherson
83f06fa7a6 KVM: x86/mmu: Rely on host page tables to find HugeTLB mappings
Remove KVM's HugeTLB specific logic and instead rely on walking the host
page tables (already done for THP) to identify HugeTLB mappings.
Eliminating the HugeTLB-only logic avoids taking mmap_sem and calling
find_vma() for all hugepage compatible page faults, and simplifies KVM's
page fault code by consolidating all hugepage adjustments into a common
helper.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-27 20:00:06 +01:00
Sean Christopherson
17eff01904 KVM: x86/mmu: Refactor THP adjust to prep for changing query
Refactor transparent_hugepage_adjust() in preparation for walking the
host page tables to identify hugepage mappings, initially for THP pages,
and eventualy for HugeTLB and DAX-backed pages as well.  The latter
cases support 1gb pages, i.e. the adjustment logic needs access to the
max allowed level.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-27 20:00:04 +01:00
Sean Christopherson
b5c3c1b3c6 KVM: x86/mmu: Micro-optimize nEPT's bad memptype/XWR checks
Rework the handling of nEPT's bad memtype/XWR checks to micro-optimize
the checks as much as possible.  Move the check to a separate helper,
__is_bad_mt_xwr(), which allows the guest_rsvd_check usage in
paging_tmpl.h to omit the check entirely for paging32/64 (bad_mt_xwr is
always zero for non-nEPT) while retaining the bitwise-OR of the current
code for the shadow_zero_check in walk_shadow_page_get_mmio_spte().

Add a comment for the bitwise-OR usage in the mmio spte walk to avoid
future attempts to "fix" the code, which is what prompted this
optimization in the first place[*].

Opportunistically remove the superfluous '!= 0' and parantheses, and
use BIT_ULL() instead of open coding its equivalent.

The net effect is that code generation is largely unchanged for
walk_shadow_page_get_mmio_spte(), marginally better for
ept_prefetch_invalid_gpte(), and significantly improved for
paging32/64_prefetch_invalid_gpte().

Note, walk_shadow_page_get_mmio_spte() can't use a templated version of
the memtype/XRW as it works on the host's shadow PTEs, e.g. checks that
KVM hasn't borked its EPT tables.  Even if it could be templated, the
benefits of having a single implementation far outweight the few uops
that would be saved for NPT or non-TDP paging, e.g. most compilers
inline it all the way to up kvm_mmu_page_fault().

[*] https://lkml.kernel.org/r/20200108001859.25254-1-sean.j.christopherson@intel.com

Cc: Jim Mattson <jmattson@google.com>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-21 14:45:31 +01:00
Sean Christopherson
f8052a053a KVM: x86/mmu: Reorder the reserved bit check in prefetch_invalid_gpte()
Move the !PRESENT and !ACCESSED checks in FNAME(prefetch_invalid_gpte)
above the call to is_rsvd_bits_set().  For a well behaved guest, the
!PRESENT and !ACCESSED are far more likely to evaluate true than the
reserved bit checks, and they do not require additional memory accesses.

Before:
 Dump of assembler code for function paging32_prefetch_invalid_gpte:
   0x0000000000044240 <+0>:     callq  0x44245 <paging32_prefetch_invalid_gpte+5>
   0x0000000000044245 <+5>:     mov    %rcx,%rax
   0x0000000000044248 <+8>:     shr    $0x7,%rax
   0x000000000004424c <+12>:    and    $0x1,%eax
   0x000000000004424f <+15>:    lea    0x0(,%rax,4),%r8
   0x0000000000044257 <+23>:    add    %r8,%rax
   0x000000000004425a <+26>:    mov    %rcx,%r8
   0x000000000004425d <+29>:    and    0x120(%rsi,%rax,8),%r8
   0x0000000000044265 <+37>:    mov    0x170(%rsi),%rax
   0x000000000004426c <+44>:    shr    %cl,%rax
   0x000000000004426f <+47>:    and    $0x1,%eax
   0x0000000000044272 <+50>:    or     %rax,%r8
   0x0000000000044275 <+53>:    jne    0x4427c <paging32_prefetch_invalid_gpte+60>
   0x0000000000044277 <+55>:    test   $0x1,%cl
   0x000000000004427a <+58>:    jne    0x4428a <paging32_prefetch_invalid_gpte+74>
   0x000000000004427c <+60>:    mov    %rdx,%rsi
   0x000000000004427f <+63>:    callq  0x44080 <drop_spte>
   0x0000000000044284 <+68>:    mov    $0x1,%eax
   0x0000000000044289 <+73>:    retq
   0x000000000004428a <+74>:    xor    %eax,%eax
   0x000000000004428c <+76>:    and    $0x20,%ecx
   0x000000000004428f <+79>:    jne    0x44289 <paging32_prefetch_invalid_gpte+73>
   0x0000000000044291 <+81>:    mov    %rdx,%rsi
   0x0000000000044294 <+84>:    callq  0x44080 <drop_spte>
   0x0000000000044299 <+89>:    mov    $0x1,%eax
   0x000000000004429e <+94>:    jmp    0x44289 <paging32_prefetch_invalid_gpte+73>
 End of assembler dump.

After:
 Dump of assembler code for function paging32_prefetch_invalid_gpte:
   0x0000000000044240 <+0>:     callq  0x44245 <paging32_prefetch_invalid_gpte+5>
   0x0000000000044245 <+5>:     test   $0x1,%cl
   0x0000000000044248 <+8>:     je     0x4424f <paging32_prefetch_invalid_gpte+15>
   0x000000000004424a <+10>:    test   $0x20,%cl
   0x000000000004424d <+13>:    jne    0x4425d <paging32_prefetch_invalid_gpte+29>
   0x000000000004424f <+15>:    mov    %rdx,%rsi
   0x0000000000044252 <+18>:    callq  0x44080 <drop_spte>
   0x0000000000044257 <+23>:    mov    $0x1,%eax
   0x000000000004425c <+28>:    retq
   0x000000000004425d <+29>:    mov    %rcx,%rax
   0x0000000000044260 <+32>:    mov    (%rsi),%rsi
   0x0000000000044263 <+35>:    shr    $0x7,%rax
   0x0000000000044267 <+39>:    and    $0x1,%eax
   0x000000000004426a <+42>:    lea    0x0(,%rax,4),%r8
   0x0000000000044272 <+50>:    add    %r8,%rax
   0x0000000000044275 <+53>:    mov    %rcx,%r8
   0x0000000000044278 <+56>:    and    0x120(%rsi,%rax,8),%r8
   0x0000000000044280 <+64>:    mov    0x170(%rsi),%rax
   0x0000000000044287 <+71>:    shr    %cl,%rax
   0x000000000004428a <+74>:    and    $0x1,%eax
   0x000000000004428d <+77>:    mov    %rax,%rcx
   0x0000000000044290 <+80>:    xor    %eax,%eax
   0x0000000000044292 <+82>:    or     %rcx,%r8
   0x0000000000044295 <+85>:    je     0x4425c <paging32_prefetch_invalid_gpte+28>
   0x0000000000044297 <+87>:    mov    %rdx,%rsi
   0x000000000004429a <+90>:    callq  0x44080 <drop_spte>
   0x000000000004429f <+95>:    mov    $0x1,%eax
   0x00000000000442a4 <+100>:   jmp    0x4425c <paging32_prefetch_invalid_gpte+28>
 End of assembler dump.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-21 14:45:30 +01:00
Sean Christopherson
0c7a98e34d KVM: x86/mmu: WARN on an invalid root_hpa
WARN on the existing invalid root_hpa checks in __direct_map() and
FNAME(fetch).  The "legitimate" path that invalidated root_hpa in the
middle of a page fault is long since gone, i.e. it should no longer be
impossible to invalidate in the middle of a page fault[*].

The root_hpa checks were added by two related commits

  989c6b34f6 ("KVM: MMU: handle invalid root_hpa at __direct_map")
  37f6a4e237 ("KVM: x86: handle invalid root_hpa everywhere")

to fix a bug where nested_vmx_vmexit() could be called *in the middle*
of a page fault.  At the time, vmx_interrupt_allowed(), which was and
still is used by kvm_can_do_async_pf() via ->interrupt_allowed(),
directly invoked nested_vmx_vmexit() to switch from L2 to L1 to emulate
a VM-Exit on a pending interrupt.  Emulating the nested VM-Exit resulted
in root_hpa being invalidated by kvm_mmu_reset_context() without
explicitly terminating the page fault.

Now that root_hpa is checked for validity by kvm_mmu_page_fault(), WARN
on an invalid root_hpa to detect any flows that reset the MMU while
handling a page fault.  The broken vmx_interrupt_allowed() behavior has
long since been fixed and resetting the MMU during a page fault should
not be considered legal behavior.

[*] It's actually technically possible in FNAME(page_fault)() because it
    calls inject_page_fault() when the guest translation is invalid, but
    in that case the page fault handling is immediately terminated.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-08 18:16:08 +01:00
Sean Christopherson
4cd071d13c KVM: x86/mmu: Move calls to thp_adjust() down a level
Move the calls to thp_adjust() down a level from the page fault handlers
to the map/fetch helpers and remove the page count shuffling done in
thp_adjust().

Despite holding a reference to the underlying page while processing a
page fault, the page fault flows don't actually rely on holding a
reference to the page when thp_adjust() is called.  At that point, the
fault handlers hold mmu_lock, which prevents mmu_notifier from completing
any invalidations, and have verified no invalidations from mmu_notifier
have occurred since the page reference was acquired (which is done prior
to taking mmu_lock).

The kvm_release_pfn_clean()/kvm_get_pfn() dance in thp_adjust() is a
quirk that is necessitated because thp_adjust() modifies the pfn that is
consumed by its caller.  Because the page fault handlers call
kvm_release_pfn_clean() on said pfn, thp_adjust() needs to transfer the
reference to the correct pfn purely for correctness when the pfn is
released.

Calling thp_adjust() from __direct_map() and FNAME(fetch) means the pfn
adjustment doesn't change the pfn as seen by the page fault handlers,
i.e. the pfn released by the page fault handlers is the same pfn that
was returned by gfn_to_pfn().

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-08 18:16:07 +01:00
Sean Christopherson
cbe1e6f035 KVM: x86/mmu: Incorporate guest's page level into max level for shadow MMU
Restrict the max level for a shadow page based on the guest's level
instead of capping the level after the fact for host-mapped huge pages,
e.g. hugetlbfs pages.  Explicitly capping the max level using the guest
mapping level also eliminates FNAME(page_fault)'s subtle dependency on
THP only supporting 2mb pages.

No functional change intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-08 18:16:05 +01:00
Sean Christopherson
39ca1ecb78 KVM: x86/mmu: Refactor handling of forced 4k pages in page faults
Refactor the page fault handlers and mapping_level() to track the max
allowed page level instead of only tracking if a 4k page is mandatory
due to one restriction or another.  This paves the way for cleanly
consolidating tdp_page_fault() and nonpaging_page_fault(), and for
eliminating a redundant check on mmu_gfn_lpage_is_disallowed().

No functional change intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-08 18:16:05 +01:00
Sean Christopherson
736c291c9f KVM: x86: Use gpa_t for cr2/gpa to fix TDP support on 32-bit KVM
Convert a plethora of parameters and variables in the MMU and page fault
flows from type gva_t to gpa_t to properly handle TDP on 32-bit KVM.

Thanks to PSE and PAE paging, 32-bit kernels can access 64-bit physical
addresses.  When TDP is enabled, the fault address is a guest physical
address and thus can be a 64-bit value, even when both KVM and its guest
are using 32-bit virtual addressing, e.g. VMX's VMCS.GUEST_PHYSICAL is a
64-bit field, not a natural width field.

Using a gva_t for the fault address means KVM will incorrectly drop the
upper 32-bits of the GPA.  Ditto for gva_to_gpa() when it is used to
translate L2 GPAs to L1 GPAs.

Opportunistically rename variables and parameters to better reflect the
dual address modes, e.g. use "cr2_or_gpa" for fault addresses and plain
"addr" instead of "vaddr" when the address may be either a GVA or an L2
GPA.  Similarly, use "gpa" in the nonpaging_page_fault() flows to avoid
a confusing "gpa_t gva" declaration; this also sets the stage for a
future patch to combing nonpaging_page_fault() and tdp_page_fault() with
minimal churn.

Sprinkle in a few comments to document flows where an address is known
to be a GVA and thus can be safely truncated to a 32-bit value.  Add
WARNs in kvm_handle_page_fault() and FNAME(gva_to_gpa_nested)() to help
document such cases and detect bugs.

Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-08 18:16:02 +01:00
Paolo Bonzini
c50d8ae3a1 KVM: x86: create mmu/ subdirectory
Preparatory work for shattering mmu.c into multiple files.  Besides making it easier
to follow, this will also make it possible to write unit tests for various parts.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-11-21 12:03:50 +01:00