Define PT_MAX_FULL_LEVELS as PT64_ROOT_MAX_LEVEL, i.e. 5, to fix shadow
paging for 5-level guest page tables. PT_MAX_FULL_LEVELS is used to
size the arrays that track guest pages table information, i.e. using a
"max levels" of 4 causes KVM to access garbage beyond the end of an
array when querying state for level 5 entries. E.g. FNAME(gpte_changed)
will read garbage and most likely return %true for a level 5 entry,
soft-hanging the guest because FNAME(fetch) will restart the guest
instead of creating SPTEs because it thinks the guest PTE has changed.
Note, KVM doesn't yet support 5-level nested EPT, so PT_MAX_FULL_LEVELS
gets to stay "4" for the PTTYPE_EPT case.
Fixes: 855feb6736 ("KVM: MMU: Add 5 level EPT & Shadow page table support.")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fold max_mapping_level() into kvm_mmu_hugepage_adjust() now that HugeTLB
mappings are handled in kvm_mmu_hugepage_adjust(), i.e. there isn't a
need to pre-calculate the max mapping level. Co-locating all hugepage
checks eliminates a memslot lookup, at the cost of performing the
__mmu_gfn_lpage_is_disallowed() checks while holding mmu_lock.
The latency of lpage_is_disallowed() is likely negligible relative to
the rest of the code run while holding mmu_lock, and can be offset to
some extent by eliminating the mmu_gfn_lpage_is_disallowed() check in
set_spte() in a future patch. Eliminating the check in set_spte() is
made possible by performing the initial lpage_is_disallowed() checks
while holding mmu_lock.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove logic to retrieve the original gfn now that HugeTLB mappings are
are identified in FNAME(fetch), i.e. FNAME(page_fault) no longer adjusts
the level or gfn.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove KVM's HugeTLB specific logic and instead rely on walking the host
page tables (already done for THP) to identify HugeTLB mappings.
Eliminating the HugeTLB-only logic avoids taking mmap_sem and calling
find_vma() for all hugepage compatible page faults, and simplifies KVM's
page fault code by consolidating all hugepage adjustments into a common
helper.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Refactor transparent_hugepage_adjust() in preparation for walking the
host page tables to identify hugepage mappings, initially for THP pages,
and eventualy for HugeTLB and DAX-backed pages as well. The latter
cases support 1gb pages, i.e. the adjustment logic needs access to the
max allowed level.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rework the handling of nEPT's bad memtype/XWR checks to micro-optimize
the checks as much as possible. Move the check to a separate helper,
__is_bad_mt_xwr(), which allows the guest_rsvd_check usage in
paging_tmpl.h to omit the check entirely for paging32/64 (bad_mt_xwr is
always zero for non-nEPT) while retaining the bitwise-OR of the current
code for the shadow_zero_check in walk_shadow_page_get_mmio_spte().
Add a comment for the bitwise-OR usage in the mmio spte walk to avoid
future attempts to "fix" the code, which is what prompted this
optimization in the first place[*].
Opportunistically remove the superfluous '!= 0' and parantheses, and
use BIT_ULL() instead of open coding its equivalent.
The net effect is that code generation is largely unchanged for
walk_shadow_page_get_mmio_spte(), marginally better for
ept_prefetch_invalid_gpte(), and significantly improved for
paging32/64_prefetch_invalid_gpte().
Note, walk_shadow_page_get_mmio_spte() can't use a templated version of
the memtype/XRW as it works on the host's shadow PTEs, e.g. checks that
KVM hasn't borked its EPT tables. Even if it could be templated, the
benefits of having a single implementation far outweight the few uops
that would be saved for NPT or non-TDP paging, e.g. most compilers
inline it all the way to up kvm_mmu_page_fault().
[*] https://lkml.kernel.org/r/20200108001859.25254-1-sean.j.christopherson@intel.com
Cc: Jim Mattson <jmattson@google.com>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
WARN on the existing invalid root_hpa checks in __direct_map() and
FNAME(fetch). The "legitimate" path that invalidated root_hpa in the
middle of a page fault is long since gone, i.e. it should no longer be
impossible to invalidate in the middle of a page fault[*].
The root_hpa checks were added by two related commits
989c6b34f6 ("KVM: MMU: handle invalid root_hpa at __direct_map")
37f6a4e237 ("KVM: x86: handle invalid root_hpa everywhere")
to fix a bug where nested_vmx_vmexit() could be called *in the middle*
of a page fault. At the time, vmx_interrupt_allowed(), which was and
still is used by kvm_can_do_async_pf() via ->interrupt_allowed(),
directly invoked nested_vmx_vmexit() to switch from L2 to L1 to emulate
a VM-Exit on a pending interrupt. Emulating the nested VM-Exit resulted
in root_hpa being invalidated by kvm_mmu_reset_context() without
explicitly terminating the page fault.
Now that root_hpa is checked for validity by kvm_mmu_page_fault(), WARN
on an invalid root_hpa to detect any flows that reset the MMU while
handling a page fault. The broken vmx_interrupt_allowed() behavior has
long since been fixed and resetting the MMU during a page fault should
not be considered legal behavior.
[*] It's actually technically possible in FNAME(page_fault)() because it
calls inject_page_fault() when the guest translation is invalid, but
in that case the page fault handling is immediately terminated.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the calls to thp_adjust() down a level from the page fault handlers
to the map/fetch helpers and remove the page count shuffling done in
thp_adjust().
Despite holding a reference to the underlying page while processing a
page fault, the page fault flows don't actually rely on holding a
reference to the page when thp_adjust() is called. At that point, the
fault handlers hold mmu_lock, which prevents mmu_notifier from completing
any invalidations, and have verified no invalidations from mmu_notifier
have occurred since the page reference was acquired (which is done prior
to taking mmu_lock).
The kvm_release_pfn_clean()/kvm_get_pfn() dance in thp_adjust() is a
quirk that is necessitated because thp_adjust() modifies the pfn that is
consumed by its caller. Because the page fault handlers call
kvm_release_pfn_clean() on said pfn, thp_adjust() needs to transfer the
reference to the correct pfn purely for correctness when the pfn is
released.
Calling thp_adjust() from __direct_map() and FNAME(fetch) means the pfn
adjustment doesn't change the pfn as seen by the page fault handlers,
i.e. the pfn released by the page fault handlers is the same pfn that
was returned by gfn_to_pfn().
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Restrict the max level for a shadow page based on the guest's level
instead of capping the level after the fact for host-mapped huge pages,
e.g. hugetlbfs pages. Explicitly capping the max level using the guest
mapping level also eliminates FNAME(page_fault)'s subtle dependency on
THP only supporting 2mb pages.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Refactor the page fault handlers and mapping_level() to track the max
allowed page level instead of only tracking if a 4k page is mandatory
due to one restriction or another. This paves the way for cleanly
consolidating tdp_page_fault() and nonpaging_page_fault(), and for
eliminating a redundant check on mmu_gfn_lpage_is_disallowed().
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Convert a plethora of parameters and variables in the MMU and page fault
flows from type gva_t to gpa_t to properly handle TDP on 32-bit KVM.
Thanks to PSE and PAE paging, 32-bit kernels can access 64-bit physical
addresses. When TDP is enabled, the fault address is a guest physical
address and thus can be a 64-bit value, even when both KVM and its guest
are using 32-bit virtual addressing, e.g. VMX's VMCS.GUEST_PHYSICAL is a
64-bit field, not a natural width field.
Using a gva_t for the fault address means KVM will incorrectly drop the
upper 32-bits of the GPA. Ditto for gva_to_gpa() when it is used to
translate L2 GPAs to L1 GPAs.
Opportunistically rename variables and parameters to better reflect the
dual address modes, e.g. use "cr2_or_gpa" for fault addresses and plain
"addr" instead of "vaddr" when the address may be either a GVA or an L2
GPA. Similarly, use "gpa" in the nonpaging_page_fault() flows to avoid
a confusing "gpa_t gva" declaration; this also sets the stage for a
future patch to combing nonpaging_page_fault() and tdp_page_fault() with
minimal churn.
Sprinkle in a few comments to document flows where an address is known
to be a GVA and thus can be safely truncated to a 32-bit value. Add
WARNs in kvm_handle_page_fault() and FNAME(gva_to_gpa_nested)() to help
document such cases and detect bugs.
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Preparatory work for shattering mmu.c into multiple files. Besides making it easier
to follow, this will also make it possible to write unit tests for various parts.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>