linux

korg/linux

mirror of https://mirrors.bfsu.edu.cn/git/linux.git synced 2024-12-25 12:04:46 +08:00

Author	SHA1	Message	Date
Paolo Bonzini	2fdcc1b324	KVM: x86/mmu: Avoid indirect call for get_cr3 Most of the time, calls to get_guest_pgd result in calling kvm_read_cr3 (the exception is only nested TDP). Hardcode the default instead of using the get_cr3 function, avoiding a retpoline if they are enabled. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Mathias Krause <minipli@grsecurity.net> Link: https://lore.kernel.org/r/20230322013731.102955-2-minipli@grsecurity.net Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-22 07:46:42 -07:00
Lai Jiangshan	19ace7d6ca	KVM: x86/mmu: Skip calling mmu->sync_spte() when the spte is 0 Sync the spte only when the spte is set and avoid the indirect branch. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Link: https://lore.kernel.org/r/20230216235321.735214-5-jiangshanlai@gmail.com [sean: add wrapper instead of open coding each check] Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-16 17:19:55 -07:00
Lai Jiangshan	91ca7672dc	kvm: x86/mmu: Remove @no_dirty_log from FNAME(prefetch_gpte) FNAME(prefetch_gpte) is always called with @no_dirty_log=true. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Link: https://lore.kernel.org/r/20230216235321.735214-4-jiangshanlai@gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-16 17:19:55 -07:00
Lai Jiangshan	9fd4a4e3a3	KVM: x86/mmu: Remove FNAME(invlpg) and use FNAME(sync_spte) to update vTLB instead. In hardware TLB, invalidating TLB entries means the translations are removed from the TLB. In KVM shadowed vTLB, the translations (combinations of shadow paging and hardware TLB) are generally maintained as long as they remain "clean" when the TLB of an address space (i.e. a PCID or all) is flushed with the help of write-protections, sp->unsync, and kvm_sync_page(), where "clean" in this context means that no updates to KVM's SPTEs are needed. However, FNAME(invlpg) always zaps/removes the vTLB if the shadow page is unsync, and thus triggers a remote flush even if the original vTLB entry is clean, i.e. is usable as-is. Besides this, FNAME(invlpg) is largely is a duplicate implementation of FNAME(sync_spte) to invalidate a vTLB entry. To address both issues, reuse FNAME(sync_spte) to share the code and slightly modify the semantics, i.e. keep the vTLB entry if it's "clean" and avoid remote TLB flush. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Link: https://lore.kernel.org/r/20230216235321.735214-3-jiangshanlai@gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-16 17:19:54 -07:00
Lai Jiangshan	ed335278bd	KVM: x86/mmu: Allow the roots to be invalid in FNAME(invlpg) Don't assume the current root to be valid, just check it and remove the WARN(). Also move the code to check if the root is valid into FNAME(invlpg) to simplify the code. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Link: https://lore.kernel.org/r/20230216235321.735214-2-jiangshanlai@gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-16 17:19:53 -07:00
Lai Jiangshan	e6722d9211	KVM: x86/mmu: Reduce the update to the spte in FNAME(sync_spte) Sometimes when the guest updates its pagetable, it adds only new gptes to it without changing any existed one, so there is no point to update the sptes for these existed gptes. Also when the sptes for these unchanged gptes are updated, the AD bits are also removed since make_spte() is called with prefetch=true which might result unneeded TLB flushing. Just do nothing if the gpte's permissions are unchanged. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Link: https://lore.kernel.org/r/20230216154115.710033-7-jiangshanlai@gmail.com [sean: expand comment to call out A/D bits] Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-16 17:19:48 -07:00
Lai Jiangshan	c3c6c9fc5d	KVM: x86/mmu: Move the code out of FNAME(sync_page)'s loop body into mmu.c Rename mmu->sync_page to mmu->sync_spte and move the code out of FNAME(sync_page)'s loop body into mmu.c. No functionalities change intended. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Link: https://lore.kernel.org/r/20230216154115.710033-6-jiangshanlai@gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-16 17:19:44 -07:00
Lai Jiangshan	90e444702a	KVM: x86/mmu: Move the check in FNAME(sync_page) as kvm_sync_page_check() Prepare to check mmu->sync_page pointer before calling it. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Link: https://lore.kernel.org/r/20230216154115.710033-3-jiangshanlai@gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-16 12:42:15 -07:00
Lai Jiangshan	753b43c9d1	KVM: x86/mmu: Use 64-bit address to invalidate to fix a subtle bug FNAME(invlpg)() and kvm_mmu_invalidate_gva() take a gva_t, i.e. unsigned long, as the type of the address to invalidate. On 32-bit kernels, the upper 32 bits of the GPA will get dropped when an L2 GPA address is invalidated in the shadowed nested TDP MMU. Convert it to u64 to fix the problem. Reported-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Link: https://lore.kernel.org/r/20230216154115.710033-2-jiangshanlai@gmail.com [sean: tweak changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-16 12:41:05 -07:00
Lai Jiangshan	9a96770049	KVM: x86/mmu: Remove FNAME(is_self_change_mapping) Drop FNAME(is_self_change_mapping) and instead rely on kvm_mmu_hugepage_adjust() to adjust the hugepage accordingly. Prior to commit `4cd071d13c` ("KVM: x86/mmu: Move calls to thp_adjust() down a level"), the hugepage adjustment was done before allocating new shadow pages, i.e. failed to restrict the hugepage sizes if a new shadow page resulted in account_shadowed() changing the disallowed hugepage tracking. Removing FNAME(is_self_change_mapping) fixes a bug reported by Huang Hang where KVM unnecessarily forces a 4KiB page. FNAME(is_self_change_mapping) has a defect in that it blindly disables _all_ hugepage mappings rather than trying to reduce the size of the hugepage. If the guest is writing to a 1GiB page and the 1GiB is self-referential but a 2MiB page is not, then KVM can and should create a 2MiB mapping. Add a comment above the call to kvm_mmu_hugepage_adjust() to call out the new dependency on adjusting the hugepage size after walking indirect PTEs. Reported-by: Huang Hang <hhuang@linux.alibaba.com> Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Link: https://lore.kernel.org/r/20221213125538.81209-1-jiangshanlai@gmail.com [sean: rework changelog after separating out the emulator change] Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230202182817.407394-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-03-14 10:28:57 -04:00
Lai Jiangshan	39fda5d873	KVM: x86/mmu: Detect write #PF to shadow pages during FNAME(fetch) walk Move the detection of write #PF to shadow pages, i.e. a fault on a write to a page table that is being shadowed by KVM that is used to translate the write itself, from FNAME(is_self_change_mapping) to FNAME(fetch). There is no need to detect the self-referential write before kvm_faultin_pfn() as KVM does not consume EMULTYPE_WRITE_PF_TO_SP for accesses that resolve to "error or no-slot" pfns, i.e. KVM doesn't allow retrying MMIO accesses or writes to read-only memslots. Detecting the EMULTYPE_WRITE_PF_TO_SP scenario in FNAME(fetch) will allow dropping FNAME(is_self_change_mapping) entirely, as the hugepage interaction can be deferred to kvm_mmu_hugepage_adjust(). Cc: Huang Hang <hhuang@linux.alibaba.com> Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Link: https://lore.kernel.org/r/20221213125538.81209-1-jiangshanlai@gmail.com [sean: split to separate patch, write changelog] Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230202182817.407394-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-03-14 10:28:56 -04:00
Sean Christopherson	258d985f6e	KVM: x86/mmu: Use EMULTYPE flag to track write #PFs to shadow pages Use a new EMULTYPE flag, EMULTYPE_WRITE_PF_TO_SP, to track page faults on self-changing writes to shadowed page tables instead of propagating that information to the emulator via a semi-persistent vCPU flag. Using a flag in "struct kvm_vcpu_arch" is confusing, especially as implemented, as it's not at all obvious that clearing the flag only when emulation actually occurs is correct. E.g. if KVM sets the flag and then retries the fault without ever getting to the emulator, the flag will be left set for future calls into the emulator. But because the flag is consumed if and only if both EMULTYPE_PF and EMULTYPE_ALLOW_RETRY_PF are set, and because EMULTYPE_ALLOW_RETRY_PF is deliberately not set for direct MMUs, emulated MMIO, or while L2 is active, KVM avoids false positives on a stale flag since FNAME(page_fault) is guaranteed to be run and refresh the flag before it's ultimately consumed by the tail end of reexecute_instruction(). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230202182817.407394-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-03-14 10:28:56 -04:00
Hou Wenlong	1b2dc73604	KVM: x86/mmu: Fix wrong start gfn of tlb flushing with range When a spte is dropped, the start gfn of tlb flushing should be the gfn of spte not the base gfn of SP which contains the spte. Also introduce a helper function to do range-based flushing when a spte is dropped, which would help prevent future buggy use of kvm_flush_remote_tlbs_with_address() in such case. Fixes: `c3134ce240` ("KVM: Replace old tlb flush function with new one to flush a specified range.") Suggested-by: David Matlack <dmatlack@google.com> Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Link: https://lore.kernel.org/r/72ac2169a261976f00c1703e88cda676dfb960f5.1665214747.git.houwenlong.hwl@antgroup.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-01-24 10:05:47 -08:00
Hou Wenlong	c667a3baed	KVM: x86/mmu: Move round_gfn_for_level() helper into mmu_internal.h Rounding down the GFN to a huge page size is a common pattern throughout KVM, so move round_gfn_for_level() helper in tdp_iter.c to mmu_internal.h for common usage. Also rename it as gfn_round_for_level() to use gfn_* prefix and clean up the other call sites. Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Link: https://lore.kernel.org/r/415c64782f27444898db650e21cf28eeb6441dfa.1665214747.git.houwenlong.hwl@antgroup.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-01-24 10:05:45 -08:00
Lai Jiangshan	9e3fbdfd9b	kvm: x86/mmu: Don't clear write flooding for direct SP Although there is no harm, but there is no point to clear write flooding for direct SP. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Link: https://lore.kernel.org/r/20230105100310.6700-1-jiangshanlai@gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-01-24 10:05:44 -08:00
David Matlack	354c908c06	KVM: x86/mmu: Handle no-slot faults in kvm_faultin_pfn() Handle faults on GFNs that do not have a backing memslot in kvm_faultin_pfn() and drop handle_abnormal_pfn(). This eliminates duplicate code in the various page fault handlers. Opportunistically tweak the comment about handling gfn > host.MAXPHYADDR to reflect that the effect of returning RET_PF_EMULATE at that point is to avoid creating an MMIO SPTE for such GFNs. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220921173546.2674386-7-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-12-29 15:33:21 -05:00
David Matlack	ba6e3fe255	KVM: x86/mmu: Grab mmu_invalidate_seq in kvm_faultin_pfn() Grab mmu_invalidate_seq in kvm_faultin_pfn() and stash it in struct kvm_page_fault. The eliminates duplicate code and reduces the amount of parameters needed for is_page_fault_stale(). Preemptively split out __kvm_faultin_pfn() to a separate function for use in subsequent commits. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220921173546.2674386-4-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-12-29 15:33:18 -05:00
Sean Christopherson	55c510e26a	KVM: x86/mmu: Rename NX huge pages fields/functions for consistency Rename most of the variables/functions involved in the NX huge page mitigation to provide consistency, e.g. lpage vs huge page, and NX huge vs huge NX, and also to provide clarity, e.g. to make it obvious the flag applies only to the NX huge page mitigation, not to any condition that prevents creating a huge page. Add a comment explaining what the newly named "possible_nx_huge_pages" tracks. Leave the nx_lpage_splits stat alone as the name is ABI and thus set in stone. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Mingwei Zhang <mizhang@google.com> Message-Id: <20221019165618.927057-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-09 12:31:31 -05:00
Sean Christopherson	428e921611	KVM: x86/mmu: Tag disallowed NX huge pages even if they're not tracked Tag shadow pages that cannot be replaced with an NX huge page regardless of whether or not zapping the page would allow KVM to immediately create a huge page, e.g. because something else prevents creating a huge page. I.e. track pages that are disallowed from being NX huge pages regardless of whether or not the page could have been huge at the time of fault. KVM currently tracks pages that were disallowed from being huge due to the NX workaround if and only if the page could otherwise be huge. But that fails to handled the scenario where whatever restriction prevented KVM from installing a huge page goes away, e.g. if dirty logging is disabled, the host mapping level changes, etc... Failure to tag shadow pages appropriately could theoretically lead to false negatives, e.g. if a fetch fault requests a small page and thus isn't tracked, and a read/write fault later requests a huge page, KVM will not reject the huge page as it should. To avoid yet another flag, initialize the list_head and use list_empty() to determine whether or not a page is on the list of NX huge pages that should be recovered. Note, the TDP MMU accounting is still flawed as fixing the TDP MMU is more involved due to mmu_lock being held for read. This will be addressed in a future commit. Fixes: `5bcaf3e171` ("KVM: x86/mmu: Account NX huge page disallowed iff huge page was requested") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221019165618.927057-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-09 12:31:31 -05:00
Jilin Yuan	b85a97b851	KVM: x86/mmu: fix repeated words in comments Delete the redundant word 'to'. Signed-off-by: Jilin Yuan <yuanjilin@cdjrlc.com> Link: https://lore.kernel.org/r/20220831125217.12313-1-yuanjilin@cdjrlc.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-09-26 12:03:02 -04:00
Chao Peng	20ec3ebd70	KVM: Rename mmu_notifier_* to mmu_invalidate_* The motivation of this renaming is to make these variables and related helper functions less mmu_notifier bound and can also be used for non mmu_notifier based page invalidation. mmu_invalidate_* was chosen to better describe the purpose of 'invalidating' a page that those variables are used for. - mmu_notifier_seq/range_start/range_end are renamed to mmu_invalidate_seq/range_start/range_end. - mmu_notifier_retry{_hva} helper functions are renamed to mmu_invalidate_retry{_hva}. - mmu_notifier_count is renamed to mmu_invalidate_in_progress to avoid confusion with mn_active_invalidate_count. - While here, also update kvm_inc/dec_notifier_count() to kvm_mmu_invalidate_begin/end() to match the change for mmu_notifier_count. No functional change intended. Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com> Message-Id: <20220816125322.1110439-3-chao.p.peng@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-08-19 04:05:41 -04:00
Sean Christopherson	79e48cec6c	KVM: x86/mmu: Add optimized helper to retrieve an SPTE's index Add spte_index() to dedup all the code that calculates a SPTE's index into its parent's page table and/or spt array. Opportunistically tweak the calculation to avoid pointer arithmetic, which is subtle (subtract in 8-byte chunks) and less performant (requires the compiler to generate the subtraction). Suggested-by: David Matlack <dmatlack@google.com> Reviewed-by: David Matlack <dmatlack@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220712020724.1262121-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-07-14 11:31:23 -04:00
Hou Wenlong	6e1d2a3f25	KVM: x86/mmu: Replace UNMAPPED_GVA with INVALID_GPA for gva_to_gpa() The result of gva_to_gpa() is physical address not virtual address, it is odd that UNMAPPED_GVA macro is used as the result for physical address. Replace UNMAPPED_GVA with INVALID_GPA and drop UNMAPPED_GVA macro. No functional change intended. Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/6104978956449467d3c68f1ad7f2c2f6d771d0ee.1656667239.git.houwenlong.hwl@antgroup.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2022-07-12 22:31:12 +00:00
Paolo Bonzini	0cd8dc7398	KVM: x86/mmu: pull call to drop_large_spte() into __link_shadow_page() Before allocating a child shadow page table, all callers check whether the parent already points to a huge page and, if so, they drop that SPTE. This is done by drop_large_spte(). However, dropping the large SPTE is really only necessary before the sp is installed. While the sp is returned by kvm_mmu_get_child_sp(), installing it happens later in __link_shadow_page(). Move the call there instead of having it in each and every caller. To ensure that the shadow page is not linked twice if it was present, do _not_ opportunistically make kvm_mmu_get_child_sp() idempotent: instead, return an error value if the shadow page already existed. This is a bit more verbose, but clearer than NULL. Finally, now that the drop_large_spte() name is not taken anymore, remove the two underscores in front of __drop_large_spte(). Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-06-24 04:51:59 -04:00
David Matlack	6a97575d5c	KVM: x86/mmu: Cache the access bits of shadowed translations Splitting huge pages requires allocating/finding shadow pages to replace the huge page. Shadow pages are keyed, in part, off the guest access permissions they are shadowing. For fully direct MMUs, there is no shadowing so the access bits in the shadow page role are always ACC_ALL. But during shadow paging, the guest can enforce whatever access permissions it wants. In particular, eager page splitting needs to know the permissions to use for the subpages, but KVM cannot retrieve them from the guest page tables because eager page splitting does not have a vCPU. Fortunately, the guest access permissions are easy to cache whenever page faults or FNAME(sync_page) update the shadow page tables; this is an extension of the existing cache of the shadowed GFNs in the gfns array of the shadow page. The access bits only take up 3 bits, which leaves 61 bits left over for gfns, which is more than enough. Now that the gfns array caches more information than just GFNs, rename it to shadowed_translation. While here, preemptively fix up the WARN_ON() that detects gfn mismatches in direct SPs. The WARN_ON() was paired with a pr_err_ratelimited(), which means that users could sometimes see the WARN without the accompanying error message. Fix this by outputting the error message as part of the WARN splat, and opportunistically make them WARN_ONCE() because if these ever fire, they are all but guaranteed to fire a lot and will bring down the kernel. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220516232138.1783324-18-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-06-24 04:51:58 -04:00
David Matlack	2e65e842c5	KVM: x86/mmu: Derive shadow MMU page role from parent Instead of computing the shadow page role from scratch for every new page, derive most of the information from the parent shadow page. This eliminates the dependency on the vCPU root role to allocate shadow page tables, and reduces the number of parameters to kvm_mmu_get_page(). Preemptively split out the role calculation to a separate function for use in a following commit. Note that when calculating the MMU root role, we can take @role.passthrough, @role.direct, and @role.access directly from @vcpu->arch.mmu->root_role. Only @role.level and @role.quadrant still must be overridden for PAE page directories, when shadowing 32-bit guest page tables with PAE page tables. No functional change intended. Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220516232138.1783324-5-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-06-24 04:51:53 -04:00
Sean Christopherson	70e41c31bc	KVM: x86/mmu: Use common logic for computing the 32/64-bit base PA mask Use common logic for computing PT_BASE_ADDR_MASK for 32-bit, 64-bit, and EPT paging. Both PAGE_MASK and the new-common logic are supsersets of what is actually needed for 32-bit paging. PAGE_MASK sets bits 63:12 and the former GUEST_PT64_BASE_ADDR_MASK sets bits 51:12, so regardless of which value is used, the result will always be bits 31:12. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220614233328.3896033-9-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-06-20 06:21:30 -04:00
Sean Christopherson	f7384b8866	KVM: x86/mmu: Truncate paging32's PT_BASE_ADDR_MASK to 32 bits Truncate paging32's PT_BASE_ADDR_MASK to a pt_element_t, i.e. to 32 bits. Ignoring PSE huge pages, the mask is only used in conjunction with gPTEs, which are 32 bits, and so the address is limited to bits 31:12. PSE huge pages encoded PA bits 39:32 in PTE bits 20:13, i.e. need custom logic to handle their funky encoding regardless of PT_BASE_ADDR_MASK. Note, PT_LVL_OFFSET_MASK is somewhat confusing in that it computes the offset of the _gfn_, not of the gpa, i.e. not having bits 63:32 set in PT_BASE_ADDR_MASK is again correct. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220614233328.3896033-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-06-20 06:21:29 -04:00
Paolo Bonzini	f6b8ea6d43	KVM: x86/mmu: Use common macros to compute 32/64-bit paging masks Dedup the code for generating (most of) the per-type PT_* masks in paging_tmpl.h. The relevant macros only vary based on the number of bits per level, and that smidge of info is already provided in a common form as PT_LEVEL_BITS. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220614233328.3896033-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-06-20 06:21:29 -04:00
Sean Christopherson	2ca3129e80	KVM: x86/mmu: Use separate namespaces for guest PTEs and shadow PTEs Separate the macros for KVM's shadow PTEs (SPTE) from guest 64-bit PTEs (PT64). SPTE and PT64 are _mostly_ the same, but the few differences are quite critical, e.g. *_BASE_ADDR_MASK must differentiate between host and guest physical address spaces, and SPTE_PERM_MASK (was PT64_PERM_MASK) is very much specific to SPTEs. Opportunistically (and temporarily) move most guest macros into paging.h to clearly associate them with shadow paging, and to ensure that they're not used as of this commit. A future patch will eliminate them entirely. Sadly, PT32_LEVEL_BITS is left behind in mmu_internal.h because it's needed for the quadrant calculation in kvm_mmu_get_page(). The quadrant calculation is hot enough (when using shadow paging with 32-bit guests) that adding a per-context helper is undesirable, and burying the computation in paging_tmpl.h with a forward declaration isn't exactly an improvement. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220614233328.3896033-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-06-20 06:21:28 -04:00
Sean Christopherson	b3fcdb04a9	KVM: x86/mmu: Bury 32-bit PSE paging helpers in paging_tmpl.h Move a handful of one-off macros and helpers for 32-bit PSE paging into paging_tmpl.h and hide them behind "PTTYPE == 32". Under no circumstance should anything but 32-bit shadow paging care about PSE paging. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220614233328.3896033-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-06-20 06:21:26 -04:00
Sean Christopherson	007a369fba	KVM: x86/mmu: Drop unused CMPXCHG macro from paging_tmpl.h Drop the CMPXCHG macro from paging_tmpl.h, it's no longer used now that KVM uses a common uaccess helper to do 8-byte CMPXCHG. Fixes: `f122dfe447` ("KVM: x86: Use __try_cmpxchg_user() to update guest PTE A/D bits") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220613225723.2734132-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-06-15 08:07:55 -04:00
Sean Christopherson	b8b9156ec6	KVM: x86/mmu: Comment FNAME(sync_page) to document TLB flushing logic Add a comment to FNAME(sync_page) to explain why the TLB flushing logic conspiculously doesn't handle the scenario of guest protections being reduced. Specifically, if synchronizing a SPTE drops execute protections, KVM will not emit a TLB flush, whereas dropping writable or clearing A/D bits does trigger a flush via mmu_spte_update(). Architecturally, until the GPTE is implicitly or explicitly flushed from the guest's perspective, KVM is not required to flush any old, stale translations. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Message-Id: <20220513195000.99371-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-06-08 04:47:10 -04:00
Sean Christopherson	9fb3565743	KVM: x86/mmu: Drop RWX=0 SPTEs during ept_sync_page() All of sync_page()'s existing checks filter out only !PRESENT gPTE, because without execute-only, all upper levels are guaranteed to be at least READABLE. However, if EPT with execute-only support is in use by L1, KVM can create an SPTE that is shadow-present but guest-inaccessible (RWX=0) if the upper level combined permissions are R (or RW) and the leaf EPTE is changed from R (or RW) to X. Because the EPTE is considered present when viewed in isolation, and no reserved bits are set, FNAME(prefetch_invalid_gpte) will consider the GPTE valid, and cause a not-present SPTE to be created. The SPTE is "correct": the guest translation is inaccessible because the combined protections of all levels yield RWX=0, and KVM will just redirect any vmexits to the guest. If EPT A/D bits are disabled, KVM can mistake the SPTE for an access-tracked SPTE, but again such confusion isn't fatal, as the "saved" protections are also RWX=0. However, creating a useless SPTE in general means that KVM messed up something, even if this particular goof didn't manifest as a functional bug. So, drop SPTEs whose new protections will yield a RWX=0 SPTE, and add a WARN in make_spte() to detect creation of SPTEs that will result in RWX=0 protections. Fixes: `d95c55687e` ("kvm: mmu: track read permission explicitly for shadow EPT page tables") Cc: David Matlack <dmatlack@google.com> Cc: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220513195000.99371-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-06-08 04:47:08 -04:00
Sean Christopherson	1075d41efd	KVM: x86/mmu: Expand and clean up page fault stats Expand and clean up the page fault stats. The current stats are at best incomplete, and at worst misleading. Differentiate between faults that are actually fixed vs those that result in an MMIO SPTE being created, track faults that are spurious, faults that trigger emulation, faults that that are fixed in the fast path, and last but not least, track the number of faults that are taken. Note, the number of faults that require emulation for write-protected shadow pages can roughly be calculated by subtracting the number of MMIO SPTEs created from the overall number of faults that trigger emulation. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220423034752.1161007-10-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-05-12 09:51:43 -04:00
Sean Christopherson	5276c616ab	KVM: x86/mmu: Add RET_PF_CONTINUE to eliminate bool+int* "returns" Add RET_PF_CONTINUE and use it in handle_abnormal_pfn() and kvm_faultin_pfn() to signal that the page fault handler should continue doing its thing. Aside from being gross and inefficient, using a boolean return to signal continue vs. stop makes it extremely difficult to add more helpers and/or move existing code to a helper. E.g. hypothetically, if nested MMUs were to gain a separate page fault handler in the future, everything up to the "is self-modifying PTE" check can be shared by all shadow MMUs, but communicating up the stack whether to continue on or stop becomes a nightmare. More concretely, proposed support for private guest memory ran into a similar issue, where it'll be forced to forego a helper in order to yield sane code: https://lore.kernel.org/all/YkJbxiL%2FAz7olWlq@google.com. No functional change intended. Cc: David Matlack <dmatlack@google.com> Cc: Chao Peng <chao.p.peng@linux.intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220423034752.1161007-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-05-12 09:51:42 -04:00
Lai Jiangshan	84e5ffd045	KVM: X86/MMU: Fix shadowing 5-level NPT for 4-level NPT L1 guest When shadowing 5-level NPT for 4-level NPT L1 guest, the root_sp is allocated with role.level = 5 and the guest pagetable's root gfn. And root_sp->spt[0] is also allocated with the same gfn and the same role except role.level = 4. Luckily that they are different shadow pages, but only root_sp->spt[0] is the real translation of the guest pagetable. Here comes a problem: If the guest switches from gCR4_LA57=0 to gCR4_LA57=1 (or vice verse) and uses the same gfn as the root page for nested NPT before and after switching gCR4_LA57. The host (hCR4_LA57=1) might use the same root_sp for the guest even the guest switches gCR4_LA57. The guest will see unexpected page mapped and L2 may exploit the bug and hurt L1. It is lucky that the problem can't hurt L0. And three special cases need to be handled: The root_sp should be like role.direct=1 sometimes: its contents are not backed by gptes, root_sp->gfns is meaningless. (For a normal high level sp in shadow paging, sp->gfns is often unused and kept zero, but it could be relevant and meaningful if sp->gfns is used because they are backed by concrete gptes.) For such root_sp in the case, root_sp is just a portal to contribute root_sp->spt[0], and root_sp->gfns should not be used and root_sp->spt[0] should not be dropped if gpte[0] of the guest root pagetable is changed. Such root_sp should not be accounted too. So add role.passthrough to distinguish the shadow pages in the hash when gCR4_LA57 is toggled and fix above special cases by using it in kvm_mmu_page_{get\|set}_gfn() and sp_has_gptes(). Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Message-Id: <20220420131204.2850-3-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-04-29 12:50:00 -04:00
Paolo Bonzini	4d25502aa1	KVM: x86/mmu: replace root_level with cpu_role.base.level Remove another duplicate field of struct kvm_mmu. This time it's the root level for page table walking; the separate field is always initialized as cpu_role.base.level, so its users can look up the CPU mode directly instead. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-04-29 12:49:58 -04:00
Paolo Bonzini	7a458f0e1b	KVM: x86/mmu: remove extended bits from mmu_role, rename field mmu_role represents the role of the root of the page tables. It does not need any extended bits, as those govern only KVM's page table walking; the is_* functions used for page table walking always use the CPU role. ext.valid is not present anymore in the MMU role, but an all-zero MMU role is impossible because the level field is never zero in the MMU role. So just zap the whole mmu_role in order to force invalidation after CPUID is updated. While making this change, which requires touching almost every occurrence of "mmu_role", rename it to "root_role". Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-04-29 12:49:56 -04:00
Paolo Bonzini	ec283cb1dc	KVM: x86/mmu: remove ept_ad field The ept_ad field is used during page walk to determine if the guest PTEs have accessed and dirty bits. In the MMU role, the ad_disabled bit represents whether the shadow PTEs have the bits, so it would be incorrect to replace PT_HAVE_ACCESSED_DIRTY with just !mmu->mmu_role.base.ad_disabled. However, the similar field in the CPU mode, ad_disabled, is initialized correctly: to the opposite value of ept_ad for shadow EPT, and zero for non-EPT guest paging modes (which always have A/D bits). It is therefore possible to compute PT_HAVE_ACCESSED_DIRTY from the CPU mode, like other page-format fields; it just has to be inverted to account for the different polarity. In fact, now that the CPU mode is distinct from the MMU roles, it would even be possible to remove PT_HAVE_ACCESSED_DIRTY macro altogether, and use !mmu->cpu_role.base.ad_disabled instead. I am not doing this because the macro has a small effect in terms of dead code elimination: text data bss dec hex 103544 16665 112 120321 1d601 # as of this patch 103746 16665 112 120523 1d6cb # without PT_HAVE_ACCESSED_DIRTY Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-04-29 12:49:54 -04:00
Paolo Bonzini	e5ed0fb010	KVM: x86/mmu: split cpu_role from mmu_role Snapshot the state of the processor registers that govern page walk into a new field of struct kvm_mmu. This is a more natural representation than having it mostly in mmu_role but not exclusively; the delta right now is represented in other fields, such as root_level. The nested MMU now has only the CPU role; and in fact the new function kvm_calc_cpu_role is analogous to the previous kvm_calc_nested_mmu_role, except that it has role.base.direct equal to !CR0.PG. For a walk-only MMU, "direct" has no meaning, but we set it to !CR0.PG so that role.ext.cr0_pg can go away in a future patch. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-04-29 12:49:53 -04:00
Paolo Bonzini	25cc05652c	KVM: x86/mmu: rephrase unclear comment If accessed bits are not supported there simple isn't any distinction between accessed and non-accessed gPTEs, so the comment does not make much sense. Rephrase it in terms of what happens if accessed bits are supported. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-04-29 12:49:18 -04:00
Sean Christopherson	f122dfe447	KVM: x86: Use __try_cmpxchg_user() to update guest PTE A/D bits Use the recently introduced __try_cmpxchg_user() to update guest PTE A/D bits instead of mapping the PTE into kernel address space. The VM_PFNMAP path is broken as it assumes that vm_pgoff is the base pfn of the mapped VMA range, which is conceptually wrong as vm_pgoff is the offset relative to the file and has nothing to do with the pfn. The horrific hack worked for the original use case (backing guest memory with /dev/mem), but leads to accessing "random" pfns for pretty much any other VM_PFNMAP case. Fixes: `bd53cb35a3` ("X86/KVM: Handle PFNs outside of kernel reach when touching GPTEs") Debugged-by: Tadeusz Struk <tadeusz.struk@linaro.org> Tested-by: Tadeusz Struk <tadeusz.struk@linaro.org> Reported-by: syzbot+6cde2282daa792c49ab8@syzkaller.appspotmail.com Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220202004945.2540433-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-04-13 13:37:47 -04:00
Sean Christopherson	ca2a7c22a1	KVM: x86/mmu: Derive EPT violation RWX bits from EPTE RWX bits Derive the mask of RWX bits reported on EPT violations from the mask of RWX bits that are shoved into EPT entries; the layout is the same, the EPT violation bits are simply shifted by three. Use the new shift and a slight copy-paste of the mask derivation instead of completely open coding the same to convert between the EPT entry bits and the exit qualification when synthesizing a nested EPT Violation. No functional change intended. Cc: SU Hang <darcy.sh@antgroup.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220329030108.97341-3-darcy.sh@antgroup.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-04-13 13:37:37 -04:00
SU Hang	aecce510fe	KVM: VMX: replace 0x180 with EPT_VIOLATION_* definition Using self-expressing macro definition EPT_VIOLATION_GVA_VALIDATION and EPT_VIOLATION_GVA_TRANSLATED instead of 0x180 in FNAME(walk_addr_generic)(). Signed-off-by: SU Hang <darcy.sh@antgroup.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220329030108.97341-2-darcy.sh@antgroup.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-04-13 13:37:19 -04:00
Paolo Bonzini	2a8859f373	KVM: x86/mmu: do compare-and-exchange of gPTE via the user address FNAME(cmpxchg_gpte) is an inefficient mess. It is at least decent if it can go through get_user_pages_fast(), but if it cannot then it tries to use memremap(); that is not just terribly slow, it is also wrong because it assumes that the VM_PFNMAP VMA is contiguous. The right way to do it would be to do the same thing as hva_to_pfn_remapped() does since commit `add6a0cd1c` ("KVM: MMU: try to fix up page faults before giving up", 2016-07-05), using follow_pte() and fixup_user_fault() to determine the correct address to use for memremap(). To do this, one could for example extract hva_to_pfn() for use outside virt/kvm/kvm_main.c. But really there is no reason to do that either, because there is already a perfectly valid address to do the cmpxchg() on, only it is a userspace address. That means doing user_access_begin()/user_access_end() and writing the code in assembly to handle exceptions correctly. Worse, the guest PTE can be 8-byte even on i686 so there is the extra complication of using cmpxchg8b to account for. But at least it is an efficient mess. (Thanks to Linus for suggesting improvement on the inline assembly). Reported-by: Qiuhao Li <qiuhao@sysec.org> Reported-by: Gaoning Pan <pgn@zju.edu.cn> Reported-by: Yongkang Jia <kangel@zju.edu.cn> Reported-by: syzbot+6cde2282daa792c49ab8@syzkaller.appspotmail.com Debugged-by: Tadeusz Struk <tadeusz.struk@linaro.org> Tested-by: Maxim Levitsky <mlevitsk@redhat.com> Cc: stable@vger.kernel.org Fixes: `bd53cb35a3` ("X86/KVM: Handle PFNs outside of kernel reach when touching GPTEs") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-04-02 05:37:27 -04:00
Lai Jiangshan	5b22bbe717	KVM: X86: Change the type of access u32 to u64 Change the type of access u32 to u64 for FNAME(walk_addr) and ->gva_to_gpa(). The kinds of accesses are usually combinations of UWX, and VMX/SVM's nested paging adds a new factor of access: is it an access for a guest page table or for a final guest physical address. And SMAP relies a factor for supervisor access: explicit or implicit. So @access in FNAME(walk_addr) and ->gva_to_gpa() is better to include all these information to do the walk. Although @access(u32) has enough bits to encode all the kinds, this patch extends it to u64: o Extra bits will be in the higher 32 bits, so that we can easily obtain the traditional access mode (UWX) by converting it to u32. o Reuse the value for the access kind defined by SVM's nested paging (PFERR_GUEST_FINAL_MASK and PFERR_GUEST_PAGE_MASK) as @error_code in kvm_handle_page_fault(). Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Message-Id: <20220311070346.45023-2-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-04-02 05:34:42 -04:00
Paolo Bonzini	b9e5603c2a	KVM: x86: use struct kvm_mmu_root_info for mmu->root The root_hpa and root_pgd fields form essentially a struct kvm_mmu_root_info. Use the struct to have more consistency between mmu->root and mmu->prev_roots. The patch is entirely search and replace except for cached_root_available, which does not need a temporary struct kvm_mmu_root_info anymore. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-25 08:20:16 -05:00
Sean Christopherson	1bbc60d0c7	KVM: x86/mmu: Remove MMU auditing Remove mmu_audit.c and all its collateral, the auditing code has suffered severe bitrot, ironically partly due to shadow paging being more stable and thus not benefiting as much from auditing, but mostly due to TDP supplanting shadow paging for non-nested guests and shadowing of nested TDP not heavily stressing the logic that is being audited. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-18 13:46:23 -05:00
Lai Jiangshan	c59a0f57fa	KVM: X86: Remove mmu->translate_gpa Reduce an indirect function call (retpoline) and some intialization code. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211124122055.64424-4-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:25:11 -05:00

1 2 3

133 Commits