linux

korg/linux

mirror of https://mirrors.bfsu.edu.cn/git/linux.git synced 2025-01-13 09:15:02 +08:00

Author	SHA1	Message	Date
Paolo Bonzini	a1a39128fa	KVM: MMU: propagate alloc_workqueue failure If kvm->arch.tdp_mmu_zap_wq cannot be created, the failure has to be propagated up to kvm_mmu_init_vm and kvm_arch_init_vm. kvm_arch_init_vm also has to undo all the initialization, so group all the MMU initialization code at the beginning and handle cleaning up of kvm_page_track_init. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-04-02 05:34:38 -04:00
Paolo Bonzini	873dd12217	Revert "KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range()" This reverts commit `cf3e26427c`. Multi-vCPU Hyper-V guests started crashing randomly on boot with the latest kvm/queue and the problem can be bisected the problem to this particular patch. Basically, I'm not able to boot e.g. 16-vCPU guest successfully anymore. Both Intel and AMD seem to be affected. Reverting the commit saves the day. Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-03-21 05:11:51 -04:00
Paolo Bonzini	fcb93eb6d0	kvm: x86/mmu: Flush TLB before zap_gfn_range releases RCU Since "KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range()" is going to be reverted, it's not going to be true anymore that the zap-page flow does not free any 'struct kvm_mmu_page'. Introduce an early flush before tdp_mmu_zap_leafs() returns, to preserve bisectability. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-03-21 05:11:51 -04:00
Sean Christopherson	396fd74d61	KVM: x86/mmu: WARN on any attempt to atomically update REMOVED SPTE Disallow calling tdp_mmu_set_spte_atomic() with a REMOVED "old" SPTE. This solves a conundrum introduced by commit `3255530ab1` ("KVM: x86/mmu: Automatically update iter->old_spte if cmpxchg fails"); if the helper doesn't update old_spte in the REMOVED case, then theoretically the caller could get stuck in an infinite loop as it will fail indefinitely on the REMOVED SPTE. E.g. until recently, clear_dirty_gfn_range() didn't check for a present SPTE and would have spun until getting rescheduled. In practice, only the page fault path should "create" a new SPTE, all other paths should only operate on existing, a.k.a. shadow present, SPTEs. Now that the page fault path pre-checks for a REMOVED SPTE in all cases, require all other paths to indirectly pre-check by verifying the target SPTE is a shadow-present SPTE. Note, this does not guarantee the actual SPTE isn't REMOVED, nor is that scenario disallowed. The invariant is only that the caller mustn't invoke tdp_mmu_set_spte_atomic() if the SPTE was REMOVED when last observed by the caller. Cc: David Matlack <dmatlack@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220226001546.360188-25-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-03-08 10:59:10 -05:00
Sean Christopherson	58298b0681	KVM: x86/mmu: Check for a REMOVED leaf SPTE before making the SPTE Explicitly check for a REMOVED leaf SPTE prior to attempting to map the final SPTE when handling a TDP MMU fault. Functionally, this is a nop as tdp_mmu_set_spte_atomic() will eventually detect the frozen SPTE. Pre-checking for a REMOVED SPTE is a minor optmization, but the real goal is to allow tdp_mmu_set_spte_atomic() to have an invariant that the "old" SPTE is never a REMOVED SPTE. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Message-Id: <20220226001546.360188-24-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-03-08 10:59:09 -05:00
Paolo Bonzini	efd995dae5	KVM: x86/mmu: Zap defunct roots via asynchronous worker Zap defunct roots, a.k.a. roots that have been invalidated after their last reference was initially dropped, asynchronously via the existing work queue instead of forcing the work upon the unfortunate task that happened to drop the last reference. If a vCPU task drops the last reference, the vCPU is effectively blocked by the host for the entire duration of the zap. If the root being zapped happens be fully populated with 4kb leaf SPTEs, e.g. due to dirty logging being active, the zap can take several hundred seconds. Unsurprisingly, most guests are unhappy if a vCPU disappears for hundreds of seconds. E.g. running a synthetic selftest that triggers a vCPU root zap with ~64tb of guest memory and 4kb SPTEs blocks the vCPU for 900+ seconds. Offloading the zap to a worker drops the block time to <100ms. There is an important nuance to this change. If the same work item was queued twice before the work function has run, it would only execute once and one reference would be leaked. Therefore, now that queueing and flushing items is not anymore protected by kvm->slots_lock, kvm_tdp_mmu_invalidate_all_roots() has to check root->role.invalid and skip already invalid roots. On the other hand, kvm_mmu_zap_all_fast() must return only after those skipped roots have been zapped as well. These two requirements can be satisfied only if _all_ places that change invalid to true now schedule the worker before releasing the mmu_lock. There are just two, kvm_tdp_mmu_put_root() and kvm_tdp_mmu_invalidate_all_roots(). Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Message-Id: <20220226001546.360188-23-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-03-08 10:57:11 -05:00
Sean Christopherson	1b6043e8e5	KVM: x86/mmu: Zap roots in two passes to avoid inducing RCU stalls When zapping a TDP MMU root, perform the zap in two passes to avoid zapping an entire top-level SPTE while holding RCU, which can induce RCU stalls. In the first pass, zap SPTEs at PG_LEVEL_1G, and then zap top-level entries in the second pass. With 4-level paging, zapping a PGD that is fully populated with 4kb leaf SPTEs take up to ~7 or so seconds (time varies based on kernel config, number of (v)CPUs, etc...). With 5-level paging, that time can balloon well into hundreds of seconds. Before remote TLB flushes were omitted, the problem was even worse as waiting for all active vCPUs to respond to the IPI introduced significant overhead for VMs with large numbers of vCPUs. By zapping 1gb SPTEs (both shadow pages and hugepages) in the first pass, the amount of work that is done without dropping RCU protection is strictly bounded, with the worst case latency for a single operation being less than 100ms. Zapping at 1gb in the first pass is not arbitrary. First and foremost, KVM relies on being able to zap 1gb shadow pages in a single shot when when repacing a shadow page with a hugepage. Zapping a 1gb shadow page that is fully populated with 4kb dirty SPTEs also triggers the worst case latency due writing back the struct page accessed/dirty bits for each 4kb page, i.e. the two-pass approach is guaranteed to work so long as KVM can cleany zap a 1gb shadow page. rcu: INFO: rcu_sched self-detected stall on CPU rcu: 52-....: (20999 ticks this GP) idle=7be/1/0x4000000000000000 softirq=15759/15759 fqs=5058 (t=21016 jiffies g=66453 q=238577) NMI backtrace for cpu 52 Call Trace: ... mark_page_accessed+0x266/0x2f0 kvm_set_pfn_accessed+0x31/0x40 handle_removed_tdp_mmu_page+0x259/0x2e0 __handle_changed_spte+0x223/0x2c0 handle_removed_tdp_mmu_page+0x1c1/0x2e0 __handle_changed_spte+0x223/0x2c0 handle_removed_tdp_mmu_page+0x1c1/0x2e0 __handle_changed_spte+0x223/0x2c0 zap_gfn_range+0x141/0x3b0 kvm_tdp_mmu_zap_invalidated_roots+0xc8/0x130 kvm_mmu_zap_all_fast+0x121/0x190 kvm_mmu_invalidate_zap_pages_in_memslot+0xe/0x10 kvm_page_track_flush_slot+0x5c/0x80 kvm_arch_flush_shadow_memslot+0xe/0x10 kvm_set_memslot+0x172/0x4e0 __kvm_set_memory_region+0x337/0x590 kvm_vm_ioctl+0x49c/0xf80 Reported-by: David Matlack <dmatlack@google.com> Cc: Ben Gardon <bgardon@google.com> Cc: Mingwei Zhang <mizhang@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Message-Id: <20220226001546.360188-22-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-03-08 10:57:09 -05:00
Paolo Bonzini	8351779ce6	KVM: x86/mmu: Allow yielding when zapping GFNs for defunct TDP MMU root Allow yielding when zapping SPTEs after the last reference to a valid root is put. Because KVM must drop all SPTEs in response to relevant mmu_notifier events, mark defunct roots invalid and reset their refcount prior to zapping the root. Keeping the refcount elevated while the zap is in-progress ensures the root is reachable via mmu_notifier until the zap completes and the last reference to the invalid, defunct root is put. Allowing kvm_tdp_mmu_put_root() to yield fixes soft lockup issues if the root in being put has a massive paging structure, e.g. zapping a root that is backed entirely by 4kb pages for a guest with 32tb of memory can take hundreds of seconds to complete. watchdog: BUG: soft lockup - CPU#49 stuck for 485s! [max_guest_memor:52368] RIP: 0010:kvm_set_pfn_dirty+0x30/0x50 [kvm] __handle_changed_spte+0x1b2/0x2f0 [kvm] handle_removed_tdp_mmu_page+0x1a7/0x2b8 [kvm] __handle_changed_spte+0x1f4/0x2f0 [kvm] handle_removed_tdp_mmu_page+0x1a7/0x2b8 [kvm] __handle_changed_spte+0x1f4/0x2f0 [kvm] tdp_mmu_zap_root+0x307/0x4d0 [kvm] kvm_tdp_mmu_put_root+0x7c/0xc0 [kvm] kvm_mmu_free_roots+0x22d/0x350 [kvm] kvm_mmu_reset_context+0x20/0x60 [kvm] kvm_arch_vcpu_ioctl_set_sregs+0x5a/0xc0 [kvm] kvm_vcpu_ioctl+0x5bd/0x710 [kvm] __se_sys_ioctl+0x77/0xc0 __x64_sys_ioctl+0x1d/0x20 do_syscall_64+0x44/0xa0 entry_SYSCALL_64_after_hwframe+0x44/0xae KVM currently doesn't put a root from a non-preemptible context, so other than the mmu_notifier wrinkle, yielding when putting a root is safe. Yield-unfriendly iteration uses for_each_tdp_mmu_root(), which doesn't take a reference to each root (it requires mmu_lock be held for the entire duration of the walk). tdp_mmu_next_root() is used only by the yield-friendly iterator. tdp_mmu_zap_root_work() is explicitly yield friendly. kvm_mmu_free_roots() => mmu_free_root_page() is a much bigger fan-out, but is still yield-friendly in all call sites, as all callers can be traced back to some combination of vcpu_run(), kvm_destroy_vm(), and/or kvm_create_vm(). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220226001546.360188-21-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-03-08 10:57:09 -05:00
Paolo Bonzini	22b94c4b63	KVM: x86/mmu: Zap invalidated roots via asynchronous worker Use the system worker threads to zap the roots invalidated by the TDP MMU's "fast zap" mechanism, implemented by kvm_tdp_mmu_invalidate_all_roots(). At this point, apart from allowing some parallelism in the zapping of roots, the workqueue is a glorified linked list: work items are added and flushed entirely within a single kvm->slots_lock critical section. However, the workqueue fixes a latent issue where kvm_mmu_zap_all_invalidated_roots() assumes that it owns a reference to all invalid roots; therefore, no one can set the invalid bit outside kvm_mmu_zap_all_fast(). Putting the invalidated roots on a linked list... erm, on a workqueue ensures that tdp_mmu_zap_root_work() only puts back those extra references that kvm_mmu_zap_all_invalidated_roots() had gifted to it. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-03-08 10:55:27 -05:00
Sean Christopherson	bb95dfb9e2	KVM: x86/mmu: Defer TLB flush to caller when freeing TDP MMU shadow pages Defer TLB flushes to the caller when freeing TDP MMU shadow pages instead of immediately flushing. Because the shadow pages are freed in an RCU callback, so long as at least one CPU holds RCU, all CPUs are protected. For vCPUs running in the guest, i.e. consuming TLB entries, KVM only needs to ensure the caller services the pending TLB flush before dropping its RCU protections. I.e. use the caller's RCU as a proxy for all vCPUs running in the guest. Deferring the flushes allows batching flushes, e.g. when installing a 1gb hugepage and zapping a pile of SPs. And when zapping an entire root, deferring flushes allows skipping the flush entirely (because flushes are not needed in that case). Avoiding flushes when zapping an entire root is especially important as synchronizing with other CPUs via IPI after zapping every shadow page can cause significant performance issues for large VMs. The issue is exacerbated by KVM zapping entire top-level entries without dropping RCU protection, which can lead to RCU stalls even when zapping roots backing relatively "small" amounts of guest memory, e.g. 2tb. Removing the IPI bottleneck largely mitigates the RCU issues, though it's likely still a problem for 5-level paging. A future patch will further address the problem by zapping roots in multiple passes to avoid holding RCU for an extended duration. Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220226001546.360188-20-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-03-08 09:31:57 -05:00
Sean Christopherson	bd29677952	KVM: x86/mmu: Do remote TLB flush before dropping RCU in TDP MMU resched When yielding in the TDP MMU iterator, service any pending TLB flush before dropping RCU protections in anticipation of using the caller's RCU "lock" as a proxy for vCPUs in the guest. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Message-Id: <20220226001546.360188-19-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-03-08 09:31:56 -05:00
Sean Christopherson	cf3e26427c	KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range() Zap only leaf SPTEs in the TDP MMU's zap_gfn_range(), and rename various functions accordingly. When removing mappings for functional correctness (except for the stupid VFIO GPU passthrough memslots bug), zapping the leaf SPTEs is sufficient as the paging structures themselves do not point at guest memory and do not directly impact the final translation (in the TDP MMU). Note, this aligns the TDP MMU with the legacy/full MMU, which zaps only the rmaps, a.k.a. leaf SPTEs, in kvm_zap_gfn_range() and kvm_unmap_gfn_range(). Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Message-Id: <20220226001546.360188-18-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-03-08 09:31:56 -05:00
Sean Christopherson	acbda82a81	KVM: x86/mmu: Require mmu_lock be held for write to zap TDP MMU range Now that all callers of zap_gfn_range() hold mmu_lock for write, drop support for zapping with mmu_lock held for read. That all callers hold mmu_lock for write isn't a random coincidence; now that the paths that need to zap _everything_ have their own path, the only callers left are those that need to zap for functional correctness. And when zapping is required for functional correctness, mmu_lock must be held for write, otherwise the caller has no guarantees about the state of the TDP MMU page tables after it has run, e.g. the SPTE(s) it zapped can be immediately replaced by a vCPU faulting in a page. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Message-Id: <20220226001546.360188-17-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-03-08 09:31:55 -05:00
Sean Christopherson	e2b5b21d3a	KVM: x86/mmu: Add dedicated helper to zap TDP MMU root shadow page Add a dedicated helper for zapping a TDP MMU root, and use it in the three flows that do "zap_all" and intentionally do not do a TLB flush if SPTEs are zapped (zapping an entire root is safe if and only if it cannot be in use by any vCPU). Because a TLB flush is never required, unconditionally pass "false" to tdp_mmu_iter_cond_resched() when potentially yielding. Opportunistically document why KVM must not yield when zapping roots that are being zapped by kvm_tdp_mmu_put_root(), i.e. roots whose refcount has reached zero, and further harden the flow to detect improper KVM behavior with respect to roots that are supposed to be unreachable. In addition to hardening zapping of roots, isolating zapping of roots will allow future simplification of zap_gfn_range() by having it zap only leaf SPTEs, and by removing its tricky "zap all" heuristic. By having all paths that truly need to free _all_ SPs flow through the dedicated root zapper, the generic zapper can be freed of those concerns. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Message-Id: <20220226001546.360188-16-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-03-08 09:31:55 -05:00
Sean Christopherson	77c8cd6b85	KVM: x86/mmu: Skip remote TLB flush when zapping all of TDP MMU Don't flush the TLBs when zapping all TDP MMU pages, as the only time KVM uses the slow version of "zap everything" is when the VM is being destroyed or the owning mm has exited. In either case, KVM_RUN is unreachable for the VM, i.e. the guest TLB entries cannot be consumed. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Message-Id: <20220226001546.360188-15-seanjc@google.com> Reviewed-by: Mingwei Zhang <mizhang@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-03-08 09:31:54 -05:00
Sean Christopherson	c10743a182	KVM: x86/mmu: Zap only the target TDP MMU shadow page in NX recovery When recovering a potential hugepage that was shattered for the iTLB multihit workaround, precisely zap only the target page instead of iterating over the TDP MMU to find the SP that was passed in. This will allow future simplification of zap_gfn_range() by having it zap only leaf SPTEs. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220226001546.360188-14-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-03-08 09:31:54 -05:00
Sean Christopherson	626808d137	KVM: x86/mmu: Refactor low-level TDP MMU set SPTE helper to take raw values Refactor __tdp_mmu_set_spte() to work with raw values instead of a tdp_iter objects so that a future patch can modify SPTEs without doing a walk, and without having to synthesize a tdp_iter. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Message-Id: <20220226001546.360188-13-seanjc@google.com> Reviewed-by: Mingwei Zhang <mizhang@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-03-08 09:31:53 -05:00
Sean Christopherson	966da62ada	KVM: x86/mmu: WARN if old _or_ new SPTE is REMOVED in non-atomic path WARN if the new_spte being set by __tdp_mmu_set_spte() is a REMOVED_SPTE, which is called out by the comment as being disallowed but not actually checked. Keep the WARN on the old_spte as well, because overwriting a REMOVED_SPTE in the non-atomic path is also disallowed (as evidence by lack of splats with the existing WARN). Fixes: `08f07c800e` ("KVM: x86/mmu: Flush TLBs after zap in TDP MMU PF handler") Cc: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Message-Id: <20220226001546.360188-12-seanjc@google.com> Reviewed-by: Mingwei Zhang <mizhang@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-03-08 09:31:53 -05:00
Sean Christopherson	0e587aa733	KVM: x86/mmu: Add helpers to read/write TDP MMU SPTEs and document RCU Add helpers to read and write TDP MMU SPTEs instead of open coding rcu_dereference() all over the place, and to provide a convenient location to document why KVM doesn't exempt holding mmu_lock for write from having to hold RCU (and any future changes to the rules). No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Message-Id: <20220226001546.360188-11-seanjc@google.com> Reviewed-by: Mingwei Zhang <mizhang@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-03-08 09:31:52 -05:00
Sean Christopherson	a151aceca1	KVM: x86/mmu: Drop RCU after processing each root in MMU notifier hooks Drop RCU protection after processing each root when handling MMU notifier hooks that aren't the "unmap" path, i.e. aren't zapping. Temporarily drop RCU to let RCU do its thing between roots, and to make it clear that there's no special behavior that relies on holding RCU across all roots. Currently, the RCU protection is completely superficial, it's necessary only to make rcu_dereference() of SPTE pointers happy. A future patch will rely on holding RCU as a proxy for vCPUs in the guest, e.g. to ensure shadow pages aren't freed before all vCPUs do a TLB flush (or rather, acknowledge the need for a flush), but in that case RCU needs to be held until the flush is complete if and only if the flush is needed because a shadow page may have been removed. And except for the "unmap" path, MMU notifier events cannot remove SPs (don't toggle PRESENT bit, and can't change the PFN for a SP). Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Message-Id: <20220226001546.360188-10-seanjc@google.com> Reviewed-by: Mingwei Zhang <mizhang@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-03-08 09:31:52 -05:00
Sean Christopherson	93fa50f644	KVM: x86/mmu: Batch TLB flushes from TDP MMU for MMU notifier change_spte Batch TLB flushes (with other MMUs) when handling ->change_spte() notifications in the TDP MMU. The MMU notifier path in question doesn't allow yielding and correcty flushes before dropping mmu_lock. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Message-Id: <20220226001546.360188-9-seanjc@google.com> Reviewed-by: Mingwei Zhang <mizhang@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-03-08 09:31:51 -05:00
Sean Christopherson	c8e5a0d0e9	KVM: x86/mmu: Check for !leaf=>leaf, not PFN change, in TDP MMU SP removal Look for a !leaf=>leaf conversion instead of a PFN change when checking if a SPTE change removed a TDP MMU shadow page. Convert the PFN check into a WARN, as KVM should never change the PFN of a shadow page (except when its being zapped or replaced). From a purely theoretical perspective, it's not illegal to replace a SP with a hugepage pointing at the same PFN. In practice, it's impossible as that would require mapping guest memory overtop a kernel-allocated SP. Either way, the check is odd. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Message-Id: <20220226001546.360188-8-seanjc@google.com> Reviewed-by: Mingwei Zhang <mizhang@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-03-08 09:31:51 -05:00
Paolo Bonzini	614f6970aa	KVM: x86/mmu: do not allow readers to acquire references to invalid roots Remove the "shared" argument of for_each_tdp_mmu_root_yield_safe, thus ensuring that readers do not ever acquire a reference to an invalid root. After this patch, all readers except kvm_tdp_mmu_zap_invalidated_roots() treat refcount=0/valid, refcount=0/invalid and refcount=1/invalid in exactly the same way. kvm_tdp_mmu_zap_invalidated_roots() is different but it also does not acquire a reference to the invalid root, and it cannot see refcount=0/invalid because it is guaranteed to run after kvm_tdp_mmu_invalidate_all_roots(). Opportunistically add a lockdep assertion to the yield-safe iterator. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-03-08 09:31:50 -05:00
Paolo Bonzini	7c554d8e51	KVM: x86/mmu: only perform eager page splitting on valid roots Eager page splitting is an optimization; it does not have to be performed on invalid roots. It is also the only case in which a reader might acquire a reference to an invalid root, so after this change we know that readers will skip both dying and invalid roots. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-03-08 09:31:50 -05:00
Sean Christopherson	226b8c8f85	KVM: x86/mmu: Require mmu_lock be held for write in unyielding root iter Assert that mmu_lock is held for write by users of the yield-unfriendly TDP iterator. The nature of a shared walk means that the caller needs to play nice with other tasks modifying the page tables, which is more or less the same thing as playing nice with yielding. Theoretically, KVM could gain a flow where it could legitimately take mmu_lock for read in a non-preemptible context, but that's highly unlikely and any such case should be viewed with a fair amount of scrutiny. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Message-Id: <20220226001546.360188-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-03-08 09:31:47 -05:00
Sean Christopherson	7ae5840e6f	KVM: x86/mmu: Document that zapping invalidated roots doesn't need to flush Remove the misleading flush "handling" when zapping invalidated TDP MMU roots, and document that flushing is unnecessary for all flavors of MMUs when zapping invalid/obsolete roots/pages. The "handling" in the TDP MMU is dead code, as zap_gfn_range() is called with shared=true, in which case it will never return true due to the flushing being handled by tdp_mmu_zap_spte_atomic(). No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Message-Id: <20220226001546.360188-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-03-08 09:31:36 -05:00
Sean Christopherson	db01416b22	KVM: x86/mmu: Formalize TDP MMU's (unintended?) deferred TLB flush logic Explicitly ignore the result of zap_gfn_range() when putting the last reference to a TDP MMU root, and add a pile of comments to formalize the TDP MMU's behavior of deferring TLB flushes to alloc/reuse. Note, this only affects the !shared case, as zap_gfn_range() subtly never returns true for "flush" as the flush is handled by tdp_mmu_zap_spte_atomic(). Putting the root without a flush is ok because even if there are stale references to the root in the TLB, they are unreachable because KVM will not run the guest with the same ASID without first flushing (where ASID in this context refers to both SVM's explicit ASID and Intel's implicit ASID that is constructed from VPID+PCID+EPT4A+etc...). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220226001546.360188-5-seanjc@google.com> Reviewed-by: Mingwei Zhang <mizhang@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-03-08 09:31:23 -05:00
Sean Christopherson	f28e9c7fce	KVM: x86/mmu: Fix wrong/misleading comments in TDP MMU fast zap Fix misleading and arguably wrong comments in the TDP MMU's fast zap flow. The comments, and the fact that actually zapping invalid roots was added separately, strongly suggests that zapping invalid roots is an optimization and not required for correctness. That is a lie. KVM _must_ zap invalid roots before returning from kvm_mmu_zap_all_fast(), because when it's called from kvm_mmu_invalidate_zap_pages_in_memslot(), KVM is relying on it to fully remove all references to the memslot. Once the memslot is gone, KVM's mmu_notifier hooks will be unable to find the stale references as the hva=>gfn translation is done via the memslots. If KVM doesn't immediately zap SPTEs and userspace unmaps a range after deleting a memslot, KVM will fail to zap in response to the mmu_notifier due to not finding a memslot corresponding to the notifier's range, which leads to a variation of use-after-free. The other misleading comment (and code) explicitly states that roots without a reference should be skipped. While that's technically true, it's also extremely misleading as it should be impossible for KVM to encounter a defunct root on the list while holding mmu_lock for write. Opportunistically add a WARN to enforce that invariant. Fixes: `b7cccd397f` ("KVM: x86/mmu: Fast invalidation for TDP MMU") Fixes: `4c6654bd16` ("KVM: x86/mmu: Tear down roots before kvm_mmu_zap_all_fast returns") Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Message-Id: <20220226001546.360188-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-03-08 09:31:18 -05:00
Sean Christopherson	3354ef5a59	KVM: x86/mmu: Check for present SPTE when clearing dirty bit in TDP MMU Explicitly check for present SPTEs when clearing dirty bits in the TDP MMU. This isn't strictly required for correctness, as setting the dirty bit in a defunct SPTE will not change the SPTE from !PRESENT to PRESENT. However, the guarded MMU_WARN_ON() in spte_ad_need_write_protect() would complain if anyone actually turned on KVM's MMU debugging. Fixes: `a6a0b05da9` ("kvm: x86/mmu: Support dirty logging for the TDP MMU") Cc: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Message-Id: <20220226001546.360188-3-seanjc@google.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-03-08 09:31:17 -05:00
Paolo Bonzini	b9e5603c2a	KVM: x86: use struct kvm_mmu_root_info for mmu->root The root_hpa and root_pgd fields form essentially a struct kvm_mmu_root_info. Use the struct to have more consistency between mmu->root and mmu->prev_roots. The patch is entirely search and replace except for cached_root_available, which does not need a temporary struct kvm_mmu_root_info anymore. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-25 08:20:16 -05:00
David Matlack	e0b728b1f1	KVM: x86/mmu: Add tracepoint for splitting huge pages Add a tracepoint that records whenever KVM eagerly splits a huge page and the error status of the split to indicate if it succeeded or failed and why. Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-18-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-10 13:50:43 -05:00
David Matlack	cb00a70bd4	KVM: x86/mmu: Split huge pages mapped by the TDP MMU during KVM_CLEAR_DIRTY_LOG When using KVM_DIRTY_LOG_INITIALLY_SET, huge pages are not write-protected when dirty logging is enabled on the memslot. Instead they are write-protected once userspace invokes KVM_CLEAR_DIRTY_LOG for the first time and only for the specific sub-region being cleared. Enhance KVM_CLEAR_DIRTY_LOG to also try to split huge pages prior to write-protecting to avoid causing write-protection faults on vCPU threads. This also allows userspace to smear the cost of huge page splitting across multiple ioctls, rather than splitting the entire memslot as is the case when initially-all-set is not used. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-17-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-10 13:50:43 -05:00
David Matlack	a3fe5dbda0	KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is enabled When dirty logging is enabled without initially-all-set, try to split all huge pages in the memslot down to 4KB pages so that vCPUs do not have to take expensive write-protection faults to split huge pages. Eager page splitting is best-effort only. This commit only adds the support for the TDP MMU, and even there splitting may fail due to out of memory conditions. Failures to split a huge page is fine from a correctness standpoint because KVM will always follow up splitting by write-protecting any remaining huge pages. Eager page splitting moves the cost of splitting huge pages off of the vCPU threads and onto the thread enabling dirty logging on the memslot. This is useful because: 1. Splitting on the vCPU thread interrupts vCPUs execution and is disruptive to customers whereas splitting on VM ioctl threads can run in parallel with vCPU execution. 2. Splitting all huge pages at once is more efficient because it does not require performing VM-exit handling or walking the page table for every 4KiB page in the memslot, and greatly reduces the amount of contention on the mmu_lock. For example, when running dirty_log_perf_test with 96 virtual CPUs, 1GiB per vCPU, and 1GiB HugeTLB memory, the time it takes vCPUs to write to all of their memory after dirty logging is enabled decreased by 95% from 2.94s to 0.14s. Eager Page Splitting is over 100x more efficient than the current implementation of splitting on fault under the read lock. For example, taking the same workload as above, Eager Page Splitting reduced the CPU required to split all huge pages from ~270 CPU-seconds ((2.94s - 0.14s) * 96 vCPU threads) to only 1.55 CPU-seconds. Eager page splitting does increase the amount of time it takes to enable dirty logging since it has split all huge pages. For example, the time it took to enable dirty logging in the 96GiB region of the aforementioned test increased from 0.001s to 1.55s. Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-16-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-10 13:50:42 -05:00
David Matlack	a82070b6e7	KVM: x86/mmu: Separate TDP MMU shadow page allocation and initialization Separate the allocation of shadow pages from their initialization. This is in preparation for splitting huge pages outside of the vCPU fault context, which requires a different allocation mechanism. No functional changed intended. Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-15-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-10 13:50:41 -05:00
David Matlack	a3aca4de0d	KVM: x86/mmu: Derive page role for TDP MMU shadow pages from parent Derive the page role from the parent shadow page, since the only thing that changes is the level. This is in preparation for splitting huge pages during VM-ioctls which do not have access to the vCPU MMU context. No functional change intended. Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-14-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-10 13:50:41 -05:00
David Matlack	a81399a573	KVM: x86/mmu: Remove redundant role overrides for TDP MMU shadow pages The vCPU's mmu_role already has the correct values for direct, has_4_byte_gpte, access, and ad_disabled. Remove the code that was redundantly overwriting these fields with the same values. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-13-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-10 13:50:41 -05:00
David Matlack	77aa60753a	KVM: x86/mmu: Refactor TDP MMU iterators to take kvm_mmu_page root Instead of passing a pointer to the root page table and the root level separately, pass in a pointer to the root kvm_mmu_page struct. This reduces the number of arguments by 1, cutting down on line lengths. No functional change intended. Reviewed-by: Ben Gardon <bgardon@google.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-12-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-10 13:50:40 -05:00
David Matlack	7b7e1ab6fd	KVM: x86/mmu: Consolidate logic to atomically install a new TDP MMU page table Consolidate the logic to atomically replace an SPTE with an SPTE that points to a new page table into a single helper function. This will be used in a follow-up commit to split huge pages, which involves replacing each huge page SPTE with an SPTE that points to a page table. Opportunistically drop the call to trace_kvm_mmu_get_page() in kvm_tdp_mmu_map() since it is redundant with the identical tracepoint in tdp_mmu_alloc_sp(). No functional change intended. Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-8-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-10 13:50:39 -05:00
David Matlack	0f53dfa34e	KVM: x86/mmu: Rename handle_removed_tdp_mmu_page() to handle_removed_pt() First remove tdp_mmu_ from the name since it is redundant given that it is a static function in tdp_mmu.c. There is a pattern of using tdp_mmu_ as a prefix in the names of static TDP MMU functions, but all of the other handle_*() variants do not include such a prefix. So drop it entirely. Then change "page" to "pt" to convey that this is operating on a page table rather than an struct page. Purposely use "pt" instead of "sp" since this function takes the raw RCU-protected page table pointer as an argument rather than a pointer to the struct kvm_mmu_page. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-7-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-10 13:50:38 -05:00
David Matlack	c298a30c28	KVM: x86/mmu: Rename TDP MMU functions that handle shadow pages Rename 3 functions in tdp_mmu.c that handle shadow pages: alloc_tdp_mmu_page() -> tdp_mmu_alloc_sp() tdp_mmu_link_page() -> tdp_mmu_link_sp() tdp_mmu_unlink_page() -> tdp_mmu_unlink_sp() These changed make tdp_mmu a consistent prefix before the verb in the function name, and make it more clear that these functions deal with kvm_mmu_page structs rather than struct pages. One could argue that "shadow page" is the wrong term for a page table in the TDP MMU since it never actually shadows a guest page table. However, "shadow page" (or "sp" for short) has evolved to become the standard term in KVM when referring to a kvm_mmu_page struct, and its associated page table and other metadata, regardless of whether the page table shadows a guest page table. So this commit just makes the TDP MMU more consistent with the rest of KVM. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-6-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-10 13:50:38 -05:00
David Matlack	3e72c791fd	KVM: x86/mmu: Change tdp_mmu_{set,zap}_spte_atomic() to return 0/-EBUSY tdp_mmu_set_spte_atomic() and tdp_mmu_zap_spte_atomic() return a bool with true indicating the SPTE modification was successful and false indicating failure. Change these functions to return an int instead since that is the common practice. Opportunistically fix up the kernel-doc style for the Return section above tdp_mmu_set_spte_atomic(). No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-5-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-10 13:50:37 -05:00
David Matlack	3255530ab1	KVM: x86/mmu: Automatically update iter->old_spte if cmpxchg fails Consolidate a bunch of code that was manually re-reading the spte if the cmpxchg failed. There is no extra cost of doing this because we already have the spte value as a result of the cmpxchg (and in fact this eliminates re-reading the spte), and none of the call sites depend on iter->old_spte retaining the stale spte value. Reviewed-by: Ben Gardon <bgardon@google.com> Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-4-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-10 13:50:37 -05:00
David Matlack	115111efd9	KVM: x86/mmu: Check SPTE writable invariants when setting leaf SPTEs Check SPTE writable invariants when setting SPTEs rather than in spte_can_locklessly_be_made_writable(). By the time KVM checks spte_can_locklessly_be_made_writable(), the SPTE has long been since corrupted. Note that these invariants only apply to shadow-present leaf SPTEs (i.e. not to MMIO SPTEs, non-leaf SPTEs, etc.). Add a comment explaining the restriction and only instrument the code paths that set shadow-present leaf SPTEs. To account for access tracking, also check the SPTE writable invariants when marking an SPTE as an access track SPTE. This also lets us remove a redundant WARN from mark_spte_for_access_track(). Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220125230518.1697048-3-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-10 13:50:32 -05:00
Jinrong Liang	ad6d6b949e	KVM: x86/tdp_mmu: Remove unused "kvm" of kvm_tdp_mmu_get_root() The "struct kvm *kvm" parameter of kvm_tdp_mmu_get_root() is not used, so remove it. No functional change intended. Signed-off-by: Jinrong Liang <cloudliang@tencent.com> Message-Id: <20220125095909.38122-5-cloudliang@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-10 13:47:10 -05:00
Sean Christopherson	d62007edf0	KVM: x86/mmu: Zap _all_ roots when unmapping gfn range in TDP MMU Zap both valid and invalid roots when zapping/unmapping a gfn range, as KVM must ensure it holds no references to the freed page after returning from the unmap operation. Most notably, the TDP MMU doesn't zap invalid roots in mmu_notifier callbacks. This leads to use-after-free and other issues if the mmu_notifier runs to completion while an invalid root zapper yields as KVM fails to honor the requirement that there must be _no_ references to the page after the mmu_notifier returns. The bug is most easily reproduced by hacking KVM to cause a collision between set_nx_huge_pages() and kvm_mmu_notifier_release(), but the bug exists between kvm_mmu_notifier_invalidate_range_start() and memslot updates as well. Invalidating a root ensures pages aren't accessible by the guest, and KVM won't read or write page data itself, but KVM will trigger e.g. kvm_set_pfn_dirty() when zapping SPTEs, and thus completing a zap of an invalid root _after_ the mmu_notifier returns is fatal. WARNING: CPU: 24 PID: 1496 at arch/x86/kvm/../../../virt/kvm/kvm_main.c:173 [kvm] RIP: 0010:kvm_is_zone_device_pfn+0x96/0xa0 [kvm] Call Trace: <TASK> kvm_set_pfn_dirty+0xa8/0xe0 [kvm] __handle_changed_spte+0x2ab/0x5e0 [kvm] __handle_changed_spte+0x2ab/0x5e0 [kvm] __handle_changed_spte+0x2ab/0x5e0 [kvm] zap_gfn_range+0x1f3/0x310 [kvm] kvm_tdp_mmu_zap_invalidated_roots+0x50/0x90 [kvm] kvm_mmu_zap_all_fast+0x177/0x1a0 [kvm] set_nx_huge_pages+0xb4/0x190 [kvm] param_attr_store+0x70/0x100 module_attr_store+0x19/0x30 kernfs_fop_write_iter+0x119/0x1b0 new_sync_write+0x11c/0x1b0 vfs_write+0x1cc/0x270 ksys_write+0x5f/0xe0 do_syscall_64+0x38/0xc0 entry_SYSCALL_64_after_hwframe+0x44/0xae </TASK> Fixes: `b7cccd397f` ("KVM: x86/mmu: Fast invalidation for TDP MMU") Cc: stable@vger.kernel.org Cc: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211215011557.399940-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-10 13:47:07 -05:00
Sean Christopherson	04dc4e6ce2	KVM: x86/mmu: Move "invalid" check out of kvm_tdp_mmu_get_root() Move the check for an invalid root out of kvm_tdp_mmu_get_root() and into the one place it actually matters, tdp_mmu_next_root(), as the other user already has an implicit validity check. A future bug fix will need to get references to invalid roots to honor mmu_notifier requests; there's no point in forcing what will be a common path to open code getting a reference to a root. No functional change intended. Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211215011557.399940-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-10 13:47:07 -05:00
Sean Christopherson	83b83a0207	KVM: x86/mmu: Use common TDP MMU zap helper for MMU notifier unmap hook Use the common TDP MMU zap helper when handling an MMU notifier unmap event, the two flows are semantically identical. Consolidate the code in preparation for a future bug fix, as both kvm_tdp_mmu_unmap_gfn_range() and __kvm_tdp_mmu_zap_gfn_range() are guilty of not zapping SPTEs in invalid roots. No functional change intended. Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211215011557.399940-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-10 13:47:06 -05:00
David Matlack	7c8a4742c4	KVM: x86/mmu: Fix write-protection of PTs mapped by the TDP MMU When the TDP MMU is write-protection GFNs for page table protection (as opposed to for dirty logging, or due to the HVA not being writable), it checks if the SPTE is already write-protected and if so skips modifying the SPTE and the TLB flush. This behavior is incorrect because it fails to check if the SPTE is write-protected for page table protection, i.e. fails to check that MMU-writable is '0'. If the SPTE was write-protected for dirty logging but not page table protection, the SPTE could locklessly be made writable, and vCPUs could still be running with writable mappings cached in their TLB. Fix this by only skipping setting the SPTE if the SPTE is already write-protected and MMU-writable is already clear. Technically, checking only MMU-writable would suffice; a SPTE cannot be writable without MMU-writable being set. But check both to be paranoid and because it arguably yields more readable code. Fixes: `46044f72c3` ("kvm: x86/mmu: Support write protection for nesting in tdp MMU") Cc: stable@vger.kernel.org Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220113233020.3986005-2-dmatlack@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-01-19 12:06:26 -05:00
Paolo Bonzini	855fb0384a	Merge remote-tracking branch 'kvm/master' into HEAD Pick commit `fdba608f15` ("KVM: VMX: Wake vCPU when delivering posted IRQ even if vCPU == this vCPU"). In addition to fixing a bug, it also aligns the non-nested and nested usage of triggering posted interrupts, allowing for additional cleanups. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-21 12:51:09 -05:00
Sean Christopherson	3a0f64de47	KVM: x86/mmu: Don't advance iterator after restart due to yielding After dropping mmu_lock in the TDP MMU, restart the iterator during tdp_iter_next() and do not advance the iterator. Advancing the iterator results in skipping the top-level SPTE and all its children, which is fatal if any of the skipped SPTEs were not visited before yielding. When zapping all SPTEs, i.e. when min_level == root_level, restarting the iter and then invoking tdp_iter_next() is always fatal if the current gfn has as a valid SPTE, as advancing the iterator results in try_step_side() skipping the current gfn, which wasn't visited before yielding. Sprinkle WARNs on iter->yielded being true in various helpers that are often used in conjunction with yielding, and tag the helper with __must_check to reduce the probabily of improper usage. Failing to zap a top-level SPTE manifests in one of two ways. If a valid SPTE is skipped by both kvm_tdp_mmu_zap_all() and kvm_tdp_mmu_put_root(), the shadow page will be leaked and KVM will WARN accordingly. WARNING: CPU: 1 PID: 3509 at arch/x86/kvm/mmu/tdp_mmu.c:46 [kvm] RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x3e/0x50 [kvm] Call Trace: <TASK> kvm_arch_destroy_vm+0x130/0x1b0 [kvm] kvm_destroy_vm+0x162/0x2a0 [kvm] kvm_vcpu_release+0x34/0x60 [kvm] __fput+0x82/0x240 task_work_run+0x5c/0x90 do_exit+0x364/0xa10 ? futex_unqueue+0x38/0x60 do_group_exit+0x33/0xa0 get_signal+0x155/0x850 arch_do_signal_or_restart+0xed/0x750 exit_to_user_mode_prepare+0xc5/0x120 syscall_exit_to_user_mode+0x1d/0x40 do_syscall_64+0x48/0xc0 entry_SYSCALL_64_after_hwframe+0x44/0xae If kvm_tdp_mmu_zap_all() skips a gfn/SPTE but that SPTE is then zapped by kvm_tdp_mmu_put_root(), KVM triggers a use-after-free in the form of marking a struct page as dirty/accessed after it has been put back on the free list. This directly triggers a WARN due to encountering a page with page_count() == 0, but it can also lead to data corruption and additional errors in the kernel. WARNING: CPU: 7 PID: 1995658 at arch/x86/kvm/../../../virt/kvm/kvm_main.c:171 RIP: 0010:kvm_is_zone_device_pfn.part.0+0x9e/0xd0 [kvm] Call Trace: <TASK> kvm_set_pfn_dirty+0x120/0x1d0 [kvm] __handle_changed_spte+0x92e/0xca0 [kvm] __handle_changed_spte+0x63c/0xca0 [kvm] __handle_changed_spte+0x63c/0xca0 [kvm] __handle_changed_spte+0x63c/0xca0 [kvm] zap_gfn_range+0x549/0x620 [kvm] kvm_tdp_mmu_put_root+0x1b6/0x270 [kvm] mmu_free_root_page+0x219/0x2c0 [kvm] kvm_mmu_free_roots+0x1b4/0x4e0 [kvm] kvm_mmu_unload+0x1c/0xa0 [kvm] kvm_arch_destroy_vm+0x1f2/0x5c0 [kvm] kvm_put_kvm+0x3b1/0x8b0 [kvm] kvm_vcpu_release+0x4e/0x70 [kvm] __fput+0x1f7/0x8c0 task_work_run+0xf8/0x1a0 do_exit+0x97b/0x2230 do_group_exit+0xda/0x2a0 get_signal+0x3be/0x1e50 arch_do_signal_or_restart+0x244/0x17f0 exit_to_user_mode_prepare+0xcb/0x120 syscall_exit_to_user_mode+0x1d/0x40 do_syscall_64+0x4d/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae Note, the underlying bug existed even before commit `1af4a96025` ("KVM: x86/mmu: Yield in TDU MMU iter even if no SPTES changed") moved calls to tdp_mmu_iter_cond_resched() to the beginning of loops, as KVM could still incorrectly advance past a top-level entry when yielding on a lower-level entry. But with respect to leaking shadow pages, the bug was introduced by yielding before processing the current gfn. Alternatively, tdp_mmu_iter_cond_resched() could simply fall through, or callers could jump to their "retry" label. The downside of that approach is that tdp_mmu_iter_cond_resched() _must_ be called before anything else in the loop, and there's no easy way to enfornce that requirement. Ideally, KVM would handling the cond_resched() fully within the iterator macro (the code is actually quite clean) and avoid this entire class of bugs, but that is extremely difficult do while also supporting yielding after tdp_mmu_set_spte_atomic() fails. Yielding after failing to set a SPTE is very desirable as the "owner" of the REMOVED_SPTE isn't strictly bounded, e.g. if it's zapping a high-level shadow page, the REMOVED_SPTE may block operations on the SPTE for a significant amount of time. Fixes: `faaf05b00a` ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU") Fixes: `1af4a96025` ("KVM: x86/mmu: Yield in TDU MMU iter even if no SPTES changed") Reported-by: Ignat Korchagin <ignat@cloudflare.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211214033528.123268-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-20 08:06:53 -05:00

1 2 3 4

190 Commits