2020-10-15 02:26:43 +08:00
|
|
|
// SPDX-License-Identifier: GPL-2.0
|
KVM: x86: Unify pr_fmt to use module name for all KVM modules
Define pr_fmt using KBUILD_MODNAME for all KVM x86 code so that printks
use consistent formatting across common x86, Intel, and AMD code. In
addition to providing consistent print formatting, using KBUILD_MODNAME,
e.g. kvm_amd and kvm_intel, allows referencing SVM and VMX (and SEV and
SGX and ...) as technologies without generating weird messages, and
without causing naming conflicts with other kernel code, e.g. "SEV: ",
"tdx: ", "sgx: " etc.. are all used by the kernel for non-KVM subsystems.
Opportunistically move away from printk() for prints that need to be
modified anyways, e.g. to drop a manual "kvm: " prefix.
Opportunistically convert a few SGX WARNs that are similarly modified to
WARN_ONCE; in the very unlikely event that the WARNs fire, odds are good
that they would fire repeatedly and spam the kernel log without providing
unique information in each print.
Note, defining pr_fmt yields undesirable results for code that uses KVM's
printk wrappers, e.g. vcpu_unimpl(). But, that's a pre-existing problem
as SVM/kvm_amd already defines a pr_fmt, and thankfully use of KVM's
wrappers is relatively limited in KVM x86 code.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Message-Id: <20221130230934.1014142-35-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-01 07:09:18 +08:00
|
|
|
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
|
2020-10-15 02:26:43 +08:00
|
|
|
|
2020-10-15 02:26:44 +08:00
|
|
|
#include "mmu.h"
|
|
|
|
#include "mmu_internal.h"
|
2020-10-15 02:26:50 +08:00
|
|
|
#include "mmutrace.h"
|
2020-10-15 02:26:45 +08:00
|
|
|
#include "tdp_iter.h"
|
2020-10-15 02:26:43 +08:00
|
|
|
#include "tdp_mmu.h"
|
2020-10-15 02:26:44 +08:00
|
|
|
#include "spte.h"
|
2020-10-15 02:26:43 +08:00
|
|
|
|
2021-02-03 02:57:26 +08:00
|
|
|
#include <asm/cmpxchg.h>
|
2020-10-28 01:59:43 +08:00
|
|
|
#include <trace/events/kvm.h>
|
|
|
|
|
2020-10-15 02:26:43 +08:00
|
|
|
/* Initializes the TDP MMU for the VM, if enabled. */
|
2022-03-26 00:42:52 +08:00
|
|
|
int kvm_mmu_init_tdp_mmu(struct kvm *kvm)
|
2020-10-15 02:26:43 +08:00
|
|
|
{
|
2022-03-26 00:42:52 +08:00
|
|
|
struct workqueue_struct *wq;
|
|
|
|
|
|
|
|
wq = alloc_workqueue("kvm", WQ_UNBOUND|WQ_MEM_RECLAIM|WQ_CPU_INTENSIVE, 0);
|
|
|
|
if (!wq)
|
|
|
|
return -ENOMEM;
|
2020-10-15 02:26:43 +08:00
|
|
|
|
2020-10-15 02:26:44 +08:00
|
|
|
INIT_LIST_HEAD(&kvm->arch.tdp_mmu_roots);
|
2021-02-03 02:57:26 +08:00
|
|
|
spin_lock_init(&kvm->arch.tdp_mmu_pages_lock);
|
2022-03-26 00:42:52 +08:00
|
|
|
kvm->arch.tdp_mmu_zap_wq = wq;
|
|
|
|
return 1;
|
2020-10-15 02:26:43 +08:00
|
|
|
}
|
|
|
|
|
2022-02-26 08:15:24 +08:00
|
|
|
/* Arbitrarily returns true so that this may be used in if statements. */
|
|
|
|
static __always_inline bool kvm_lockdep_assert_mmu_lock_held(struct kvm *kvm,
|
2021-04-02 07:37:32 +08:00
|
|
|
bool shared)
|
|
|
|
{
|
|
|
|
if (shared)
|
|
|
|
lockdep_assert_held_read(&kvm->mmu_lock);
|
|
|
|
else
|
|
|
|
lockdep_assert_held_write(&kvm->mmu_lock);
|
2022-02-26 08:15:24 +08:00
|
|
|
|
|
|
|
return true;
|
2021-04-02 07:37:32 +08:00
|
|
|
}
|
|
|
|
|
2020-10-15 02:26:43 +08:00
|
|
|
void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
|
|
|
|
{
|
KVM: x86: Preserve TDP MMU roots until they are explicitly invalidated
Preserve TDP MMU roots until they are explicitly invalidated by gifting
the TDP MMU itself a reference to a root when it is allocated. Keeping a
reference in the TDP MMU fixes a flaw where the TDP MMU exhibits terrible
performance, and can potentially even soft-hang a vCPU, if a vCPU
frequently unloads its roots, e.g. when KVM is emulating SMI+RSM.
When KVM emulates something that invalidates _all_ TLB entries, e.g. SMI
and RSM, KVM unloads all of the vCPUs roots (KVM keeps a small per-vCPU
cache of previous roots). Unloading roots is a simple way to ensure KVM
flushes and synchronizes all roots for the vCPU, as KVM flushes and syncs
when allocating a "new" root (from the vCPU's perspective).
In the shadow MMU, KVM keeps track of all shadow pages, roots included, in
a per-VM hash table. Unloading a shadow MMU root just wipes it from the
per-vCPU cache; the root is still tracked in the per-VM hash table. When
KVM loads a "new" root for the vCPU, KVM will find the old, unloaded root
in the per-VM hash table.
Unlike the shadow MMU, the TDP MMU doesn't track "inactive" roots in a
per-VM structure, where "active" in this case means a root is either
in-use or cached as a previous root by at least one vCPU. When a TDP MMU
root becomes inactive, i.e. the last vCPU reference to the root is put,
KVM immediately frees the root (asterisk on "immediately" as the actual
freeing may be done by a worker, but for all intents and purposes the root
is gone).
The TDP MMU behavior is especially problematic for 1-vCPU setups, as
unloading all roots effectively frees all roots. The issue is mitigated
to some degree in multi-vCPU setups as a different vCPU usually holds a
reference to an unloaded root and thus keeps the root alive, allowing the
vCPU to reuse its old root after unloading (with a flush+sync).
The TDP MMU flaw has been known for some time, as until very recently,
KVM's handling of CR0.WP also triggered unloading of all roots. The
CR0.WP toggling scenario was eventually addressed by not unloading roots
when _only_ CR0.WP is toggled, but such an approach doesn't Just Work
for emulating SMM as KVM must emulate a full TLB flush on entry and exit
to/from SMM. Given that the shadow MMU plays nice with unloading roots
at will, teaching the TDP MMU to do the same is far less complex than
modifying KVM to track which roots need to be flushed before reuse.
Note, preserving all possible TDP MMU roots is not a concern with respect
to memory consumption. Now that the role for direct MMUs doesn't include
information about the guest, e.g. CR0.PG, CR0.WP, CR4.SMEP, etc., there
are _at most_ six possible roots (where "guest_mode" here means L2):
1. 4-level !SMM !guest_mode
2. 4-level SMM !guest_mode
3. 5-level !SMM !guest_mode
4. 5-level SMM !guest_mode
5. 4-level !SMM guest_mode
6. 5-level !SMM guest_mode
And because each vCPU can track 4 valid roots, a VM can already have all
6 root combinations live at any given time. Not to mention that, in
practice, no sane VMM will advertise different guest.MAXPHYADDR values
across vCPUs, i.e. KVM won't ever use both 4-level and 5-level roots for
a single VM. Furthermore, the vast majority of modern hypervisors will
utilize EPT/NPT when available, thus the guest_mode=%true cases are also
unlikely to be utilized.
Reported-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Link: https://lore.kernel.org/all/959c5bce-beb5-b463-7158-33fc4a4f910c@linux.microsoft.com
Link: https://lkml.kernel.org/r/20220209170020.1775368-1-pbonzini%40redhat.com
Link: https://lore.kernel.org/all/20230322013731.102955-1-minipli@grsecurity.net
Link: https://lore.kernel.org/all/000000000000a0bc2b05f9dd7fab@google.com
Link: https://lore.kernel.org/all/000000000000eca0b905fa0f7756@google.com
Cc: Ben Gardon <bgardon@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: stable@vger.kernel.org
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Link: https://lore.kernel.org/r/20230426220323.3079789-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-27 06:03:23 +08:00
|
|
|
/*
|
|
|
|
* Invalidate all roots, which besides the obvious, schedules all roots
|
|
|
|
* for zapping and thus puts the TDP MMU's reference to each root, i.e.
|
|
|
|
* ultimately frees all roots.
|
|
|
|
*/
|
|
|
|
kvm_tdp_mmu_invalidate_all_roots(kvm);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Destroying a workqueue also first flushes the workqueue, i.e. no
|
|
|
|
* need to invoke kvm_tdp_mmu_zap_invalidated_roots().
|
|
|
|
*/
|
2022-03-03 01:02:07 +08:00
|
|
|
destroy_workqueue(kvm->arch.tdp_mmu_zap_wq);
|
|
|
|
|
2022-10-20 00:56:15 +08:00
|
|
|
WARN_ON(atomic64_read(&kvm->arch.tdp_mmu_pages));
|
2020-10-15 02:26:44 +08:00
|
|
|
WARN_ON(!list_empty(&kvm->arch.tdp_mmu_roots));
|
2021-02-03 02:57:23 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Ensure that all the outstanding RCU callbacks to free shadow pages
|
2022-03-03 01:02:07 +08:00
|
|
|
* can run before the VM is torn down. Work items on tdp_mmu_zap_wq
|
|
|
|
* can call kvm_tdp_mmu_put_root and create new callbacks.
|
2021-02-03 02:57:23 +08:00
|
|
|
*/
|
|
|
|
rcu_barrier();
|
2020-10-15 02:26:44 +08:00
|
|
|
}
|
|
|
|
|
2021-04-02 07:37:27 +08:00
|
|
|
static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
|
2021-01-07 08:19:34 +08:00
|
|
|
{
|
2021-04-02 07:37:27 +08:00
|
|
|
free_page((unsigned long)sp->spt);
|
|
|
|
kmem_cache_free(mmu_page_header_cache, sp);
|
2021-01-07 08:19:34 +08:00
|
|
|
}
|
|
|
|
|
2021-04-02 07:37:31 +08:00
|
|
|
/*
|
|
|
|
* This is called through call_rcu in order to free TDP page table memory
|
|
|
|
* safely with respect to other kernel threads that may be operating on
|
|
|
|
* the memory.
|
|
|
|
* By only accessing TDP MMU page table memory in an RCU read critical
|
|
|
|
* section, and freeing it after a grace period, lockless access to that
|
|
|
|
* memory won't use it after it is freed.
|
|
|
|
*/
|
|
|
|
static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
|
2021-01-07 08:19:34 +08:00
|
|
|
{
|
2021-04-02 07:37:31 +08:00
|
|
|
struct kvm_mmu_page *sp = container_of(head, struct kvm_mmu_page,
|
|
|
|
rcu_head);
|
2021-01-07 08:19:34 +08:00
|
|
|
|
2021-04-02 07:37:31 +08:00
|
|
|
tdp_mmu_free_sp(sp);
|
|
|
|
}
|
2021-01-07 08:19:34 +08:00
|
|
|
|
2022-02-26 08:15:33 +08:00
|
|
|
static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
|
|
|
|
bool shared);
|
|
|
|
|
2022-03-03 01:02:07 +08:00
|
|
|
static void tdp_mmu_zap_root_work(struct work_struct *work)
|
|
|
|
{
|
|
|
|
struct kvm_mmu_page *root = container_of(work, struct kvm_mmu_page,
|
|
|
|
tdp_mmu_async_work);
|
|
|
|
struct kvm *kvm = root->tdp_mmu_async_data;
|
|
|
|
|
|
|
|
read_lock(&kvm->mmu_lock);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* A TLB flush is not necessary as KVM performs a local TLB flush when
|
|
|
|
* allocating a new root (see kvm_mmu_load()), and when migrating vCPU
|
|
|
|
* to a different pCPU. Note, the local TLB flush on reuse also
|
|
|
|
* invalidates any paging-structure-cache entries, i.e. TLB entries for
|
|
|
|
* intermediate paging structures, that may be zapped, as such entries
|
|
|
|
* are associated with the ASID on both VMX and SVM.
|
|
|
|
*/
|
|
|
|
tdp_mmu_zap_root(kvm, root, true);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Drop the refcount using kvm_tdp_mmu_put_root() to test its logic for
|
|
|
|
* avoiding an infinite loop. By design, the root is reachable while
|
|
|
|
* it's being asynchronously zapped, thus a different task can put its
|
|
|
|
* last reference, i.e. flowing through kvm_tdp_mmu_put_root() for an
|
|
|
|
* asynchronously zapped root is unavoidable.
|
|
|
|
*/
|
|
|
|
kvm_tdp_mmu_put_root(kvm, root, true);
|
|
|
|
|
|
|
|
read_unlock(&kvm->mmu_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void tdp_mmu_schedule_zap_root(struct kvm *kvm, struct kvm_mmu_page *root)
|
|
|
|
{
|
|
|
|
root->tdp_mmu_async_data = kvm;
|
|
|
|
INIT_WORK(&root->tdp_mmu_async_work, tdp_mmu_zap_root_work);
|
|
|
|
queue_work(kvm->arch.tdp_mmu_zap_wq, &root->tdp_mmu_async_work);
|
|
|
|
}
|
|
|
|
|
2021-04-02 07:37:32 +08:00
|
|
|
void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
|
|
|
|
bool shared)
|
2021-04-02 07:37:27 +08:00
|
|
|
{
|
2021-04-02 07:37:32 +08:00
|
|
|
kvm_lockdep_assert_mmu_lock_held(kvm, shared);
|
2021-01-07 08:19:34 +08:00
|
|
|
|
2021-04-02 07:37:29 +08:00
|
|
|
if (!refcount_dec_and_test(&root->tdp_mmu_root_count))
|
2021-04-02 07:37:27 +08:00
|
|
|
return;
|
|
|
|
|
2022-02-26 08:15:22 +08:00
|
|
|
/*
|
KVM: x86: Preserve TDP MMU roots until they are explicitly invalidated
Preserve TDP MMU roots until they are explicitly invalidated by gifting
the TDP MMU itself a reference to a root when it is allocated. Keeping a
reference in the TDP MMU fixes a flaw where the TDP MMU exhibits terrible
performance, and can potentially even soft-hang a vCPU, if a vCPU
frequently unloads its roots, e.g. when KVM is emulating SMI+RSM.
When KVM emulates something that invalidates _all_ TLB entries, e.g. SMI
and RSM, KVM unloads all of the vCPUs roots (KVM keeps a small per-vCPU
cache of previous roots). Unloading roots is a simple way to ensure KVM
flushes and synchronizes all roots for the vCPU, as KVM flushes and syncs
when allocating a "new" root (from the vCPU's perspective).
In the shadow MMU, KVM keeps track of all shadow pages, roots included, in
a per-VM hash table. Unloading a shadow MMU root just wipes it from the
per-vCPU cache; the root is still tracked in the per-VM hash table. When
KVM loads a "new" root for the vCPU, KVM will find the old, unloaded root
in the per-VM hash table.
Unlike the shadow MMU, the TDP MMU doesn't track "inactive" roots in a
per-VM structure, where "active" in this case means a root is either
in-use or cached as a previous root by at least one vCPU. When a TDP MMU
root becomes inactive, i.e. the last vCPU reference to the root is put,
KVM immediately frees the root (asterisk on "immediately" as the actual
freeing may be done by a worker, but for all intents and purposes the root
is gone).
The TDP MMU behavior is especially problematic for 1-vCPU setups, as
unloading all roots effectively frees all roots. The issue is mitigated
to some degree in multi-vCPU setups as a different vCPU usually holds a
reference to an unloaded root and thus keeps the root alive, allowing the
vCPU to reuse its old root after unloading (with a flush+sync).
The TDP MMU flaw has been known for some time, as until very recently,
KVM's handling of CR0.WP also triggered unloading of all roots. The
CR0.WP toggling scenario was eventually addressed by not unloading roots
when _only_ CR0.WP is toggled, but such an approach doesn't Just Work
for emulating SMM as KVM must emulate a full TLB flush on entry and exit
to/from SMM. Given that the shadow MMU plays nice with unloading roots
at will, teaching the TDP MMU to do the same is far less complex than
modifying KVM to track which roots need to be flushed before reuse.
Note, preserving all possible TDP MMU roots is not a concern with respect
to memory consumption. Now that the role for direct MMUs doesn't include
information about the guest, e.g. CR0.PG, CR0.WP, CR4.SMEP, etc., there
are _at most_ six possible roots (where "guest_mode" here means L2):
1. 4-level !SMM !guest_mode
2. 4-level SMM !guest_mode
3. 5-level !SMM !guest_mode
4. 5-level SMM !guest_mode
5. 4-level !SMM guest_mode
6. 5-level !SMM guest_mode
And because each vCPU can track 4 valid roots, a VM can already have all
6 root combinations live at any given time. Not to mention that, in
practice, no sane VMM will advertise different guest.MAXPHYADDR values
across vCPUs, i.e. KVM won't ever use both 4-level and 5-level roots for
a single VM. Furthermore, the vast majority of modern hypervisors will
utilize EPT/NPT when available, thus the guest_mode=%true cases are also
unlikely to be utilized.
Reported-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Link: https://lore.kernel.org/all/959c5bce-beb5-b463-7158-33fc4a4f910c@linux.microsoft.com
Link: https://lkml.kernel.org/r/20220209170020.1775368-1-pbonzini%40redhat.com
Link: https://lore.kernel.org/all/20230322013731.102955-1-minipli@grsecurity.net
Link: https://lore.kernel.org/all/000000000000a0bc2b05f9dd7fab@google.com
Link: https://lore.kernel.org/all/000000000000eca0b905fa0f7756@google.com
Cc: Ben Gardon <bgardon@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: stable@vger.kernel.org
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Link: https://lore.kernel.org/r/20230426220323.3079789-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-27 06:03:23 +08:00
|
|
|
* The TDP MMU itself holds a reference to each root until the root is
|
|
|
|
* explicitly invalidated, i.e. the final reference should be never be
|
|
|
|
* put for a valid root.
|
2022-02-26 08:15:22 +08:00
|
|
|
*/
|
KVM: x86: Preserve TDP MMU roots until they are explicitly invalidated
Preserve TDP MMU roots until they are explicitly invalidated by gifting
the TDP MMU itself a reference to a root when it is allocated. Keeping a
reference in the TDP MMU fixes a flaw where the TDP MMU exhibits terrible
performance, and can potentially even soft-hang a vCPU, if a vCPU
frequently unloads its roots, e.g. when KVM is emulating SMI+RSM.
When KVM emulates something that invalidates _all_ TLB entries, e.g. SMI
and RSM, KVM unloads all of the vCPUs roots (KVM keeps a small per-vCPU
cache of previous roots). Unloading roots is a simple way to ensure KVM
flushes and synchronizes all roots for the vCPU, as KVM flushes and syncs
when allocating a "new" root (from the vCPU's perspective).
In the shadow MMU, KVM keeps track of all shadow pages, roots included, in
a per-VM hash table. Unloading a shadow MMU root just wipes it from the
per-vCPU cache; the root is still tracked in the per-VM hash table. When
KVM loads a "new" root for the vCPU, KVM will find the old, unloaded root
in the per-VM hash table.
Unlike the shadow MMU, the TDP MMU doesn't track "inactive" roots in a
per-VM structure, where "active" in this case means a root is either
in-use or cached as a previous root by at least one vCPU. When a TDP MMU
root becomes inactive, i.e. the last vCPU reference to the root is put,
KVM immediately frees the root (asterisk on "immediately" as the actual
freeing may be done by a worker, but for all intents and purposes the root
is gone).
The TDP MMU behavior is especially problematic for 1-vCPU setups, as
unloading all roots effectively frees all roots. The issue is mitigated
to some degree in multi-vCPU setups as a different vCPU usually holds a
reference to an unloaded root and thus keeps the root alive, allowing the
vCPU to reuse its old root after unloading (with a flush+sync).
The TDP MMU flaw has been known for some time, as until very recently,
KVM's handling of CR0.WP also triggered unloading of all roots. The
CR0.WP toggling scenario was eventually addressed by not unloading roots
when _only_ CR0.WP is toggled, but such an approach doesn't Just Work
for emulating SMM as KVM must emulate a full TLB flush on entry and exit
to/from SMM. Given that the shadow MMU plays nice with unloading roots
at will, teaching the TDP MMU to do the same is far less complex than
modifying KVM to track which roots need to be flushed before reuse.
Note, preserving all possible TDP MMU roots is not a concern with respect
to memory consumption. Now that the role for direct MMUs doesn't include
information about the guest, e.g. CR0.PG, CR0.WP, CR4.SMEP, etc., there
are _at most_ six possible roots (where "guest_mode" here means L2):
1. 4-level !SMM !guest_mode
2. 4-level SMM !guest_mode
3. 5-level !SMM !guest_mode
4. 5-level SMM !guest_mode
5. 4-level !SMM guest_mode
6. 5-level !SMM guest_mode
And because each vCPU can track 4 valid roots, a VM can already have all
6 root combinations live at any given time. Not to mention that, in
practice, no sane VMM will advertise different guest.MAXPHYADDR values
across vCPUs, i.e. KVM won't ever use both 4-level and 5-level roots for
a single VM. Furthermore, the vast majority of modern hypervisors will
utilize EPT/NPT when available, thus the guest_mode=%true cases are also
unlikely to be utilized.
Reported-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Link: https://lore.kernel.org/all/959c5bce-beb5-b463-7158-33fc4a4f910c@linux.microsoft.com
Link: https://lkml.kernel.org/r/20220209170020.1775368-1-pbonzini%40redhat.com
Link: https://lore.kernel.org/all/20230322013731.102955-1-minipli@grsecurity.net
Link: https://lore.kernel.org/all/000000000000a0bc2b05f9dd7fab@google.com
Link: https://lore.kernel.org/all/000000000000eca0b905fa0f7756@google.com
Cc: Ben Gardon <bgardon@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: stable@vger.kernel.org
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Link: https://lore.kernel.org/r/20230426220323.3079789-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-27 06:03:23 +08:00
|
|
|
KVM_BUG_ON(!is_tdp_mmu_page(root) || !root->role.invalid, kvm);
|
2021-04-02 07:37:27 +08:00
|
|
|
|
KVM: x86/mmu: Allow yielding when zapping GFNs for defunct TDP MMU root
Allow yielding when zapping SPTEs after the last reference to a valid
root is put. Because KVM must drop all SPTEs in response to relevant
mmu_notifier events, mark defunct roots invalid and reset their refcount
prior to zapping the root. Keeping the refcount elevated while the zap
is in-progress ensures the root is reachable via mmu_notifier until the
zap completes and the last reference to the invalid, defunct root is put.
Allowing kvm_tdp_mmu_put_root() to yield fixes soft lockup issues if the
root in being put has a massive paging structure, e.g. zapping a root
that is backed entirely by 4kb pages for a guest with 32tb of memory can
take hundreds of seconds to complete.
watchdog: BUG: soft lockup - CPU#49 stuck for 485s! [max_guest_memor:52368]
RIP: 0010:kvm_set_pfn_dirty+0x30/0x50 [kvm]
__handle_changed_spte+0x1b2/0x2f0 [kvm]
handle_removed_tdp_mmu_page+0x1a7/0x2b8 [kvm]
__handle_changed_spte+0x1f4/0x2f0 [kvm]
handle_removed_tdp_mmu_page+0x1a7/0x2b8 [kvm]
__handle_changed_spte+0x1f4/0x2f0 [kvm]
tdp_mmu_zap_root+0x307/0x4d0 [kvm]
kvm_tdp_mmu_put_root+0x7c/0xc0 [kvm]
kvm_mmu_free_roots+0x22d/0x350 [kvm]
kvm_mmu_reset_context+0x20/0x60 [kvm]
kvm_arch_vcpu_ioctl_set_sregs+0x5a/0xc0 [kvm]
kvm_vcpu_ioctl+0x5bd/0x710 [kvm]
__se_sys_ioctl+0x77/0xc0
__x64_sys_ioctl+0x1d/0x20
do_syscall_64+0x44/0xa0
entry_SYSCALL_64_after_hwframe+0x44/0xae
KVM currently doesn't put a root from a non-preemptible context, so other
than the mmu_notifier wrinkle, yielding when putting a root is safe.
Yield-unfriendly iteration uses for_each_tdp_mmu_root(), which doesn't
take a reference to each root (it requires mmu_lock be held for the
entire duration of the walk).
tdp_mmu_next_root() is used only by the yield-friendly iterator.
tdp_mmu_zap_root_work() is explicitly yield friendly.
kvm_mmu_free_roots() => mmu_free_root_page() is a much bigger fan-out,
but is still yield-friendly in all call sites, as all callers can be
traced back to some combination of vcpu_run(), kvm_destroy_vm(), and/or
kvm_create_vm().
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220226001546.360188-21-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-03-03 14:50:21 +08:00
|
|
|
spin_lock(&kvm->arch.tdp_mmu_pages_lock);
|
|
|
|
list_del_rcu(&root->link);
|
|
|
|
spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
|
2021-04-02 07:37:31 +08:00
|
|
|
call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
|
2021-01-07 08:19:34 +08:00
|
|
|
}
|
|
|
|
|
2021-04-02 07:37:28 +08:00
|
|
|
/*
|
2021-12-15 09:15:56 +08:00
|
|
|
* Returns the next root after @prev_root (or the first root if @prev_root is
|
|
|
|
* NULL). A reference to the returned root is acquired, and the reference to
|
|
|
|
* @prev_root is released (the caller obviously must hold a reference to
|
|
|
|
* @prev_root if it's non-NULL).
|
|
|
|
*
|
|
|
|
* If @only_valid is true, invalid roots are skipped.
|
|
|
|
*
|
|
|
|
* Returns NULL if the end of tdp_mmu_roots was reached.
|
2021-04-02 07:37:28 +08:00
|
|
|
*/
|
|
|
|
static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
|
2021-04-02 07:37:32 +08:00
|
|
|
struct kvm_mmu_page *prev_root,
|
2021-12-15 09:15:56 +08:00
|
|
|
bool shared, bool only_valid)
|
2021-01-07 08:19:34 +08:00
|
|
|
{
|
|
|
|
struct kvm_mmu_page *next_root;
|
|
|
|
|
2021-04-02 07:37:31 +08:00
|
|
|
rcu_read_lock();
|
|
|
|
|
2021-04-02 07:37:28 +08:00
|
|
|
if (prev_root)
|
2021-04-02 07:37:31 +08:00
|
|
|
next_root = list_next_or_null_rcu(&kvm->arch.tdp_mmu_roots,
|
|
|
|
&prev_root->link,
|
|
|
|
typeof(*prev_root), link);
|
2021-04-02 07:37:28 +08:00
|
|
|
else
|
2021-04-02 07:37:31 +08:00
|
|
|
next_root = list_first_or_null_rcu(&kvm->arch.tdp_mmu_roots,
|
|
|
|
typeof(*next_root), link);
|
2021-01-07 08:19:34 +08:00
|
|
|
|
2021-12-15 09:15:55 +08:00
|
|
|
while (next_root) {
|
2021-12-15 09:15:56 +08:00
|
|
|
if ((!only_valid || !next_root->role.invalid) &&
|
2022-01-25 17:58:54 +08:00
|
|
|
kvm_tdp_mmu_get_root(next_root))
|
2021-12-15 09:15:55 +08:00
|
|
|
break;
|
|
|
|
|
2021-04-02 07:37:31 +08:00
|
|
|
next_root = list_next_or_null_rcu(&kvm->arch.tdp_mmu_roots,
|
|
|
|
&next_root->link, typeof(*next_root), link);
|
2021-12-15 09:15:55 +08:00
|
|
|
}
|
2021-04-02 07:37:30 +08:00
|
|
|
|
2021-04-02 07:37:31 +08:00
|
|
|
rcu_read_unlock();
|
2021-01-07 08:19:34 +08:00
|
|
|
|
2021-04-02 07:37:28 +08:00
|
|
|
if (prev_root)
|
2021-04-02 07:37:32 +08:00
|
|
|
kvm_tdp_mmu_put_root(kvm, prev_root, shared);
|
2021-01-07 08:19:34 +08:00
|
|
|
|
|
|
|
return next_root;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Note: this iterator gets and puts references to the roots it iterates over.
|
|
|
|
* This makes it safe to release the MMU lock and yield within the loop, but
|
|
|
|
* if exiting the loop early, the caller must drop the reference to the most
|
|
|
|
* recent root. (Unless keeping a live reference is desirable.)
|
2021-04-02 07:37:32 +08:00
|
|
|
*
|
|
|
|
* If shared is set, this function is operating under the MMU lock in read
|
|
|
|
* mode. In the unlikely event that this thread must free a root, the lock
|
|
|
|
* will be temporarily dropped and reacquired in write mode.
|
2021-01-07 08:19:34 +08:00
|
|
|
*/
|
2021-12-15 09:15:56 +08:00
|
|
|
#define __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, _only_valid)\
|
|
|
|
for (_root = tdp_mmu_next_root(_kvm, NULL, _shared, _only_valid); \
|
|
|
|
_root; \
|
|
|
|
_root = tdp_mmu_next_root(_kvm, _root, _shared, _only_valid)) \
|
2022-03-02 21:51:05 +08:00
|
|
|
if (kvm_lockdep_assert_mmu_lock_held(_kvm, _shared) && \
|
|
|
|
kvm_mmu_page_as_id(_root) != _as_id) { \
|
2021-03-26 10:19:45 +08:00
|
|
|
} else
|
2021-01-07 08:19:34 +08:00
|
|
|
|
2021-12-15 09:15:56 +08:00
|
|
|
#define for_each_valid_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared) \
|
|
|
|
__for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, true)
|
|
|
|
|
2022-03-02 21:51:05 +08:00
|
|
|
#define for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id) \
|
|
|
|
__for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, false, false)
|
2021-12-15 09:15:56 +08:00
|
|
|
|
2022-02-26 08:15:24 +08:00
|
|
|
/*
|
|
|
|
* Iterate over all TDP MMU roots. Requires that mmu_lock be held for write,
|
|
|
|
* the implication being that any flow that holds mmu_lock for read is
|
|
|
|
* inherently yield-friendly and should use the yield-safe variant above.
|
|
|
|
* Holding mmu_lock for write obviates the need for RCU protection as the list
|
|
|
|
* is guaranteed to be stable.
|
|
|
|
*/
|
|
|
|
#define for_each_tdp_mmu_root(_kvm, _root, _as_id) \
|
|
|
|
list_for_each_entry(_root, &_kvm->arch.tdp_mmu_roots, link) \
|
|
|
|
if (kvm_lockdep_assert_mmu_lock_held(_kvm, false) && \
|
|
|
|
kvm_mmu_page_as_id(_root) != _as_id) { \
|
2021-03-26 10:19:45 +08:00
|
|
|
} else
|
2020-10-15 02:26:44 +08:00
|
|
|
|
2022-01-20 07:07:35 +08:00
|
|
|
static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
|
2020-10-15 02:26:44 +08:00
|
|
|
{
|
|
|
|
struct kvm_mmu_page *sp;
|
|
|
|
|
|
|
|
sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
|
|
|
|
sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
|
2022-01-20 07:07:35 +08:00
|
|
|
|
|
|
|
return sp;
|
|
|
|
}
|
|
|
|
|
2022-02-26 08:15:31 +08:00
|
|
|
static void tdp_mmu_init_sp(struct kvm_mmu_page *sp, tdp_ptep_t sptep,
|
|
|
|
gfn_t gfn, union kvm_mmu_page_role role)
|
2022-01-20 07:07:35 +08:00
|
|
|
{
|
2022-10-20 00:56:12 +08:00
|
|
|
INIT_LIST_HEAD(&sp->possible_nx_huge_page_link);
|
2022-10-20 00:56:11 +08:00
|
|
|
|
2020-10-15 02:26:44 +08:00
|
|
|
set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
|
|
|
|
|
2022-01-20 07:07:34 +08:00
|
|
|
sp->role = role;
|
2020-10-15 02:26:44 +08:00
|
|
|
sp->gfn = gfn;
|
2022-02-26 08:15:31 +08:00
|
|
|
sp->ptep = sptep;
|
2020-10-15 02:26:44 +08:00
|
|
|
sp->tdp_mmu_page = true;
|
|
|
|
|
2020-10-28 01:59:43 +08:00
|
|
|
trace_kvm_mmu_get_page(sp, true);
|
2020-10-15 02:26:44 +08:00
|
|
|
}
|
|
|
|
|
2022-01-20 07:07:35 +08:00
|
|
|
static void tdp_mmu_init_child_sp(struct kvm_mmu_page *child_sp,
|
|
|
|
struct tdp_iter *iter)
|
2020-10-15 02:26:44 +08:00
|
|
|
{
|
2022-01-20 07:07:34 +08:00
|
|
|
struct kvm_mmu_page *parent_sp;
|
2020-10-15 02:26:44 +08:00
|
|
|
union kvm_mmu_page_role role;
|
2022-01-20 07:07:34 +08:00
|
|
|
|
|
|
|
parent_sp = sptep_to_sp(rcu_dereference(iter->sptep));
|
|
|
|
|
|
|
|
role = parent_sp->role;
|
|
|
|
role.level--;
|
|
|
|
|
2022-02-26 08:15:31 +08:00
|
|
|
tdp_mmu_init_sp(child_sp, iter->sptep, iter->gfn, role);
|
2022-01-20 07:07:34 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
2022-02-14 21:46:24 +08:00
|
|
|
union kvm_mmu_page_role role = vcpu->arch.mmu->root_role;
|
2020-10-15 02:26:44 +08:00
|
|
|
struct kvm *kvm = vcpu->kvm;
|
|
|
|
struct kvm_mmu_page *root;
|
|
|
|
|
2021-03-05 09:10:50 +08:00
|
|
|
lockdep_assert_held_write(&kvm->mmu_lock);
|
2020-10-15 02:26:44 +08:00
|
|
|
|
2021-12-15 09:15:55 +08:00
|
|
|
/*
|
|
|
|
* Check for an existing root before allocating a new one. Note, the
|
|
|
|
* role check prevents consuming an invalid root.
|
|
|
|
*/
|
2021-03-26 10:19:45 +08:00
|
|
|
for_each_tdp_mmu_root(kvm, root, kvm_mmu_role_as_id(role)) {
|
2021-04-02 07:37:30 +08:00
|
|
|
if (root->role.word == role.word &&
|
2022-01-25 17:58:54 +08:00
|
|
|
kvm_tdp_mmu_get_root(root))
|
2021-03-05 09:10:50 +08:00
|
|
|
goto out;
|
2020-10-15 02:26:44 +08:00
|
|
|
}
|
|
|
|
|
2022-01-20 07:07:35 +08:00
|
|
|
root = tdp_mmu_alloc_sp(vcpu);
|
2022-02-26 08:15:31 +08:00
|
|
|
tdp_mmu_init_sp(root, NULL, 0, role);
|
2022-01-20 07:07:35 +08:00
|
|
|
|
KVM: x86: Preserve TDP MMU roots until they are explicitly invalidated
Preserve TDP MMU roots until they are explicitly invalidated by gifting
the TDP MMU itself a reference to a root when it is allocated. Keeping a
reference in the TDP MMU fixes a flaw where the TDP MMU exhibits terrible
performance, and can potentially even soft-hang a vCPU, if a vCPU
frequently unloads its roots, e.g. when KVM is emulating SMI+RSM.
When KVM emulates something that invalidates _all_ TLB entries, e.g. SMI
and RSM, KVM unloads all of the vCPUs roots (KVM keeps a small per-vCPU
cache of previous roots). Unloading roots is a simple way to ensure KVM
flushes and synchronizes all roots for the vCPU, as KVM flushes and syncs
when allocating a "new" root (from the vCPU's perspective).
In the shadow MMU, KVM keeps track of all shadow pages, roots included, in
a per-VM hash table. Unloading a shadow MMU root just wipes it from the
per-vCPU cache; the root is still tracked in the per-VM hash table. When
KVM loads a "new" root for the vCPU, KVM will find the old, unloaded root
in the per-VM hash table.
Unlike the shadow MMU, the TDP MMU doesn't track "inactive" roots in a
per-VM structure, where "active" in this case means a root is either
in-use or cached as a previous root by at least one vCPU. When a TDP MMU
root becomes inactive, i.e. the last vCPU reference to the root is put,
KVM immediately frees the root (asterisk on "immediately" as the actual
freeing may be done by a worker, but for all intents and purposes the root
is gone).
The TDP MMU behavior is especially problematic for 1-vCPU setups, as
unloading all roots effectively frees all roots. The issue is mitigated
to some degree in multi-vCPU setups as a different vCPU usually holds a
reference to an unloaded root and thus keeps the root alive, allowing the
vCPU to reuse its old root after unloading (with a flush+sync).
The TDP MMU flaw has been known for some time, as until very recently,
KVM's handling of CR0.WP also triggered unloading of all roots. The
CR0.WP toggling scenario was eventually addressed by not unloading roots
when _only_ CR0.WP is toggled, but such an approach doesn't Just Work
for emulating SMM as KVM must emulate a full TLB flush on entry and exit
to/from SMM. Given that the shadow MMU plays nice with unloading roots
at will, teaching the TDP MMU to do the same is far less complex than
modifying KVM to track which roots need to be flushed before reuse.
Note, preserving all possible TDP MMU roots is not a concern with respect
to memory consumption. Now that the role for direct MMUs doesn't include
information about the guest, e.g. CR0.PG, CR0.WP, CR4.SMEP, etc., there
are _at most_ six possible roots (where "guest_mode" here means L2):
1. 4-level !SMM !guest_mode
2. 4-level SMM !guest_mode
3. 5-level !SMM !guest_mode
4. 5-level SMM !guest_mode
5. 4-level !SMM guest_mode
6. 5-level !SMM guest_mode
And because each vCPU can track 4 valid roots, a VM can already have all
6 root combinations live at any given time. Not to mention that, in
practice, no sane VMM will advertise different guest.MAXPHYADDR values
across vCPUs, i.e. KVM won't ever use both 4-level and 5-level roots for
a single VM. Furthermore, the vast majority of modern hypervisors will
utilize EPT/NPT when available, thus the guest_mode=%true cases are also
unlikely to be utilized.
Reported-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Link: https://lore.kernel.org/all/959c5bce-beb5-b463-7158-33fc4a4f910c@linux.microsoft.com
Link: https://lkml.kernel.org/r/20220209170020.1775368-1-pbonzini%40redhat.com
Link: https://lore.kernel.org/all/20230322013731.102955-1-minipli@grsecurity.net
Link: https://lore.kernel.org/all/000000000000a0bc2b05f9dd7fab@google.com
Link: https://lore.kernel.org/all/000000000000eca0b905fa0f7756@google.com
Cc: Ben Gardon <bgardon@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: stable@vger.kernel.org
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Link: https://lore.kernel.org/r/20230426220323.3079789-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-27 06:03:23 +08:00
|
|
|
/*
|
|
|
|
* TDP MMU roots are kept until they are explicitly invalidated, either
|
|
|
|
* by a memslot update or by the destruction of the VM. Initialize the
|
|
|
|
* refcount to two; one reference for the vCPU, and one reference for
|
|
|
|
* the TDP MMU itself, which is held until the root is invalidated and
|
|
|
|
* is ultimately put by tdp_mmu_zap_root_work().
|
|
|
|
*/
|
|
|
|
refcount_set(&root->tdp_mmu_root_count, 2);
|
2020-10-15 02:26:44 +08:00
|
|
|
|
2021-04-02 07:37:31 +08:00
|
|
|
spin_lock(&kvm->arch.tdp_mmu_pages_lock);
|
|
|
|
list_add_rcu(&root->link, &kvm->arch.tdp_mmu_roots);
|
|
|
|
spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
|
2020-10-15 02:26:44 +08:00
|
|
|
|
2021-03-05 09:10:50 +08:00
|
|
|
out:
|
2020-10-15 02:26:44 +08:00
|
|
|
return __pa(root->spt);
|
2020-10-15 02:26:43 +08:00
|
|
|
}
|
2020-10-15 02:26:45 +08:00
|
|
|
|
|
|
|
static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
|
2021-02-03 02:57:26 +08:00
|
|
|
u64 old_spte, u64 new_spte, int level,
|
|
|
|
bool shared);
|
2020-10-15 02:26:45 +08:00
|
|
|
|
2022-08-23 08:46:37 +08:00
|
|
|
static void tdp_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
|
|
|
|
{
|
|
|
|
kvm_account_pgtable_pages((void *)sp->spt, +1);
|
2022-10-20 00:56:15 +08:00
|
|
|
atomic64_inc(&kvm->arch.tdp_mmu_pages);
|
2022-08-23 08:46:37 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static void tdp_unaccount_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
|
|
|
|
{
|
|
|
|
kvm_account_pgtable_pages((void *)sp->spt, -1);
|
2022-10-20 00:56:15 +08:00
|
|
|
atomic64_dec(&kvm->arch.tdp_mmu_pages);
|
2022-08-23 08:46:37 +08:00
|
|
|
}
|
|
|
|
|
2021-02-03 02:57:25 +08:00
|
|
|
/**
|
2022-01-20 07:07:26 +08:00
|
|
|
* tdp_mmu_unlink_sp() - Remove a shadow page from the list of used pages
|
2021-02-03 02:57:25 +08:00
|
|
|
*
|
|
|
|
* @kvm: kvm instance
|
|
|
|
* @sp: the page to be removed
|
2021-02-03 02:57:26 +08:00
|
|
|
* @shared: This operation may not be running under the exclusive use of
|
|
|
|
* the MMU lock and the operation must synchronize with other
|
|
|
|
* threads that might be adding or removing pages.
|
2021-02-03 02:57:25 +08:00
|
|
|
*/
|
2022-01-20 07:07:26 +08:00
|
|
|
static void tdp_mmu_unlink_sp(struct kvm *kvm, struct kvm_mmu_page *sp,
|
|
|
|
bool shared)
|
2021-02-03 02:57:25 +08:00
|
|
|
{
|
2022-08-23 08:46:37 +08:00
|
|
|
tdp_unaccount_mmu_page(kvm, sp);
|
2022-10-20 00:56:15 +08:00
|
|
|
|
|
|
|
if (!sp->nx_huge_page_disallowed)
|
|
|
|
return;
|
|
|
|
|
2021-02-03 02:57:26 +08:00
|
|
|
if (shared)
|
|
|
|
spin_lock(&kvm->arch.tdp_mmu_pages_lock);
|
|
|
|
else
|
|
|
|
lockdep_assert_held_write(&kvm->mmu_lock);
|
2021-02-03 02:57:25 +08:00
|
|
|
|
2022-10-20 00:56:15 +08:00
|
|
|
sp->nx_huge_page_disallowed = false;
|
|
|
|
untrack_possible_nx_huge_page(kvm, sp);
|
2021-02-03 02:57:26 +08:00
|
|
|
|
|
|
|
if (shared)
|
|
|
|
spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
|
2021-02-03 02:57:25 +08:00
|
|
|
}
|
|
|
|
|
2021-02-03 02:57:11 +08:00
|
|
|
/**
|
2022-01-20 07:07:27 +08:00
|
|
|
* handle_removed_pt() - handle a page table removed from the TDP structure
|
2021-02-03 02:57:11 +08:00
|
|
|
*
|
|
|
|
* @kvm: kvm instance
|
|
|
|
* @pt: the page removed from the paging structure
|
2021-02-03 02:57:26 +08:00
|
|
|
* @shared: This operation may not be running under the exclusive use
|
|
|
|
* of the MMU lock and the operation must synchronize with other
|
|
|
|
* threads that might be modifying SPTEs.
|
2021-02-03 02:57:11 +08:00
|
|
|
*
|
|
|
|
* Given a page table that has been removed from the TDP paging structure,
|
|
|
|
* iterates through the page table to clear SPTEs and free child page tables.
|
2021-03-16 07:38:00 +08:00
|
|
|
*
|
|
|
|
* Note that pt is passed in as a tdp_ptep_t, but it does not need RCU
|
|
|
|
* protection. Since this thread removed it from the paging structure,
|
|
|
|
* this thread will be responsible for ensuring the page is freed. Hence the
|
|
|
|
* early rcu_dereferences in the function.
|
2021-02-03 02:57:11 +08:00
|
|
|
*/
|
2022-01-20 07:07:27 +08:00
|
|
|
static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
|
2021-02-03 02:57:11 +08:00
|
|
|
{
|
2021-03-16 07:38:00 +08:00
|
|
|
struct kvm_mmu_page *sp = sptep_to_sp(rcu_dereference(pt));
|
2021-02-03 02:57:11 +08:00
|
|
|
int level = sp->role.level;
|
2021-02-03 02:57:28 +08:00
|
|
|
gfn_t base_gfn = sp->gfn;
|
2021-02-03 02:57:11 +08:00
|
|
|
int i;
|
|
|
|
|
|
|
|
trace_kvm_mmu_prepare_zap_page(sp);
|
|
|
|
|
2022-01-20 07:07:26 +08:00
|
|
|
tdp_mmu_unlink_sp(kvm, sp, shared);
|
2021-02-03 02:57:11 +08:00
|
|
|
|
2022-06-15 07:33:25 +08:00
|
|
|
for (i = 0; i < SPTE_ENT_PER_PAGE; i++) {
|
KVM: x86/mmu: Use atomic XCHG to write TDP MMU SPTEs with volatile bits
Use an atomic XCHG to write TDP MMU SPTEs that have volatile bits, even
if mmu_lock is held for write, as volatile SPTEs can be written by other
tasks/vCPUs outside of mmu_lock. If a vCPU uses the to-be-modified SPTE
to write a page, the CPU can cache the translation as WRITABLE in the TLB
despite it being seen by KVM as !WRITABLE, and/or KVM can clobber the
Accessed/Dirty bits and not properly tag the backing page.
Exempt non-leaf SPTEs from atomic updates as KVM itself doesn't modify
non-leaf SPTEs without holding mmu_lock, they do not have Dirty bits, and
KVM doesn't consume the Accessed bit of non-leaf SPTEs.
Dropping the Dirty and/or Writable bits is most problematic for dirty
logging, as doing so can result in a missed TLB flush and eventually a
missed dirty page. In the unlikely event that the only dirty page(s) is
a clobbered SPTE, clear_dirty_gfn_range() will see the SPTE as not dirty
(based on the Dirty or Writable bit depending on the method) and so not
update the SPTE and ultimately not flush. If the SPTE is cached in the
TLB as writable before it is clobbered, the guest can continue writing
the associated page without ever taking a write-protect fault.
For most (all?) file back memory, dropping the Dirty bit is a non-issue.
The primary MMU write-protects its PTEs on writeback, i.e. KVM's dirty
bit is effectively ignored because the primary MMU will mark that page
dirty when the write-protection is lifted, e.g. when KVM faults the page
back in for write.
The Accessed bit is a complete non-issue. Aside from being unused for
non-leaf SPTEs, KVM doesn't do a TLB flush when aging SPTEs, i.e. the
Accessed bit may be dropped anyways.
Lastly, the Writable bit is also problematic as an extension of the Dirty
bit, as KVM (correctly) treats the Dirty bit as volatile iff the SPTE is
!DIRTY && WRITABLE. If KVM fixes an MMU-writable, but !WRITABLE, SPTE
out of mmu_lock, then it can allow the CPU to set the Dirty bit despite
the SPTE being !WRITABLE when it is checked by KVM. But that all depends
on the Dirty bit being problematic in the first place.
Fixes: 2f2fad0897cb ("kvm: x86/mmu: Add functions to handle changed TDP SPTEs")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: Venkatesh Srinivas <venkateshs@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-23 11:47:43 +08:00
|
|
|
tdp_ptep_t sptep = pt + i;
|
2021-11-16 05:17:04 +08:00
|
|
|
gfn_t gfn = base_gfn + i * KVM_PAGES_PER_HPAGE(level);
|
KVM: x86/mmu: Use atomic XCHG to write TDP MMU SPTEs with volatile bits
Use an atomic XCHG to write TDP MMU SPTEs that have volatile bits, even
if mmu_lock is held for write, as volatile SPTEs can be written by other
tasks/vCPUs outside of mmu_lock. If a vCPU uses the to-be-modified SPTE
to write a page, the CPU can cache the translation as WRITABLE in the TLB
despite it being seen by KVM as !WRITABLE, and/or KVM can clobber the
Accessed/Dirty bits and not properly tag the backing page.
Exempt non-leaf SPTEs from atomic updates as KVM itself doesn't modify
non-leaf SPTEs without holding mmu_lock, they do not have Dirty bits, and
KVM doesn't consume the Accessed bit of non-leaf SPTEs.
Dropping the Dirty and/or Writable bits is most problematic for dirty
logging, as doing so can result in a missed TLB flush and eventually a
missed dirty page. In the unlikely event that the only dirty page(s) is
a clobbered SPTE, clear_dirty_gfn_range() will see the SPTE as not dirty
(based on the Dirty or Writable bit depending on the method) and so not
update the SPTE and ultimately not flush. If the SPTE is cached in the
TLB as writable before it is clobbered, the guest can continue writing
the associated page without ever taking a write-protect fault.
For most (all?) file back memory, dropping the Dirty bit is a non-issue.
The primary MMU write-protects its PTEs on writeback, i.e. KVM's dirty
bit is effectively ignored because the primary MMU will mark that page
dirty when the write-protection is lifted, e.g. when KVM faults the page
back in for write.
The Accessed bit is a complete non-issue. Aside from being unused for
non-leaf SPTEs, KVM doesn't do a TLB flush when aging SPTEs, i.e. the
Accessed bit may be dropped anyways.
Lastly, the Writable bit is also problematic as an extension of the Dirty
bit, as KVM (correctly) treats the Dirty bit as volatile iff the SPTE is
!DIRTY && WRITABLE. If KVM fixes an MMU-writable, but !WRITABLE, SPTE
out of mmu_lock, then it can allow the CPU to set the Dirty bit despite
the SPTE being !WRITABLE when it is checked by KVM. But that all depends
on the Dirty bit being problematic in the first place.
Fixes: 2f2fad0897cb ("kvm: x86/mmu: Add functions to handle changed TDP SPTEs")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: Venkatesh Srinivas <venkateshs@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-23 11:47:43 +08:00
|
|
|
u64 old_spte;
|
2021-02-03 02:57:26 +08:00
|
|
|
|
|
|
|
if (shared) {
|
2021-02-03 02:57:28 +08:00
|
|
|
/*
|
|
|
|
* Set the SPTE to a nonpresent value that other
|
|
|
|
* threads will not overwrite. If the SPTE was
|
|
|
|
* already marked as removed then another thread
|
|
|
|
* handling a page fault could overwrite it, so
|
|
|
|
* set the SPTE until it is set from some other
|
|
|
|
* value to the removed SPTE value.
|
|
|
|
*/
|
|
|
|
for (;;) {
|
KVM: x86/mmu: Use atomic XCHG to write TDP MMU SPTEs with volatile bits
Use an atomic XCHG to write TDP MMU SPTEs that have volatile bits, even
if mmu_lock is held for write, as volatile SPTEs can be written by other
tasks/vCPUs outside of mmu_lock. If a vCPU uses the to-be-modified SPTE
to write a page, the CPU can cache the translation as WRITABLE in the TLB
despite it being seen by KVM as !WRITABLE, and/or KVM can clobber the
Accessed/Dirty bits and not properly tag the backing page.
Exempt non-leaf SPTEs from atomic updates as KVM itself doesn't modify
non-leaf SPTEs without holding mmu_lock, they do not have Dirty bits, and
KVM doesn't consume the Accessed bit of non-leaf SPTEs.
Dropping the Dirty and/or Writable bits is most problematic for dirty
logging, as doing so can result in a missed TLB flush and eventually a
missed dirty page. In the unlikely event that the only dirty page(s) is
a clobbered SPTE, clear_dirty_gfn_range() will see the SPTE as not dirty
(based on the Dirty or Writable bit depending on the method) and so not
update the SPTE and ultimately not flush. If the SPTE is cached in the
TLB as writable before it is clobbered, the guest can continue writing
the associated page without ever taking a write-protect fault.
For most (all?) file back memory, dropping the Dirty bit is a non-issue.
The primary MMU write-protects its PTEs on writeback, i.e. KVM's dirty
bit is effectively ignored because the primary MMU will mark that page
dirty when the write-protection is lifted, e.g. when KVM faults the page
back in for write.
The Accessed bit is a complete non-issue. Aside from being unused for
non-leaf SPTEs, KVM doesn't do a TLB flush when aging SPTEs, i.e. the
Accessed bit may be dropped anyways.
Lastly, the Writable bit is also problematic as an extension of the Dirty
bit, as KVM (correctly) treats the Dirty bit as volatile iff the SPTE is
!DIRTY && WRITABLE. If KVM fixes an MMU-writable, but !WRITABLE, SPTE
out of mmu_lock, then it can allow the CPU to set the Dirty bit despite
the SPTE being !WRITABLE when it is checked by KVM. But that all depends
on the Dirty bit being problematic in the first place.
Fixes: 2f2fad0897cb ("kvm: x86/mmu: Add functions to handle changed TDP SPTEs")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: Venkatesh Srinivas <venkateshs@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-23 11:47:43 +08:00
|
|
|
old_spte = kvm_tdp_mmu_write_spte_atomic(sptep, REMOVED_SPTE);
|
|
|
|
if (!is_removed_spte(old_spte))
|
2021-02-03 02:57:28 +08:00
|
|
|
break;
|
|
|
|
cpu_relax();
|
|
|
|
}
|
2021-02-03 02:57:26 +08:00
|
|
|
} else {
|
2021-03-10 08:30:29 +08:00
|
|
|
/*
|
|
|
|
* If the SPTE is not MMU-present, there is no backing
|
|
|
|
* page associated with the SPTE and so no side effects
|
|
|
|
* that need to be recorded, and exclusive ownership of
|
|
|
|
* mmu_lock ensures the SPTE can't be made present.
|
|
|
|
* Note, zapping MMIO SPTEs is also unnecessary as they
|
|
|
|
* are guarded by the memslots generation, not by being
|
|
|
|
* unreachable.
|
|
|
|
*/
|
KVM: x86/mmu: Use atomic XCHG to write TDP MMU SPTEs with volatile bits
Use an atomic XCHG to write TDP MMU SPTEs that have volatile bits, even
if mmu_lock is held for write, as volatile SPTEs can be written by other
tasks/vCPUs outside of mmu_lock. If a vCPU uses the to-be-modified SPTE
to write a page, the CPU can cache the translation as WRITABLE in the TLB
despite it being seen by KVM as !WRITABLE, and/or KVM can clobber the
Accessed/Dirty bits and not properly tag the backing page.
Exempt non-leaf SPTEs from atomic updates as KVM itself doesn't modify
non-leaf SPTEs without holding mmu_lock, they do not have Dirty bits, and
KVM doesn't consume the Accessed bit of non-leaf SPTEs.
Dropping the Dirty and/or Writable bits is most problematic for dirty
logging, as doing so can result in a missed TLB flush and eventually a
missed dirty page. In the unlikely event that the only dirty page(s) is
a clobbered SPTE, clear_dirty_gfn_range() will see the SPTE as not dirty
(based on the Dirty or Writable bit depending on the method) and so not
update the SPTE and ultimately not flush. If the SPTE is cached in the
TLB as writable before it is clobbered, the guest can continue writing
the associated page without ever taking a write-protect fault.
For most (all?) file back memory, dropping the Dirty bit is a non-issue.
The primary MMU write-protects its PTEs on writeback, i.e. KVM's dirty
bit is effectively ignored because the primary MMU will mark that page
dirty when the write-protection is lifted, e.g. when KVM faults the page
back in for write.
The Accessed bit is a complete non-issue. Aside from being unused for
non-leaf SPTEs, KVM doesn't do a TLB flush when aging SPTEs, i.e. the
Accessed bit may be dropped anyways.
Lastly, the Writable bit is also problematic as an extension of the Dirty
bit, as KVM (correctly) treats the Dirty bit as volatile iff the SPTE is
!DIRTY && WRITABLE. If KVM fixes an MMU-writable, but !WRITABLE, SPTE
out of mmu_lock, then it can allow the CPU to set the Dirty bit despite
the SPTE being !WRITABLE when it is checked by KVM. But that all depends
on the Dirty bit being problematic in the first place.
Fixes: 2f2fad0897cb ("kvm: x86/mmu: Add functions to handle changed TDP SPTEs")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: Venkatesh Srinivas <venkateshs@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-23 11:47:43 +08:00
|
|
|
old_spte = kvm_tdp_mmu_read_spte(sptep);
|
|
|
|
if (!is_shadow_present_pte(old_spte))
|
2021-03-10 08:30:29 +08:00
|
|
|
continue;
|
2021-02-03 02:57:28 +08:00
|
|
|
|
|
|
|
/*
|
KVM: x86/mmu: Use atomic XCHG to write TDP MMU SPTEs with volatile bits
Use an atomic XCHG to write TDP MMU SPTEs that have volatile bits, even
if mmu_lock is held for write, as volatile SPTEs can be written by other
tasks/vCPUs outside of mmu_lock. If a vCPU uses the to-be-modified SPTE
to write a page, the CPU can cache the translation as WRITABLE in the TLB
despite it being seen by KVM as !WRITABLE, and/or KVM can clobber the
Accessed/Dirty bits and not properly tag the backing page.
Exempt non-leaf SPTEs from atomic updates as KVM itself doesn't modify
non-leaf SPTEs without holding mmu_lock, they do not have Dirty bits, and
KVM doesn't consume the Accessed bit of non-leaf SPTEs.
Dropping the Dirty and/or Writable bits is most problematic for dirty
logging, as doing so can result in a missed TLB flush and eventually a
missed dirty page. In the unlikely event that the only dirty page(s) is
a clobbered SPTE, clear_dirty_gfn_range() will see the SPTE as not dirty
(based on the Dirty or Writable bit depending on the method) and so not
update the SPTE and ultimately not flush. If the SPTE is cached in the
TLB as writable before it is clobbered, the guest can continue writing
the associated page without ever taking a write-protect fault.
For most (all?) file back memory, dropping the Dirty bit is a non-issue.
The primary MMU write-protects its PTEs on writeback, i.e. KVM's dirty
bit is effectively ignored because the primary MMU will mark that page
dirty when the write-protection is lifted, e.g. when KVM faults the page
back in for write.
The Accessed bit is a complete non-issue. Aside from being unused for
non-leaf SPTEs, KVM doesn't do a TLB flush when aging SPTEs, i.e. the
Accessed bit may be dropped anyways.
Lastly, the Writable bit is also problematic as an extension of the Dirty
bit, as KVM (correctly) treats the Dirty bit as volatile iff the SPTE is
!DIRTY && WRITABLE. If KVM fixes an MMU-writable, but !WRITABLE, SPTE
out of mmu_lock, then it can allow the CPU to set the Dirty bit despite
the SPTE being !WRITABLE when it is checked by KVM. But that all depends
on the Dirty bit being problematic in the first place.
Fixes: 2f2fad0897cb ("kvm: x86/mmu: Add functions to handle changed TDP SPTEs")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: Venkatesh Srinivas <venkateshs@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-23 11:47:43 +08:00
|
|
|
* Use the common helper instead of a raw WRITE_ONCE as
|
|
|
|
* the SPTE needs to be updated atomically if it can be
|
|
|
|
* modified by a different vCPU outside of mmu_lock.
|
|
|
|
* Even though the parent SPTE is !PRESENT, the TLB
|
|
|
|
* hasn't yet been flushed, and both Intel and AMD
|
|
|
|
* document that A/D assists can use upper-level PxE
|
|
|
|
* entries that are cached in the TLB, i.e. the CPU can
|
|
|
|
* still access the page and mark it dirty.
|
|
|
|
*
|
|
|
|
* No retry is needed in the atomic update path as the
|
|
|
|
* sole concern is dropping a Dirty bit, i.e. no other
|
|
|
|
* task can zap/remove the SPTE as mmu_lock is held for
|
|
|
|
* write. Marking the SPTE as a removed SPTE is not
|
|
|
|
* strictly necessary for the same reason, but using
|
|
|
|
* the remove SPTE value keeps the shared/exclusive
|
|
|
|
* paths consistent and allows the handle_changed_spte()
|
|
|
|
* call below to hardcode the new value to REMOVED_SPTE.
|
|
|
|
*
|
|
|
|
* Note, even though dropping a Dirty bit is the only
|
|
|
|
* scenario where a non-atomic update could result in a
|
|
|
|
* functional bug, simply checking the Dirty bit isn't
|
|
|
|
* sufficient as a fast page fault could read the upper
|
|
|
|
* level SPTE before it is zapped, and then make this
|
|
|
|
* target SPTE writable, resume the guest, and set the
|
|
|
|
* Dirty bit between reading the SPTE above and writing
|
|
|
|
* it here.
|
2021-02-03 02:57:28 +08:00
|
|
|
*/
|
KVM: x86/mmu: Use atomic XCHG to write TDP MMU SPTEs with volatile bits
Use an atomic XCHG to write TDP MMU SPTEs that have volatile bits, even
if mmu_lock is held for write, as volatile SPTEs can be written by other
tasks/vCPUs outside of mmu_lock. If a vCPU uses the to-be-modified SPTE
to write a page, the CPU can cache the translation as WRITABLE in the TLB
despite it being seen by KVM as !WRITABLE, and/or KVM can clobber the
Accessed/Dirty bits and not properly tag the backing page.
Exempt non-leaf SPTEs from atomic updates as KVM itself doesn't modify
non-leaf SPTEs without holding mmu_lock, they do not have Dirty bits, and
KVM doesn't consume the Accessed bit of non-leaf SPTEs.
Dropping the Dirty and/or Writable bits is most problematic for dirty
logging, as doing so can result in a missed TLB flush and eventually a
missed dirty page. In the unlikely event that the only dirty page(s) is
a clobbered SPTE, clear_dirty_gfn_range() will see the SPTE as not dirty
(based on the Dirty or Writable bit depending on the method) and so not
update the SPTE and ultimately not flush. If the SPTE is cached in the
TLB as writable before it is clobbered, the guest can continue writing
the associated page without ever taking a write-protect fault.
For most (all?) file back memory, dropping the Dirty bit is a non-issue.
The primary MMU write-protects its PTEs on writeback, i.e. KVM's dirty
bit is effectively ignored because the primary MMU will mark that page
dirty when the write-protection is lifted, e.g. when KVM faults the page
back in for write.
The Accessed bit is a complete non-issue. Aside from being unused for
non-leaf SPTEs, KVM doesn't do a TLB flush when aging SPTEs, i.e. the
Accessed bit may be dropped anyways.
Lastly, the Writable bit is also problematic as an extension of the Dirty
bit, as KVM (correctly) treats the Dirty bit as volatile iff the SPTE is
!DIRTY && WRITABLE. If KVM fixes an MMU-writable, but !WRITABLE, SPTE
out of mmu_lock, then it can allow the CPU to set the Dirty bit despite
the SPTE being !WRITABLE when it is checked by KVM. But that all depends
on the Dirty bit being problematic in the first place.
Fixes: 2f2fad0897cb ("kvm: x86/mmu: Add functions to handle changed TDP SPTEs")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: Venkatesh Srinivas <venkateshs@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-23 11:47:43 +08:00
|
|
|
old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte,
|
|
|
|
REMOVED_SPTE, level);
|
2021-02-03 02:57:26 +08:00
|
|
|
}
|
2021-02-03 02:57:28 +08:00
|
|
|
handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn,
|
KVM: x86/mmu: Use atomic XCHG to write TDP MMU SPTEs with volatile bits
Use an atomic XCHG to write TDP MMU SPTEs that have volatile bits, even
if mmu_lock is held for write, as volatile SPTEs can be written by other
tasks/vCPUs outside of mmu_lock. If a vCPU uses the to-be-modified SPTE
to write a page, the CPU can cache the translation as WRITABLE in the TLB
despite it being seen by KVM as !WRITABLE, and/or KVM can clobber the
Accessed/Dirty bits and not properly tag the backing page.
Exempt non-leaf SPTEs from atomic updates as KVM itself doesn't modify
non-leaf SPTEs without holding mmu_lock, they do not have Dirty bits, and
KVM doesn't consume the Accessed bit of non-leaf SPTEs.
Dropping the Dirty and/or Writable bits is most problematic for dirty
logging, as doing so can result in a missed TLB flush and eventually a
missed dirty page. In the unlikely event that the only dirty page(s) is
a clobbered SPTE, clear_dirty_gfn_range() will see the SPTE as not dirty
(based on the Dirty or Writable bit depending on the method) and so not
update the SPTE and ultimately not flush. If the SPTE is cached in the
TLB as writable before it is clobbered, the guest can continue writing
the associated page without ever taking a write-protect fault.
For most (all?) file back memory, dropping the Dirty bit is a non-issue.
The primary MMU write-protects its PTEs on writeback, i.e. KVM's dirty
bit is effectively ignored because the primary MMU will mark that page
dirty when the write-protection is lifted, e.g. when KVM faults the page
back in for write.
The Accessed bit is a complete non-issue. Aside from being unused for
non-leaf SPTEs, KVM doesn't do a TLB flush when aging SPTEs, i.e. the
Accessed bit may be dropped anyways.
Lastly, the Writable bit is also problematic as an extension of the Dirty
bit, as KVM (correctly) treats the Dirty bit as volatile iff the SPTE is
!DIRTY && WRITABLE. If KVM fixes an MMU-writable, but !WRITABLE, SPTE
out of mmu_lock, then it can allow the CPU to set the Dirty bit despite
the SPTE being !WRITABLE when it is checked by KVM. But that all depends
on the Dirty bit being problematic in the first place.
Fixes: 2f2fad0897cb ("kvm: x86/mmu: Add functions to handle changed TDP SPTEs")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: Venkatesh Srinivas <venkateshs@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-23 11:47:43 +08:00
|
|
|
old_spte, REMOVED_SPTE, level, shared);
|
2021-02-03 02:57:11 +08:00
|
|
|
}
|
|
|
|
|
2021-02-03 02:57:23 +08:00
|
|
|
call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
|
2021-02-03 02:57:11 +08:00
|
|
|
}
|
|
|
|
|
2020-10-15 02:26:45 +08:00
|
|
|
/**
|
2023-03-22 06:00:21 +08:00
|
|
|
* handle_changed_spte - handle bookkeeping associated with an SPTE change
|
2020-10-15 02:26:45 +08:00
|
|
|
* @kvm: kvm instance
|
|
|
|
* @as_id: the address space of the paging structure the SPTE was a part of
|
|
|
|
* @gfn: the base GFN that was mapped by the SPTE
|
|
|
|
* @old_spte: The value of the SPTE before the change
|
|
|
|
* @new_spte: The value of the SPTE after the change
|
|
|
|
* @level: the level of the PT the SPTE is part of in the paging structure
|
2021-02-03 02:57:26 +08:00
|
|
|
* @shared: This operation may not be running under the exclusive use of
|
|
|
|
* the MMU lock and the operation must synchronize with other
|
|
|
|
* threads that might be modifying SPTEs.
|
2020-10-15 02:26:45 +08:00
|
|
|
*
|
2023-03-22 06:00:20 +08:00
|
|
|
* Handle bookkeeping that might result from the modification of a SPTE. Note,
|
|
|
|
* dirty logging updates are handled in common code, not here (see make_spte()
|
|
|
|
* and fast_pf_fix_direct_spte()).
|
2020-10-15 02:26:45 +08:00
|
|
|
*/
|
2023-03-22 06:00:21 +08:00
|
|
|
static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
|
|
|
|
u64 old_spte, u64 new_spte, int level,
|
|
|
|
bool shared)
|
2020-10-15 02:26:45 +08:00
|
|
|
{
|
|
|
|
bool was_present = is_shadow_present_pte(old_spte);
|
|
|
|
bool is_present = is_shadow_present_pte(new_spte);
|
|
|
|
bool was_leaf = was_present && is_last_spte(old_spte, level);
|
|
|
|
bool is_leaf = is_present && is_last_spte(new_spte, level);
|
|
|
|
bool pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte);
|
|
|
|
|
KVM: x86/mmu: Convert "runtime" WARN_ON() assertions to WARN_ON_ONCE()
Convert all "runtime" assertions, i.e. assertions that can be triggered
while running vCPUs, from WARN_ON() to WARN_ON_ONCE(). Every WARN in the
MMU that is tied to running vCPUs, i.e. not contained to loading and
initializing KVM, is likely to fire _a lot_ when it does trigger. E.g. if
KVM ends up with a bug that causes a root to be invalidated before the
page fault handler is invoked, pretty much _every_ page fault VM-Exit
triggers the WARN.
If a WARN is triggered frequently, the resulting spam usually causes a lot
of damage of its own, e.g. consumes resources to log the WARN and pollutes
the kernel log, often to the point where other useful information can be
lost. In many case, the damage caused by the spam is actually worse than
the bug itself, e.g. KVM can almost always recover from an unexpectedly
invalid root.
On the flip side, warning every time is rarely helpful for debug and
triage, i.e. a single splat is usually sufficient to point a debugger in
the right direction, and automated testing, e.g. syzkaller, typically runs
with warn_on_panic=1, i.e. will never get past the first WARN anyways.
Lastly, when an assertions fails multiple times, the stack traces in KVM
are almost always identical, i.e. the full splat only needs to be captured
once. And _if_ there is value in captruing information about the failed
assert, a ratelimited printk() is sufficient and less likely to rack up a
large amount of collateral damage.
Link: https://lore.kernel.org/r/20230729004722.1056172-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-29 08:47:17 +08:00
|
|
|
WARN_ON_ONCE(level > PT64_ROOT_MAX_LEVEL);
|
|
|
|
WARN_ON_ONCE(level < PG_LEVEL_4K);
|
|
|
|
WARN_ON_ONCE(gfn & (KVM_PAGES_PER_HPAGE(level) - 1));
|
2020-10-15 02:26:45 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If this warning were to trigger it would indicate that there was a
|
|
|
|
* missing MMU notifier or a race with some notifier handler.
|
|
|
|
* A present, leaf SPTE should never be directly replaced with another
|
2021-03-18 22:28:01 +08:00
|
|
|
* present leaf SPTE pointing to a different PFN. A notifier handler
|
2020-10-15 02:26:45 +08:00
|
|
|
* should be zapping the SPTE before the main MM's page table is
|
|
|
|
* changed, or the SPTE should be zeroed, and the TLBs flushed by the
|
|
|
|
* thread before replacement.
|
|
|
|
*/
|
|
|
|
if (was_leaf && is_leaf && pfn_changed) {
|
|
|
|
pr_err("Invalid SPTE change: cannot replace a present leaf\n"
|
|
|
|
"SPTE with another present leaf SPTE mapping a\n"
|
|
|
|
"different PFN!\n"
|
|
|
|
"as_id: %d gfn: %llx old_spte: %llx new_spte: %llx level: %d",
|
|
|
|
as_id, gfn, old_spte, new_spte, level);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Crash the host to prevent error propagation and guest data
|
2021-03-18 22:28:01 +08:00
|
|
|
* corruption.
|
2020-10-15 02:26:45 +08:00
|
|
|
*/
|
|
|
|
BUG();
|
|
|
|
}
|
|
|
|
|
|
|
|
if (old_spte == new_spte)
|
|
|
|
return;
|
|
|
|
|
2020-10-28 01:59:44 +08:00
|
|
|
trace_kvm_tdp_mmu_spte_changed(as_id, gfn, level, old_spte, new_spte);
|
|
|
|
|
2022-01-26 07:05:15 +08:00
|
|
|
if (is_leaf)
|
|
|
|
check_spte_writable_invariants(new_spte);
|
|
|
|
|
2020-10-15 02:26:45 +08:00
|
|
|
/*
|
|
|
|
* The only times a SPTE should be changed from a non-present to
|
|
|
|
* non-present state is when an MMIO entry is installed/modified/
|
|
|
|
* removed. In that case, there is nothing to do here.
|
|
|
|
*/
|
|
|
|
if (!was_present && !is_present) {
|
|
|
|
/*
|
2021-02-03 02:57:27 +08:00
|
|
|
* If this change does not involve a MMIO SPTE or removed SPTE,
|
|
|
|
* it is unexpected. Log the change, though it should not
|
|
|
|
* impact the guest since both the former and current SPTEs
|
|
|
|
* are nonpresent.
|
2020-10-15 02:26:45 +08:00
|
|
|
*/
|
KVM: x86/mmu: Convert "runtime" WARN_ON() assertions to WARN_ON_ONCE()
Convert all "runtime" assertions, i.e. assertions that can be triggered
while running vCPUs, from WARN_ON() to WARN_ON_ONCE(). Every WARN in the
MMU that is tied to running vCPUs, i.e. not contained to loading and
initializing KVM, is likely to fire _a lot_ when it does trigger. E.g. if
KVM ends up with a bug that causes a root to be invalidated before the
page fault handler is invoked, pretty much _every_ page fault VM-Exit
triggers the WARN.
If a WARN is triggered frequently, the resulting spam usually causes a lot
of damage of its own, e.g. consumes resources to log the WARN and pollutes
the kernel log, often to the point where other useful information can be
lost. In many case, the damage caused by the spam is actually worse than
the bug itself, e.g. KVM can almost always recover from an unexpectedly
invalid root.
On the flip side, warning every time is rarely helpful for debug and
triage, i.e. a single splat is usually sufficient to point a debugger in
the right direction, and automated testing, e.g. syzkaller, typically runs
with warn_on_panic=1, i.e. will never get past the first WARN anyways.
Lastly, when an assertions fails multiple times, the stack traces in KVM
are almost always identical, i.e. the full splat only needs to be captured
once. And _if_ there is value in captruing information about the failed
assert, a ratelimited printk() is sufficient and less likely to rack up a
large amount of collateral damage.
Link: https://lore.kernel.org/r/20230729004722.1056172-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-29 08:47:17 +08:00
|
|
|
if (WARN_ON_ONCE(!is_mmio_spte(old_spte) &&
|
|
|
|
!is_mmio_spte(new_spte) &&
|
|
|
|
!is_removed_spte(new_spte)))
|
2020-10-15 02:26:45 +08:00
|
|
|
pr_err("Unexpected SPTE change! Nonpresent SPTEs\n"
|
|
|
|
"should not be replaced with another,\n"
|
|
|
|
"different nonpresent SPTE, unless one or both\n"
|
2021-02-03 02:57:27 +08:00
|
|
|
"are MMIO SPTEs, or the new SPTE is\n"
|
|
|
|
"a temporary removed SPTE.\n"
|
2020-10-15 02:26:45 +08:00
|
|
|
"as_id: %d gfn: %llx old_spte: %llx new_spte: %llx level: %d",
|
|
|
|
as_id, gfn, old_spte, new_spte, level);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2021-08-03 12:46:07 +08:00
|
|
|
if (is_leaf != was_leaf)
|
|
|
|
kvm_update_page_stats(kvm, level, is_leaf ? 1 : -1);
|
2020-10-15 02:26:45 +08:00
|
|
|
|
|
|
|
if (was_leaf && is_dirty_spte(old_spte) &&
|
2021-02-26 04:47:27 +08:00
|
|
|
(!is_present || !is_dirty_spte(new_spte) || pfn_changed))
|
2020-10-15 02:26:45 +08:00
|
|
|
kvm_set_pfn_dirty(spte_to_pfn(old_spte));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Recursively handle child PTs if the change removed a subtree from
|
2022-02-26 08:15:25 +08:00
|
|
|
* the paging structure. Note the WARN on the PFN changing without the
|
|
|
|
* SPTE being converted to a hugepage (leaf) or being zapped. Shadow
|
|
|
|
* pages are kernel allocations and should never be migrated.
|
2020-10-15 02:26:45 +08:00
|
|
|
*/
|
2022-02-26 08:15:25 +08:00
|
|
|
if (was_present && !was_leaf &&
|
|
|
|
(is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
|
2022-01-20 07:07:27 +08:00
|
|
|
handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
|
2020-10-15 02:26:45 +08:00
|
|
|
|
2023-03-22 06:00:21 +08:00
|
|
|
if (was_leaf && is_accessed_spte(old_spte) &&
|
|
|
|
(!is_present || !is_accessed_spte(new_spte) || pfn_changed))
|
|
|
|
kvm_set_pfn_accessed(spte_to_pfn(old_spte));
|
2020-10-15 02:26:45 +08:00
|
|
|
}
|
2020-10-15 02:26:47 +08:00
|
|
|
|
2021-02-03 02:57:26 +08:00
|
|
|
/*
|
2021-09-23 23:20:48 +08:00
|
|
|
* tdp_mmu_set_spte_atomic - Set a TDP MMU SPTE atomically
|
|
|
|
* and handle the associated bookkeeping. Do not mark the page dirty
|
2021-04-02 07:37:34 +08:00
|
|
|
* in KVM's dirty bitmaps.
|
2021-02-03 02:57:26 +08:00
|
|
|
*
|
2022-01-20 07:07:24 +08:00
|
|
|
* If setting the SPTE fails because it has changed, iter->old_spte will be
|
|
|
|
* refreshed to the current value of the spte.
|
|
|
|
*
|
2021-02-03 02:57:26 +08:00
|
|
|
* @kvm: kvm instance
|
|
|
|
* @iter: a tdp_iter instance currently on the SPTE that should be set
|
|
|
|
* @new_spte: The value the SPTE should be set to
|
2022-01-20 07:07:25 +08:00
|
|
|
* Return:
|
|
|
|
* * 0 - If the SPTE was set.
|
|
|
|
* * -EBUSY - If the SPTE cannot be set. In this case this function will have
|
|
|
|
* no side-effects other than setting iter->old_spte to the last
|
|
|
|
* known value of the spte.
|
2021-02-03 02:57:26 +08:00
|
|
|
*/
|
2022-01-20 07:07:25 +08:00
|
|
|
static inline int tdp_mmu_set_spte_atomic(struct kvm *kvm,
|
|
|
|
struct tdp_iter *iter,
|
|
|
|
u64 new_spte)
|
2021-02-03 02:57:26 +08:00
|
|
|
{
|
2022-01-20 07:07:24 +08:00
|
|
|
u64 *sptep = rcu_dereference(iter->sptep);
|
|
|
|
|
2021-02-03 02:57:27 +08:00
|
|
|
/*
|
2022-02-26 08:15:42 +08:00
|
|
|
* The caller is responsible for ensuring the old SPTE is not a REMOVED
|
|
|
|
* SPTE. KVM should never attempt to zap or manipulate a REMOVED SPTE,
|
|
|
|
* and pre-checking before inserting a new SPTE is advantageous as it
|
|
|
|
* avoids unnecessary work.
|
2021-02-03 02:57:27 +08:00
|
|
|
*/
|
2022-02-26 08:15:42 +08:00
|
|
|
WARN_ON_ONCE(iter->yielded || is_removed_spte(iter->old_spte));
|
|
|
|
|
|
|
|
lockdep_assert_held_read(&kvm->mmu_lock);
|
2021-02-03 02:57:27 +08:00
|
|
|
|
2021-07-14 06:09:55 +08:00
|
|
|
/*
|
|
|
|
* Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs and
|
2023-04-25 19:39:32 +08:00
|
|
|
* does not hold the mmu_lock. On failure, i.e. if a different logical
|
|
|
|
* CPU modified the SPTE, try_cmpxchg64() updates iter->old_spte with
|
|
|
|
* the current value, so the caller operates on fresh data, e.g. if it
|
|
|
|
* retries tdp_mmu_set_spte_atomic()
|
2021-07-14 06:09:55 +08:00
|
|
|
*/
|
2022-05-18 21:51:11 +08:00
|
|
|
if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte))
|
2022-01-20 07:07:25 +08:00
|
|
|
return -EBUSY;
|
2021-02-03 02:57:26 +08:00
|
|
|
|
2023-03-22 06:00:21 +08:00
|
|
|
handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
|
|
|
|
new_spte, iter->level, true);
|
2021-02-03 02:57:26 +08:00
|
|
|
|
2022-01-20 07:07:25 +08:00
|
|
|
return 0;
|
2021-02-03 02:57:26 +08:00
|
|
|
}
|
|
|
|
|
2022-01-20 07:07:25 +08:00
|
|
|
static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
|
|
|
|
struct tdp_iter *iter)
|
2021-02-03 02:57:27 +08:00
|
|
|
{
|
2022-01-20 07:07:25 +08:00
|
|
|
int ret;
|
|
|
|
|
2021-02-03 02:57:27 +08:00
|
|
|
/*
|
|
|
|
* Freeze the SPTE by setting it to a special,
|
|
|
|
* non-present value. This will stop other threads from
|
|
|
|
* immediately installing a present entry in its place
|
|
|
|
* before the TLBs are flushed.
|
|
|
|
*/
|
2022-01-20 07:07:25 +08:00
|
|
|
ret = tdp_mmu_set_spte_atomic(kvm, iter, REMOVED_SPTE);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
2021-02-03 02:57:27 +08:00
|
|
|
|
2022-10-10 20:19:17 +08:00
|
|
|
kvm_flush_remote_tlbs_gfn(kvm, iter->gfn, iter->level);
|
2021-02-03 02:57:27 +08:00
|
|
|
|
|
|
|
/*
|
KVM: x86/mmu: Use atomic XCHG to write TDP MMU SPTEs with volatile bits
Use an atomic XCHG to write TDP MMU SPTEs that have volatile bits, even
if mmu_lock is held for write, as volatile SPTEs can be written by other
tasks/vCPUs outside of mmu_lock. If a vCPU uses the to-be-modified SPTE
to write a page, the CPU can cache the translation as WRITABLE in the TLB
despite it being seen by KVM as !WRITABLE, and/or KVM can clobber the
Accessed/Dirty bits and not properly tag the backing page.
Exempt non-leaf SPTEs from atomic updates as KVM itself doesn't modify
non-leaf SPTEs without holding mmu_lock, they do not have Dirty bits, and
KVM doesn't consume the Accessed bit of non-leaf SPTEs.
Dropping the Dirty and/or Writable bits is most problematic for dirty
logging, as doing so can result in a missed TLB flush and eventually a
missed dirty page. In the unlikely event that the only dirty page(s) is
a clobbered SPTE, clear_dirty_gfn_range() will see the SPTE as not dirty
(based on the Dirty or Writable bit depending on the method) and so not
update the SPTE and ultimately not flush. If the SPTE is cached in the
TLB as writable before it is clobbered, the guest can continue writing
the associated page without ever taking a write-protect fault.
For most (all?) file back memory, dropping the Dirty bit is a non-issue.
The primary MMU write-protects its PTEs on writeback, i.e. KVM's dirty
bit is effectively ignored because the primary MMU will mark that page
dirty when the write-protection is lifted, e.g. when KVM faults the page
back in for write.
The Accessed bit is a complete non-issue. Aside from being unused for
non-leaf SPTEs, KVM doesn't do a TLB flush when aging SPTEs, i.e. the
Accessed bit may be dropped anyways.
Lastly, the Writable bit is also problematic as an extension of the Dirty
bit, as KVM (correctly) treats the Dirty bit as volatile iff the SPTE is
!DIRTY && WRITABLE. If KVM fixes an MMU-writable, but !WRITABLE, SPTE
out of mmu_lock, then it can allow the CPU to set the Dirty bit despite
the SPTE being !WRITABLE when it is checked by KVM. But that all depends
on the Dirty bit being problematic in the first place.
Fixes: 2f2fad0897cb ("kvm: x86/mmu: Add functions to handle changed TDP SPTEs")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: Venkatesh Srinivas <venkateshs@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-23 11:47:43 +08:00
|
|
|
* No other thread can overwrite the removed SPTE as they must either
|
|
|
|
* wait on the MMU lock or use tdp_mmu_set_spte_atomic() which will not
|
|
|
|
* overwrite the special removed SPTE value. No bookkeeping is needed
|
|
|
|
* here since the SPTE is going from non-present to non-present. Use
|
|
|
|
* the raw write helper to avoid an unnecessary check on volatile bits.
|
2021-02-03 02:57:27 +08:00
|
|
|
*/
|
KVM: x86/mmu: Use atomic XCHG to write TDP MMU SPTEs with volatile bits
Use an atomic XCHG to write TDP MMU SPTEs that have volatile bits, even
if mmu_lock is held for write, as volatile SPTEs can be written by other
tasks/vCPUs outside of mmu_lock. If a vCPU uses the to-be-modified SPTE
to write a page, the CPU can cache the translation as WRITABLE in the TLB
despite it being seen by KVM as !WRITABLE, and/or KVM can clobber the
Accessed/Dirty bits and not properly tag the backing page.
Exempt non-leaf SPTEs from atomic updates as KVM itself doesn't modify
non-leaf SPTEs without holding mmu_lock, they do not have Dirty bits, and
KVM doesn't consume the Accessed bit of non-leaf SPTEs.
Dropping the Dirty and/or Writable bits is most problematic for dirty
logging, as doing so can result in a missed TLB flush and eventually a
missed dirty page. In the unlikely event that the only dirty page(s) is
a clobbered SPTE, clear_dirty_gfn_range() will see the SPTE as not dirty
(based on the Dirty or Writable bit depending on the method) and so not
update the SPTE and ultimately not flush. If the SPTE is cached in the
TLB as writable before it is clobbered, the guest can continue writing
the associated page without ever taking a write-protect fault.
For most (all?) file back memory, dropping the Dirty bit is a non-issue.
The primary MMU write-protects its PTEs on writeback, i.e. KVM's dirty
bit is effectively ignored because the primary MMU will mark that page
dirty when the write-protection is lifted, e.g. when KVM faults the page
back in for write.
The Accessed bit is a complete non-issue. Aside from being unused for
non-leaf SPTEs, KVM doesn't do a TLB flush when aging SPTEs, i.e. the
Accessed bit may be dropped anyways.
Lastly, the Writable bit is also problematic as an extension of the Dirty
bit, as KVM (correctly) treats the Dirty bit as volatile iff the SPTE is
!DIRTY && WRITABLE. If KVM fixes an MMU-writable, but !WRITABLE, SPTE
out of mmu_lock, then it can allow the CPU to set the Dirty bit despite
the SPTE being !WRITABLE when it is checked by KVM. But that all depends
on the Dirty bit being problematic in the first place.
Fixes: 2f2fad0897cb ("kvm: x86/mmu: Add functions to handle changed TDP SPTEs")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: Venkatesh Srinivas <venkateshs@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-23 11:47:43 +08:00
|
|
|
__kvm_tdp_mmu_write_spte(iter->sptep, 0);
|
2021-02-03 02:57:27 +08:00
|
|
|
|
2022-01-20 07:07:25 +08:00
|
|
|
return 0;
|
2021-02-03 02:57:27 +08:00
|
|
|
}
|
|
|
|
|
2021-02-03 02:57:26 +08:00
|
|
|
|
2021-02-03 02:57:08 +08:00
|
|
|
/*
|
2023-03-22 06:00:19 +08:00
|
|
|
* tdp_mmu_set_spte - Set a TDP MMU SPTE and handle the associated bookkeeping
|
2022-02-26 08:15:30 +08:00
|
|
|
* @kvm: KVM instance
|
|
|
|
* @as_id: Address space ID, i.e. regular vs. SMM
|
|
|
|
* @sptep: Pointer to the SPTE
|
|
|
|
* @old_spte: The current value of the SPTE
|
|
|
|
* @new_spte: The new value that will be set for the SPTE
|
|
|
|
* @gfn: The base GFN that was (or will be) mapped by the SPTE
|
|
|
|
* @level: The level _containing_ the SPTE (its parent PT's level)
|
KVM: x86/mmu: Use atomic XCHG to write TDP MMU SPTEs with volatile bits
Use an atomic XCHG to write TDP MMU SPTEs that have volatile bits, even
if mmu_lock is held for write, as volatile SPTEs can be written by other
tasks/vCPUs outside of mmu_lock. If a vCPU uses the to-be-modified SPTE
to write a page, the CPU can cache the translation as WRITABLE in the TLB
despite it being seen by KVM as !WRITABLE, and/or KVM can clobber the
Accessed/Dirty bits and not properly tag the backing page.
Exempt non-leaf SPTEs from atomic updates as KVM itself doesn't modify
non-leaf SPTEs without holding mmu_lock, they do not have Dirty bits, and
KVM doesn't consume the Accessed bit of non-leaf SPTEs.
Dropping the Dirty and/or Writable bits is most problematic for dirty
logging, as doing so can result in a missed TLB flush and eventually a
missed dirty page. In the unlikely event that the only dirty page(s) is
a clobbered SPTE, clear_dirty_gfn_range() will see the SPTE as not dirty
(based on the Dirty or Writable bit depending on the method) and so not
update the SPTE and ultimately not flush. If the SPTE is cached in the
TLB as writable before it is clobbered, the guest can continue writing
the associated page without ever taking a write-protect fault.
For most (all?) file back memory, dropping the Dirty bit is a non-issue.
The primary MMU write-protects its PTEs on writeback, i.e. KVM's dirty
bit is effectively ignored because the primary MMU will mark that page
dirty when the write-protection is lifted, e.g. when KVM faults the page
back in for write.
The Accessed bit is a complete non-issue. Aside from being unused for
non-leaf SPTEs, KVM doesn't do a TLB flush when aging SPTEs, i.e. the
Accessed bit may be dropped anyways.
Lastly, the Writable bit is also problematic as an extension of the Dirty
bit, as KVM (correctly) treats the Dirty bit as volatile iff the SPTE is
!DIRTY && WRITABLE. If KVM fixes an MMU-writable, but !WRITABLE, SPTE
out of mmu_lock, then it can allow the CPU to set the Dirty bit despite
the SPTE being !WRITABLE when it is checked by KVM. But that all depends
on the Dirty bit being problematic in the first place.
Fixes: 2f2fad0897cb ("kvm: x86/mmu: Add functions to handle changed TDP SPTEs")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: Venkatesh Srinivas <venkateshs@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-23 11:47:43 +08:00
|
|
|
*
|
|
|
|
* Returns the old SPTE value, which _may_ be different than @old_spte if the
|
|
|
|
* SPTE had voldatile bits.
|
2021-02-03 02:57:08 +08:00
|
|
|
*/
|
2023-03-22 06:00:19 +08:00
|
|
|
static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
|
|
|
|
u64 old_spte, u64 new_spte, gfn_t gfn, int level)
|
2020-10-15 02:26:47 +08:00
|
|
|
{
|
2021-02-03 02:57:24 +08:00
|
|
|
lockdep_assert_held_write(&kvm->mmu_lock);
|
2021-02-03 02:57:09 +08:00
|
|
|
|
2021-02-03 02:57:27 +08:00
|
|
|
/*
|
2022-02-26 08:15:29 +08:00
|
|
|
* No thread should be using this function to set SPTEs to or from the
|
2021-02-03 02:57:27 +08:00
|
|
|
* temporary removed SPTE value.
|
|
|
|
* If operating under the MMU lock in read mode, tdp_mmu_set_spte_atomic
|
|
|
|
* should be used. If operating under the MMU lock in write mode, the
|
|
|
|
* use of the removed SPTE should not be necessary.
|
|
|
|
*/
|
KVM: x86/mmu: Convert "runtime" WARN_ON() assertions to WARN_ON_ONCE()
Convert all "runtime" assertions, i.e. assertions that can be triggered
while running vCPUs, from WARN_ON() to WARN_ON_ONCE(). Every WARN in the
MMU that is tied to running vCPUs, i.e. not contained to loading and
initializing KVM, is likely to fire _a lot_ when it does trigger. E.g. if
KVM ends up with a bug that causes a root to be invalidated before the
page fault handler is invoked, pretty much _every_ page fault VM-Exit
triggers the WARN.
If a WARN is triggered frequently, the resulting spam usually causes a lot
of damage of its own, e.g. consumes resources to log the WARN and pollutes
the kernel log, often to the point where other useful information can be
lost. In many case, the damage caused by the spam is actually worse than
the bug itself, e.g. KVM can almost always recover from an unexpectedly
invalid root.
On the flip side, warning every time is rarely helpful for debug and
triage, i.e. a single splat is usually sufficient to point a debugger in
the right direction, and automated testing, e.g. syzkaller, typically runs
with warn_on_panic=1, i.e. will never get past the first WARN anyways.
Lastly, when an assertions fails multiple times, the stack traces in KVM
are almost always identical, i.e. the full splat only needs to be captured
once. And _if_ there is value in captruing information about the failed
assert, a ratelimited printk() is sufficient and less likely to rack up a
large amount of collateral damage.
Link: https://lore.kernel.org/r/20230729004722.1056172-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-29 08:47:17 +08:00
|
|
|
WARN_ON_ONCE(is_removed_spte(old_spte) || is_removed_spte(new_spte));
|
2021-02-03 02:57:27 +08:00
|
|
|
|
KVM: x86/mmu: Use atomic XCHG to write TDP MMU SPTEs with volatile bits
Use an atomic XCHG to write TDP MMU SPTEs that have volatile bits, even
if mmu_lock is held for write, as volatile SPTEs can be written by other
tasks/vCPUs outside of mmu_lock. If a vCPU uses the to-be-modified SPTE
to write a page, the CPU can cache the translation as WRITABLE in the TLB
despite it being seen by KVM as !WRITABLE, and/or KVM can clobber the
Accessed/Dirty bits and not properly tag the backing page.
Exempt non-leaf SPTEs from atomic updates as KVM itself doesn't modify
non-leaf SPTEs without holding mmu_lock, they do not have Dirty bits, and
KVM doesn't consume the Accessed bit of non-leaf SPTEs.
Dropping the Dirty and/or Writable bits is most problematic for dirty
logging, as doing so can result in a missed TLB flush and eventually a
missed dirty page. In the unlikely event that the only dirty page(s) is
a clobbered SPTE, clear_dirty_gfn_range() will see the SPTE as not dirty
(based on the Dirty or Writable bit depending on the method) and so not
update the SPTE and ultimately not flush. If the SPTE is cached in the
TLB as writable before it is clobbered, the guest can continue writing
the associated page without ever taking a write-protect fault.
For most (all?) file back memory, dropping the Dirty bit is a non-issue.
The primary MMU write-protects its PTEs on writeback, i.e. KVM's dirty
bit is effectively ignored because the primary MMU will mark that page
dirty when the write-protection is lifted, e.g. when KVM faults the page
back in for write.
The Accessed bit is a complete non-issue. Aside from being unused for
non-leaf SPTEs, KVM doesn't do a TLB flush when aging SPTEs, i.e. the
Accessed bit may be dropped anyways.
Lastly, the Writable bit is also problematic as an extension of the Dirty
bit, as KVM (correctly) treats the Dirty bit as volatile iff the SPTE is
!DIRTY && WRITABLE. If KVM fixes an MMU-writable, but !WRITABLE, SPTE
out of mmu_lock, then it can allow the CPU to set the Dirty bit despite
the SPTE being !WRITABLE when it is checked by KVM. But that all depends
on the Dirty bit being problematic in the first place.
Fixes: 2f2fad0897cb ("kvm: x86/mmu: Add functions to handle changed TDP SPTEs")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: Venkatesh Srinivas <venkateshs@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-23 11:47:43 +08:00
|
|
|
old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte, level);
|
2022-02-26 08:15:30 +08:00
|
|
|
|
2023-03-22 06:00:21 +08:00
|
|
|
handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
|
KVM: x86/mmu: Use atomic XCHG to write TDP MMU SPTEs with volatile bits
Use an atomic XCHG to write TDP MMU SPTEs that have volatile bits, even
if mmu_lock is held for write, as volatile SPTEs can be written by other
tasks/vCPUs outside of mmu_lock. If a vCPU uses the to-be-modified SPTE
to write a page, the CPU can cache the translation as WRITABLE in the TLB
despite it being seen by KVM as !WRITABLE, and/or KVM can clobber the
Accessed/Dirty bits and not properly tag the backing page.
Exempt non-leaf SPTEs from atomic updates as KVM itself doesn't modify
non-leaf SPTEs without holding mmu_lock, they do not have Dirty bits, and
KVM doesn't consume the Accessed bit of non-leaf SPTEs.
Dropping the Dirty and/or Writable bits is most problematic for dirty
logging, as doing so can result in a missed TLB flush and eventually a
missed dirty page. In the unlikely event that the only dirty page(s) is
a clobbered SPTE, clear_dirty_gfn_range() will see the SPTE as not dirty
(based on the Dirty or Writable bit depending on the method) and so not
update the SPTE and ultimately not flush. If the SPTE is cached in the
TLB as writable before it is clobbered, the guest can continue writing
the associated page without ever taking a write-protect fault.
For most (all?) file back memory, dropping the Dirty bit is a non-issue.
The primary MMU write-protects its PTEs on writeback, i.e. KVM's dirty
bit is effectively ignored because the primary MMU will mark that page
dirty when the write-protection is lifted, e.g. when KVM faults the page
back in for write.
The Accessed bit is a complete non-issue. Aside from being unused for
non-leaf SPTEs, KVM doesn't do a TLB flush when aging SPTEs, i.e. the
Accessed bit may be dropped anyways.
Lastly, the Writable bit is also problematic as an extension of the Dirty
bit, as KVM (correctly) treats the Dirty bit as volatile iff the SPTE is
!DIRTY && WRITABLE. If KVM fixes an MMU-writable, but !WRITABLE, SPTE
out of mmu_lock, then it can allow the CPU to set the Dirty bit despite
the SPTE being !WRITABLE when it is checked by KVM. But that all depends
on the Dirty bit being problematic in the first place.
Fixes: 2f2fad0897cb ("kvm: x86/mmu: Add functions to handle changed TDP SPTEs")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: Venkatesh Srinivas <venkateshs@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-23 11:47:43 +08:00
|
|
|
return old_spte;
|
2022-02-26 08:15:30 +08:00
|
|
|
}
|
|
|
|
|
2023-03-22 06:00:19 +08:00
|
|
|
static inline void tdp_mmu_iter_set_spte(struct kvm *kvm, struct tdp_iter *iter,
|
|
|
|
u64 new_spte)
|
2022-02-26 08:15:30 +08:00
|
|
|
{
|
|
|
|
WARN_ON_ONCE(iter->yielded);
|
2023-03-22 06:00:19 +08:00
|
|
|
iter->old_spte = tdp_mmu_set_spte(kvm, iter->as_id, iter->sptep,
|
|
|
|
iter->old_spte, new_spte,
|
|
|
|
iter->gfn, iter->level);
|
2020-10-15 02:26:53 +08:00
|
|
|
}
|
2020-10-15 02:26:47 +08:00
|
|
|
|
|
|
|
#define tdp_root_for_each_pte(_iter, _root, _start, _end) \
|
2022-01-20 07:07:32 +08:00
|
|
|
for_each_tdp_pte(_iter, _root, _start, _end)
|
2020-10-15 02:26:47 +08:00
|
|
|
|
2020-10-15 02:26:53 +08:00
|
|
|
#define tdp_root_for_each_leaf_pte(_iter, _root, _start, _end) \
|
|
|
|
tdp_root_for_each_pte(_iter, _root, _start, _end) \
|
|
|
|
if (!is_shadow_present_pte(_iter.old_spte) || \
|
|
|
|
!is_last_spte(_iter.old_spte, _iter.level)) \
|
|
|
|
continue; \
|
|
|
|
else
|
|
|
|
|
2020-10-15 02:26:50 +08:00
|
|
|
#define tdp_mmu_for_each_pte(_iter, _mmu, _start, _end) \
|
2023-07-29 08:51:56 +08:00
|
|
|
for_each_tdp_pte(_iter, root_to_sp(_mmu->root.hpa), _start, _end)
|
2020-10-15 02:26:50 +08:00
|
|
|
|
2021-02-03 02:57:07 +08:00
|
|
|
/*
|
|
|
|
* Yield if the MMU lock is contended or this thread needs to return control
|
|
|
|
* to the scheduler.
|
|
|
|
*
|
2021-02-03 02:57:17 +08:00
|
|
|
* If this function should yield and flush is set, it will perform a remote
|
|
|
|
* TLB flush before yielding.
|
|
|
|
*
|
KVM: x86/mmu: Don't advance iterator after restart due to yielding
After dropping mmu_lock in the TDP MMU, restart the iterator during
tdp_iter_next() and do not advance the iterator. Advancing the iterator
results in skipping the top-level SPTE and all its children, which is
fatal if any of the skipped SPTEs were not visited before yielding.
When zapping all SPTEs, i.e. when min_level == root_level, restarting the
iter and then invoking tdp_iter_next() is always fatal if the current gfn
has as a valid SPTE, as advancing the iterator results in try_step_side()
skipping the current gfn, which wasn't visited before yielding.
Sprinkle WARNs on iter->yielded being true in various helpers that are
often used in conjunction with yielding, and tag the helper with
__must_check to reduce the probabily of improper usage.
Failing to zap a top-level SPTE manifests in one of two ways. If a valid
SPTE is skipped by both kvm_tdp_mmu_zap_all() and kvm_tdp_mmu_put_root(),
the shadow page will be leaked and KVM will WARN accordingly.
WARNING: CPU: 1 PID: 3509 at arch/x86/kvm/mmu/tdp_mmu.c:46 [kvm]
RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x3e/0x50 [kvm]
Call Trace:
<TASK>
kvm_arch_destroy_vm+0x130/0x1b0 [kvm]
kvm_destroy_vm+0x162/0x2a0 [kvm]
kvm_vcpu_release+0x34/0x60 [kvm]
__fput+0x82/0x240
task_work_run+0x5c/0x90
do_exit+0x364/0xa10
? futex_unqueue+0x38/0x60
do_group_exit+0x33/0xa0
get_signal+0x155/0x850
arch_do_signal_or_restart+0xed/0x750
exit_to_user_mode_prepare+0xc5/0x120
syscall_exit_to_user_mode+0x1d/0x40
do_syscall_64+0x48/0xc0
entry_SYSCALL_64_after_hwframe+0x44/0xae
If kvm_tdp_mmu_zap_all() skips a gfn/SPTE but that SPTE is then zapped by
kvm_tdp_mmu_put_root(), KVM triggers a use-after-free in the form of
marking a struct page as dirty/accessed after it has been put back on the
free list. This directly triggers a WARN due to encountering a page with
page_count() == 0, but it can also lead to data corruption and additional
errors in the kernel.
WARNING: CPU: 7 PID: 1995658 at arch/x86/kvm/../../../virt/kvm/kvm_main.c:171
RIP: 0010:kvm_is_zone_device_pfn.part.0+0x9e/0xd0 [kvm]
Call Trace:
<TASK>
kvm_set_pfn_dirty+0x120/0x1d0 [kvm]
__handle_changed_spte+0x92e/0xca0 [kvm]
__handle_changed_spte+0x63c/0xca0 [kvm]
__handle_changed_spte+0x63c/0xca0 [kvm]
__handle_changed_spte+0x63c/0xca0 [kvm]
zap_gfn_range+0x549/0x620 [kvm]
kvm_tdp_mmu_put_root+0x1b6/0x270 [kvm]
mmu_free_root_page+0x219/0x2c0 [kvm]
kvm_mmu_free_roots+0x1b4/0x4e0 [kvm]
kvm_mmu_unload+0x1c/0xa0 [kvm]
kvm_arch_destroy_vm+0x1f2/0x5c0 [kvm]
kvm_put_kvm+0x3b1/0x8b0 [kvm]
kvm_vcpu_release+0x4e/0x70 [kvm]
__fput+0x1f7/0x8c0
task_work_run+0xf8/0x1a0
do_exit+0x97b/0x2230
do_group_exit+0xda/0x2a0
get_signal+0x3be/0x1e50
arch_do_signal_or_restart+0x244/0x17f0
exit_to_user_mode_prepare+0xcb/0x120
syscall_exit_to_user_mode+0x1d/0x40
do_syscall_64+0x4d/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
Note, the underlying bug existed even before commit 1af4a96025b3 ("KVM:
x86/mmu: Yield in TDU MMU iter even if no SPTES changed") moved calls to
tdp_mmu_iter_cond_resched() to the beginning of loops, as KVM could still
incorrectly advance past a top-level entry when yielding on a lower-level
entry. But with respect to leaking shadow pages, the bug was introduced
by yielding before processing the current gfn.
Alternatively, tdp_mmu_iter_cond_resched() could simply fall through, or
callers could jump to their "retry" label. The downside of that approach
is that tdp_mmu_iter_cond_resched() _must_ be called before anything else
in the loop, and there's no easy way to enfornce that requirement.
Ideally, KVM would handling the cond_resched() fully within the iterator
macro (the code is actually quite clean) and avoid this entire class of
bugs, but that is extremely difficult do while also supporting yielding
after tdp_mmu_set_spte_atomic() fails. Yielding after failing to set a
SPTE is very desirable as the "owner" of the REMOVED_SPTE isn't strictly
bounded, e.g. if it's zapping a high-level shadow page, the REMOVED_SPTE
may block operations on the SPTE for a significant amount of time.
Fixes: faaf05b00aec ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
Fixes: 1af4a96025b3 ("KVM: x86/mmu: Yield in TDU MMU iter even if no SPTES changed")
Reported-by: Ignat Korchagin <ignat@cloudflare.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211214033528.123268-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-14 11:35:28 +08:00
|
|
|
* If this function yields, iter->yielded is set and the caller must skip to
|
|
|
|
* the next iteration, where tdp_iter_next() will reset the tdp_iter's walk
|
|
|
|
* over the paging structures to allow the iterator to continue its traversal
|
|
|
|
* from the paging structure root.
|
2021-02-03 02:57:07 +08:00
|
|
|
*
|
KVM: x86/mmu: Don't advance iterator after restart due to yielding
After dropping mmu_lock in the TDP MMU, restart the iterator during
tdp_iter_next() and do not advance the iterator. Advancing the iterator
results in skipping the top-level SPTE and all its children, which is
fatal if any of the skipped SPTEs were not visited before yielding.
When zapping all SPTEs, i.e. when min_level == root_level, restarting the
iter and then invoking tdp_iter_next() is always fatal if the current gfn
has as a valid SPTE, as advancing the iterator results in try_step_side()
skipping the current gfn, which wasn't visited before yielding.
Sprinkle WARNs on iter->yielded being true in various helpers that are
often used in conjunction with yielding, and tag the helper with
__must_check to reduce the probabily of improper usage.
Failing to zap a top-level SPTE manifests in one of two ways. If a valid
SPTE is skipped by both kvm_tdp_mmu_zap_all() and kvm_tdp_mmu_put_root(),
the shadow page will be leaked and KVM will WARN accordingly.
WARNING: CPU: 1 PID: 3509 at arch/x86/kvm/mmu/tdp_mmu.c:46 [kvm]
RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x3e/0x50 [kvm]
Call Trace:
<TASK>
kvm_arch_destroy_vm+0x130/0x1b0 [kvm]
kvm_destroy_vm+0x162/0x2a0 [kvm]
kvm_vcpu_release+0x34/0x60 [kvm]
__fput+0x82/0x240
task_work_run+0x5c/0x90
do_exit+0x364/0xa10
? futex_unqueue+0x38/0x60
do_group_exit+0x33/0xa0
get_signal+0x155/0x850
arch_do_signal_or_restart+0xed/0x750
exit_to_user_mode_prepare+0xc5/0x120
syscall_exit_to_user_mode+0x1d/0x40
do_syscall_64+0x48/0xc0
entry_SYSCALL_64_after_hwframe+0x44/0xae
If kvm_tdp_mmu_zap_all() skips a gfn/SPTE but that SPTE is then zapped by
kvm_tdp_mmu_put_root(), KVM triggers a use-after-free in the form of
marking a struct page as dirty/accessed after it has been put back on the
free list. This directly triggers a WARN due to encountering a page with
page_count() == 0, but it can also lead to data corruption and additional
errors in the kernel.
WARNING: CPU: 7 PID: 1995658 at arch/x86/kvm/../../../virt/kvm/kvm_main.c:171
RIP: 0010:kvm_is_zone_device_pfn.part.0+0x9e/0xd0 [kvm]
Call Trace:
<TASK>
kvm_set_pfn_dirty+0x120/0x1d0 [kvm]
__handle_changed_spte+0x92e/0xca0 [kvm]
__handle_changed_spte+0x63c/0xca0 [kvm]
__handle_changed_spte+0x63c/0xca0 [kvm]
__handle_changed_spte+0x63c/0xca0 [kvm]
zap_gfn_range+0x549/0x620 [kvm]
kvm_tdp_mmu_put_root+0x1b6/0x270 [kvm]
mmu_free_root_page+0x219/0x2c0 [kvm]
kvm_mmu_free_roots+0x1b4/0x4e0 [kvm]
kvm_mmu_unload+0x1c/0xa0 [kvm]
kvm_arch_destroy_vm+0x1f2/0x5c0 [kvm]
kvm_put_kvm+0x3b1/0x8b0 [kvm]
kvm_vcpu_release+0x4e/0x70 [kvm]
__fput+0x1f7/0x8c0
task_work_run+0xf8/0x1a0
do_exit+0x97b/0x2230
do_group_exit+0xda/0x2a0
get_signal+0x3be/0x1e50
arch_do_signal_or_restart+0x244/0x17f0
exit_to_user_mode_prepare+0xcb/0x120
syscall_exit_to_user_mode+0x1d/0x40
do_syscall_64+0x4d/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
Note, the underlying bug existed even before commit 1af4a96025b3 ("KVM:
x86/mmu: Yield in TDU MMU iter even if no SPTES changed") moved calls to
tdp_mmu_iter_cond_resched() to the beginning of loops, as KVM could still
incorrectly advance past a top-level entry when yielding on a lower-level
entry. But with respect to leaking shadow pages, the bug was introduced
by yielding before processing the current gfn.
Alternatively, tdp_mmu_iter_cond_resched() could simply fall through, or
callers could jump to their "retry" label. The downside of that approach
is that tdp_mmu_iter_cond_resched() _must_ be called before anything else
in the loop, and there's no easy way to enfornce that requirement.
Ideally, KVM would handling the cond_resched() fully within the iterator
macro (the code is actually quite clean) and avoid this entire class of
bugs, but that is extremely difficult do while also supporting yielding
after tdp_mmu_set_spte_atomic() fails. Yielding after failing to set a
SPTE is very desirable as the "owner" of the REMOVED_SPTE isn't strictly
bounded, e.g. if it's zapping a high-level shadow page, the REMOVED_SPTE
may block operations on the SPTE for a significant amount of time.
Fixes: faaf05b00aec ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
Fixes: 1af4a96025b3 ("KVM: x86/mmu: Yield in TDU MMU iter even if no SPTES changed")
Reported-by: Ignat Korchagin <ignat@cloudflare.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211214033528.123268-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-14 11:35:28 +08:00
|
|
|
* Returns true if this function yielded.
|
2021-02-03 02:57:07 +08:00
|
|
|
*/
|
KVM: x86/mmu: Don't advance iterator after restart due to yielding
After dropping mmu_lock in the TDP MMU, restart the iterator during
tdp_iter_next() and do not advance the iterator. Advancing the iterator
results in skipping the top-level SPTE and all its children, which is
fatal if any of the skipped SPTEs were not visited before yielding.
When zapping all SPTEs, i.e. when min_level == root_level, restarting the
iter and then invoking tdp_iter_next() is always fatal if the current gfn
has as a valid SPTE, as advancing the iterator results in try_step_side()
skipping the current gfn, which wasn't visited before yielding.
Sprinkle WARNs on iter->yielded being true in various helpers that are
often used in conjunction with yielding, and tag the helper with
__must_check to reduce the probabily of improper usage.
Failing to zap a top-level SPTE manifests in one of two ways. If a valid
SPTE is skipped by both kvm_tdp_mmu_zap_all() and kvm_tdp_mmu_put_root(),
the shadow page will be leaked and KVM will WARN accordingly.
WARNING: CPU: 1 PID: 3509 at arch/x86/kvm/mmu/tdp_mmu.c:46 [kvm]
RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x3e/0x50 [kvm]
Call Trace:
<TASK>
kvm_arch_destroy_vm+0x130/0x1b0 [kvm]
kvm_destroy_vm+0x162/0x2a0 [kvm]
kvm_vcpu_release+0x34/0x60 [kvm]
__fput+0x82/0x240
task_work_run+0x5c/0x90
do_exit+0x364/0xa10
? futex_unqueue+0x38/0x60
do_group_exit+0x33/0xa0
get_signal+0x155/0x850
arch_do_signal_or_restart+0xed/0x750
exit_to_user_mode_prepare+0xc5/0x120
syscall_exit_to_user_mode+0x1d/0x40
do_syscall_64+0x48/0xc0
entry_SYSCALL_64_after_hwframe+0x44/0xae
If kvm_tdp_mmu_zap_all() skips a gfn/SPTE but that SPTE is then zapped by
kvm_tdp_mmu_put_root(), KVM triggers a use-after-free in the form of
marking a struct page as dirty/accessed after it has been put back on the
free list. This directly triggers a WARN due to encountering a page with
page_count() == 0, but it can also lead to data corruption and additional
errors in the kernel.
WARNING: CPU: 7 PID: 1995658 at arch/x86/kvm/../../../virt/kvm/kvm_main.c:171
RIP: 0010:kvm_is_zone_device_pfn.part.0+0x9e/0xd0 [kvm]
Call Trace:
<TASK>
kvm_set_pfn_dirty+0x120/0x1d0 [kvm]
__handle_changed_spte+0x92e/0xca0 [kvm]
__handle_changed_spte+0x63c/0xca0 [kvm]
__handle_changed_spte+0x63c/0xca0 [kvm]
__handle_changed_spte+0x63c/0xca0 [kvm]
zap_gfn_range+0x549/0x620 [kvm]
kvm_tdp_mmu_put_root+0x1b6/0x270 [kvm]
mmu_free_root_page+0x219/0x2c0 [kvm]
kvm_mmu_free_roots+0x1b4/0x4e0 [kvm]
kvm_mmu_unload+0x1c/0xa0 [kvm]
kvm_arch_destroy_vm+0x1f2/0x5c0 [kvm]
kvm_put_kvm+0x3b1/0x8b0 [kvm]
kvm_vcpu_release+0x4e/0x70 [kvm]
__fput+0x1f7/0x8c0
task_work_run+0xf8/0x1a0
do_exit+0x97b/0x2230
do_group_exit+0xda/0x2a0
get_signal+0x3be/0x1e50
arch_do_signal_or_restart+0x244/0x17f0
exit_to_user_mode_prepare+0xcb/0x120
syscall_exit_to_user_mode+0x1d/0x40
do_syscall_64+0x4d/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
Note, the underlying bug existed even before commit 1af4a96025b3 ("KVM:
x86/mmu: Yield in TDU MMU iter even if no SPTES changed") moved calls to
tdp_mmu_iter_cond_resched() to the beginning of loops, as KVM could still
incorrectly advance past a top-level entry when yielding on a lower-level
entry. But with respect to leaking shadow pages, the bug was introduced
by yielding before processing the current gfn.
Alternatively, tdp_mmu_iter_cond_resched() could simply fall through, or
callers could jump to their "retry" label. The downside of that approach
is that tdp_mmu_iter_cond_resched() _must_ be called before anything else
in the loop, and there's no easy way to enfornce that requirement.
Ideally, KVM would handling the cond_resched() fully within the iterator
macro (the code is actually quite clean) and avoid this entire class of
bugs, but that is extremely difficult do while also supporting yielding
after tdp_mmu_set_spte_atomic() fails. Yielding after failing to set a
SPTE is very desirable as the "owner" of the REMOVED_SPTE isn't strictly
bounded, e.g. if it's zapping a high-level shadow page, the REMOVED_SPTE
may block operations on the SPTE for a significant amount of time.
Fixes: faaf05b00aec ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
Fixes: 1af4a96025b3 ("KVM: x86/mmu: Yield in TDU MMU iter even if no SPTES changed")
Reported-by: Ignat Korchagin <ignat@cloudflare.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211214033528.123268-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-14 11:35:28 +08:00
|
|
|
static inline bool __must_check tdp_mmu_iter_cond_resched(struct kvm *kvm,
|
|
|
|
struct tdp_iter *iter,
|
|
|
|
bool flush, bool shared)
|
2020-10-15 02:26:55 +08:00
|
|
|
{
|
KVM: x86/mmu: Convert "runtime" WARN_ON() assertions to WARN_ON_ONCE()
Convert all "runtime" assertions, i.e. assertions that can be triggered
while running vCPUs, from WARN_ON() to WARN_ON_ONCE(). Every WARN in the
MMU that is tied to running vCPUs, i.e. not contained to loading and
initializing KVM, is likely to fire _a lot_ when it does trigger. E.g. if
KVM ends up with a bug that causes a root to be invalidated before the
page fault handler is invoked, pretty much _every_ page fault VM-Exit
triggers the WARN.
If a WARN is triggered frequently, the resulting spam usually causes a lot
of damage of its own, e.g. consumes resources to log the WARN and pollutes
the kernel log, often to the point where other useful information can be
lost. In many case, the damage caused by the spam is actually worse than
the bug itself, e.g. KVM can almost always recover from an unexpectedly
invalid root.
On the flip side, warning every time is rarely helpful for debug and
triage, i.e. a single splat is usually sufficient to point a debugger in
the right direction, and automated testing, e.g. syzkaller, typically runs
with warn_on_panic=1, i.e. will never get past the first WARN anyways.
Lastly, when an assertions fails multiple times, the stack traces in KVM
are almost always identical, i.e. the full splat only needs to be captured
once. And _if_ there is value in captruing information about the failed
assert, a ratelimited printk() is sufficient and less likely to rack up a
large amount of collateral damage.
Link: https://lore.kernel.org/r/20230729004722.1056172-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-29 08:47:17 +08:00
|
|
|
WARN_ON_ONCE(iter->yielded);
|
KVM: x86/mmu: Don't advance iterator after restart due to yielding
After dropping mmu_lock in the TDP MMU, restart the iterator during
tdp_iter_next() and do not advance the iterator. Advancing the iterator
results in skipping the top-level SPTE and all its children, which is
fatal if any of the skipped SPTEs were not visited before yielding.
When zapping all SPTEs, i.e. when min_level == root_level, restarting the
iter and then invoking tdp_iter_next() is always fatal if the current gfn
has as a valid SPTE, as advancing the iterator results in try_step_side()
skipping the current gfn, which wasn't visited before yielding.
Sprinkle WARNs on iter->yielded being true in various helpers that are
often used in conjunction with yielding, and tag the helper with
__must_check to reduce the probabily of improper usage.
Failing to zap a top-level SPTE manifests in one of two ways. If a valid
SPTE is skipped by both kvm_tdp_mmu_zap_all() and kvm_tdp_mmu_put_root(),
the shadow page will be leaked and KVM will WARN accordingly.
WARNING: CPU: 1 PID: 3509 at arch/x86/kvm/mmu/tdp_mmu.c:46 [kvm]
RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x3e/0x50 [kvm]
Call Trace:
<TASK>
kvm_arch_destroy_vm+0x130/0x1b0 [kvm]
kvm_destroy_vm+0x162/0x2a0 [kvm]
kvm_vcpu_release+0x34/0x60 [kvm]
__fput+0x82/0x240
task_work_run+0x5c/0x90
do_exit+0x364/0xa10
? futex_unqueue+0x38/0x60
do_group_exit+0x33/0xa0
get_signal+0x155/0x850
arch_do_signal_or_restart+0xed/0x750
exit_to_user_mode_prepare+0xc5/0x120
syscall_exit_to_user_mode+0x1d/0x40
do_syscall_64+0x48/0xc0
entry_SYSCALL_64_after_hwframe+0x44/0xae
If kvm_tdp_mmu_zap_all() skips a gfn/SPTE but that SPTE is then zapped by
kvm_tdp_mmu_put_root(), KVM triggers a use-after-free in the form of
marking a struct page as dirty/accessed after it has been put back on the
free list. This directly triggers a WARN due to encountering a page with
page_count() == 0, but it can also lead to data corruption and additional
errors in the kernel.
WARNING: CPU: 7 PID: 1995658 at arch/x86/kvm/../../../virt/kvm/kvm_main.c:171
RIP: 0010:kvm_is_zone_device_pfn.part.0+0x9e/0xd0 [kvm]
Call Trace:
<TASK>
kvm_set_pfn_dirty+0x120/0x1d0 [kvm]
__handle_changed_spte+0x92e/0xca0 [kvm]
__handle_changed_spte+0x63c/0xca0 [kvm]
__handle_changed_spte+0x63c/0xca0 [kvm]
__handle_changed_spte+0x63c/0xca0 [kvm]
zap_gfn_range+0x549/0x620 [kvm]
kvm_tdp_mmu_put_root+0x1b6/0x270 [kvm]
mmu_free_root_page+0x219/0x2c0 [kvm]
kvm_mmu_free_roots+0x1b4/0x4e0 [kvm]
kvm_mmu_unload+0x1c/0xa0 [kvm]
kvm_arch_destroy_vm+0x1f2/0x5c0 [kvm]
kvm_put_kvm+0x3b1/0x8b0 [kvm]
kvm_vcpu_release+0x4e/0x70 [kvm]
__fput+0x1f7/0x8c0
task_work_run+0xf8/0x1a0
do_exit+0x97b/0x2230
do_group_exit+0xda/0x2a0
get_signal+0x3be/0x1e50
arch_do_signal_or_restart+0x244/0x17f0
exit_to_user_mode_prepare+0xcb/0x120
syscall_exit_to_user_mode+0x1d/0x40
do_syscall_64+0x4d/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
Note, the underlying bug existed even before commit 1af4a96025b3 ("KVM:
x86/mmu: Yield in TDU MMU iter even if no SPTES changed") moved calls to
tdp_mmu_iter_cond_resched() to the beginning of loops, as KVM could still
incorrectly advance past a top-level entry when yielding on a lower-level
entry. But with respect to leaking shadow pages, the bug was introduced
by yielding before processing the current gfn.
Alternatively, tdp_mmu_iter_cond_resched() could simply fall through, or
callers could jump to their "retry" label. The downside of that approach
is that tdp_mmu_iter_cond_resched() _must_ be called before anything else
in the loop, and there's no easy way to enfornce that requirement.
Ideally, KVM would handling the cond_resched() fully within the iterator
macro (the code is actually quite clean) and avoid this entire class of
bugs, but that is extremely difficult do while also supporting yielding
after tdp_mmu_set_spte_atomic() fails. Yielding after failing to set a
SPTE is very desirable as the "owner" of the REMOVED_SPTE isn't strictly
bounded, e.g. if it's zapping a high-level shadow page, the REMOVED_SPTE
may block operations on the SPTE for a significant amount of time.
Fixes: faaf05b00aec ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
Fixes: 1af4a96025b3 ("KVM: x86/mmu: Yield in TDU MMU iter even if no SPTES changed")
Reported-by: Ignat Korchagin <ignat@cloudflare.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211214033528.123268-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-14 11:35:28 +08:00
|
|
|
|
2021-02-03 02:57:19 +08:00
|
|
|
/* Ensure forward progress has been made before yielding. */
|
|
|
|
if (iter->next_last_level_gfn == iter->yielded_gfn)
|
|
|
|
return false;
|
|
|
|
|
2021-02-03 02:57:24 +08:00
|
|
|
if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
|
2021-02-03 02:57:17 +08:00
|
|
|
if (flush)
|
|
|
|
kvm_flush_remote_tlbs(kvm);
|
|
|
|
|
2022-02-26 08:15:36 +08:00
|
|
|
rcu_read_unlock();
|
|
|
|
|
2021-04-02 07:37:32 +08:00
|
|
|
if (shared)
|
|
|
|
cond_resched_rwlock_read(&kvm->mmu_lock);
|
|
|
|
else
|
|
|
|
cond_resched_rwlock_write(&kvm->mmu_lock);
|
|
|
|
|
2021-02-03 02:57:23 +08:00
|
|
|
rcu_read_lock();
|
2021-02-03 02:57:19 +08:00
|
|
|
|
KVM: x86/mmu: Convert "runtime" WARN_ON() assertions to WARN_ON_ONCE()
Convert all "runtime" assertions, i.e. assertions that can be triggered
while running vCPUs, from WARN_ON() to WARN_ON_ONCE(). Every WARN in the
MMU that is tied to running vCPUs, i.e. not contained to loading and
initializing KVM, is likely to fire _a lot_ when it does trigger. E.g. if
KVM ends up with a bug that causes a root to be invalidated before the
page fault handler is invoked, pretty much _every_ page fault VM-Exit
triggers the WARN.
If a WARN is triggered frequently, the resulting spam usually causes a lot
of damage of its own, e.g. consumes resources to log the WARN and pollutes
the kernel log, often to the point where other useful information can be
lost. In many case, the damage caused by the spam is actually worse than
the bug itself, e.g. KVM can almost always recover from an unexpectedly
invalid root.
On the flip side, warning every time is rarely helpful for debug and
triage, i.e. a single splat is usually sufficient to point a debugger in
the right direction, and automated testing, e.g. syzkaller, typically runs
with warn_on_panic=1, i.e. will never get past the first WARN anyways.
Lastly, when an assertions fails multiple times, the stack traces in KVM
are almost always identical, i.e. the full splat only needs to be captured
once. And _if_ there is value in captruing information about the failed
assert, a ratelimited printk() is sufficient and less likely to rack up a
large amount of collateral damage.
Link: https://lore.kernel.org/r/20230729004722.1056172-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-29 08:47:17 +08:00
|
|
|
WARN_ON_ONCE(iter->gfn > iter->next_last_level_gfn);
|
2021-02-03 02:57:19 +08:00
|
|
|
|
KVM: x86/mmu: Don't advance iterator after restart due to yielding
After dropping mmu_lock in the TDP MMU, restart the iterator during
tdp_iter_next() and do not advance the iterator. Advancing the iterator
results in skipping the top-level SPTE and all its children, which is
fatal if any of the skipped SPTEs were not visited before yielding.
When zapping all SPTEs, i.e. when min_level == root_level, restarting the
iter and then invoking tdp_iter_next() is always fatal if the current gfn
has as a valid SPTE, as advancing the iterator results in try_step_side()
skipping the current gfn, which wasn't visited before yielding.
Sprinkle WARNs on iter->yielded being true in various helpers that are
often used in conjunction with yielding, and tag the helper with
__must_check to reduce the probabily of improper usage.
Failing to zap a top-level SPTE manifests in one of two ways. If a valid
SPTE is skipped by both kvm_tdp_mmu_zap_all() and kvm_tdp_mmu_put_root(),
the shadow page will be leaked and KVM will WARN accordingly.
WARNING: CPU: 1 PID: 3509 at arch/x86/kvm/mmu/tdp_mmu.c:46 [kvm]
RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x3e/0x50 [kvm]
Call Trace:
<TASK>
kvm_arch_destroy_vm+0x130/0x1b0 [kvm]
kvm_destroy_vm+0x162/0x2a0 [kvm]
kvm_vcpu_release+0x34/0x60 [kvm]
__fput+0x82/0x240
task_work_run+0x5c/0x90
do_exit+0x364/0xa10
? futex_unqueue+0x38/0x60
do_group_exit+0x33/0xa0
get_signal+0x155/0x850
arch_do_signal_or_restart+0xed/0x750
exit_to_user_mode_prepare+0xc5/0x120
syscall_exit_to_user_mode+0x1d/0x40
do_syscall_64+0x48/0xc0
entry_SYSCALL_64_after_hwframe+0x44/0xae
If kvm_tdp_mmu_zap_all() skips a gfn/SPTE but that SPTE is then zapped by
kvm_tdp_mmu_put_root(), KVM triggers a use-after-free in the form of
marking a struct page as dirty/accessed after it has been put back on the
free list. This directly triggers a WARN due to encountering a page with
page_count() == 0, but it can also lead to data corruption and additional
errors in the kernel.
WARNING: CPU: 7 PID: 1995658 at arch/x86/kvm/../../../virt/kvm/kvm_main.c:171
RIP: 0010:kvm_is_zone_device_pfn.part.0+0x9e/0xd0 [kvm]
Call Trace:
<TASK>
kvm_set_pfn_dirty+0x120/0x1d0 [kvm]
__handle_changed_spte+0x92e/0xca0 [kvm]
__handle_changed_spte+0x63c/0xca0 [kvm]
__handle_changed_spte+0x63c/0xca0 [kvm]
__handle_changed_spte+0x63c/0xca0 [kvm]
zap_gfn_range+0x549/0x620 [kvm]
kvm_tdp_mmu_put_root+0x1b6/0x270 [kvm]
mmu_free_root_page+0x219/0x2c0 [kvm]
kvm_mmu_free_roots+0x1b4/0x4e0 [kvm]
kvm_mmu_unload+0x1c/0xa0 [kvm]
kvm_arch_destroy_vm+0x1f2/0x5c0 [kvm]
kvm_put_kvm+0x3b1/0x8b0 [kvm]
kvm_vcpu_release+0x4e/0x70 [kvm]
__fput+0x1f7/0x8c0
task_work_run+0xf8/0x1a0
do_exit+0x97b/0x2230
do_group_exit+0xda/0x2a0
get_signal+0x3be/0x1e50
arch_do_signal_or_restart+0x244/0x17f0
exit_to_user_mode_prepare+0xcb/0x120
syscall_exit_to_user_mode+0x1d/0x40
do_syscall_64+0x4d/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
Note, the underlying bug existed even before commit 1af4a96025b3 ("KVM:
x86/mmu: Yield in TDU MMU iter even if no SPTES changed") moved calls to
tdp_mmu_iter_cond_resched() to the beginning of loops, as KVM could still
incorrectly advance past a top-level entry when yielding on a lower-level
entry. But with respect to leaking shadow pages, the bug was introduced
by yielding before processing the current gfn.
Alternatively, tdp_mmu_iter_cond_resched() could simply fall through, or
callers could jump to their "retry" label. The downside of that approach
is that tdp_mmu_iter_cond_resched() _must_ be called before anything else
in the loop, and there's no easy way to enfornce that requirement.
Ideally, KVM would handling the cond_resched() fully within the iterator
macro (the code is actually quite clean) and avoid this entire class of
bugs, but that is extremely difficult do while also supporting yielding
after tdp_mmu_set_spte_atomic() fails. Yielding after failing to set a
SPTE is very desirable as the "owner" of the REMOVED_SPTE isn't strictly
bounded, e.g. if it's zapping a high-level shadow page, the REMOVED_SPTE
may block operations on the SPTE for a significant amount of time.
Fixes: faaf05b00aec ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
Fixes: 1af4a96025b3 ("KVM: x86/mmu: Yield in TDU MMU iter even if no SPTES changed")
Reported-by: Ignat Korchagin <ignat@cloudflare.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211214033528.123268-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-14 11:35:28 +08:00
|
|
|
iter->yielded = true;
|
2020-10-15 02:26:55 +08:00
|
|
|
}
|
2021-02-03 02:57:07 +08:00
|
|
|
|
KVM: x86/mmu: Don't advance iterator after restart due to yielding
After dropping mmu_lock in the TDP MMU, restart the iterator during
tdp_iter_next() and do not advance the iterator. Advancing the iterator
results in skipping the top-level SPTE and all its children, which is
fatal if any of the skipped SPTEs were not visited before yielding.
When zapping all SPTEs, i.e. when min_level == root_level, restarting the
iter and then invoking tdp_iter_next() is always fatal if the current gfn
has as a valid SPTE, as advancing the iterator results in try_step_side()
skipping the current gfn, which wasn't visited before yielding.
Sprinkle WARNs on iter->yielded being true in various helpers that are
often used in conjunction with yielding, and tag the helper with
__must_check to reduce the probabily of improper usage.
Failing to zap a top-level SPTE manifests in one of two ways. If a valid
SPTE is skipped by both kvm_tdp_mmu_zap_all() and kvm_tdp_mmu_put_root(),
the shadow page will be leaked and KVM will WARN accordingly.
WARNING: CPU: 1 PID: 3509 at arch/x86/kvm/mmu/tdp_mmu.c:46 [kvm]
RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x3e/0x50 [kvm]
Call Trace:
<TASK>
kvm_arch_destroy_vm+0x130/0x1b0 [kvm]
kvm_destroy_vm+0x162/0x2a0 [kvm]
kvm_vcpu_release+0x34/0x60 [kvm]
__fput+0x82/0x240
task_work_run+0x5c/0x90
do_exit+0x364/0xa10
? futex_unqueue+0x38/0x60
do_group_exit+0x33/0xa0
get_signal+0x155/0x850
arch_do_signal_or_restart+0xed/0x750
exit_to_user_mode_prepare+0xc5/0x120
syscall_exit_to_user_mode+0x1d/0x40
do_syscall_64+0x48/0xc0
entry_SYSCALL_64_after_hwframe+0x44/0xae
If kvm_tdp_mmu_zap_all() skips a gfn/SPTE but that SPTE is then zapped by
kvm_tdp_mmu_put_root(), KVM triggers a use-after-free in the form of
marking a struct page as dirty/accessed after it has been put back on the
free list. This directly triggers a WARN due to encountering a page with
page_count() == 0, but it can also lead to data corruption and additional
errors in the kernel.
WARNING: CPU: 7 PID: 1995658 at arch/x86/kvm/../../../virt/kvm/kvm_main.c:171
RIP: 0010:kvm_is_zone_device_pfn.part.0+0x9e/0xd0 [kvm]
Call Trace:
<TASK>
kvm_set_pfn_dirty+0x120/0x1d0 [kvm]
__handle_changed_spte+0x92e/0xca0 [kvm]
__handle_changed_spte+0x63c/0xca0 [kvm]
__handle_changed_spte+0x63c/0xca0 [kvm]
__handle_changed_spte+0x63c/0xca0 [kvm]
zap_gfn_range+0x549/0x620 [kvm]
kvm_tdp_mmu_put_root+0x1b6/0x270 [kvm]
mmu_free_root_page+0x219/0x2c0 [kvm]
kvm_mmu_free_roots+0x1b4/0x4e0 [kvm]
kvm_mmu_unload+0x1c/0xa0 [kvm]
kvm_arch_destroy_vm+0x1f2/0x5c0 [kvm]
kvm_put_kvm+0x3b1/0x8b0 [kvm]
kvm_vcpu_release+0x4e/0x70 [kvm]
__fput+0x1f7/0x8c0
task_work_run+0xf8/0x1a0
do_exit+0x97b/0x2230
do_group_exit+0xda/0x2a0
get_signal+0x3be/0x1e50
arch_do_signal_or_restart+0x244/0x17f0
exit_to_user_mode_prepare+0xcb/0x120
syscall_exit_to_user_mode+0x1d/0x40
do_syscall_64+0x4d/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
Note, the underlying bug existed even before commit 1af4a96025b3 ("KVM:
x86/mmu: Yield in TDU MMU iter even if no SPTES changed") moved calls to
tdp_mmu_iter_cond_resched() to the beginning of loops, as KVM could still
incorrectly advance past a top-level entry when yielding on a lower-level
entry. But with respect to leaking shadow pages, the bug was introduced
by yielding before processing the current gfn.
Alternatively, tdp_mmu_iter_cond_resched() could simply fall through, or
callers could jump to their "retry" label. The downside of that approach
is that tdp_mmu_iter_cond_resched() _must_ be called before anything else
in the loop, and there's no easy way to enfornce that requirement.
Ideally, KVM would handling the cond_resched() fully within the iterator
macro (the code is actually quite clean) and avoid this entire class of
bugs, but that is extremely difficult do while also supporting yielding
after tdp_mmu_set_spte_atomic() fails. Yielding after failing to set a
SPTE is very desirable as the "owner" of the REMOVED_SPTE isn't strictly
bounded, e.g. if it's zapping a high-level shadow page, the REMOVED_SPTE
may block operations on the SPTE for a significant amount of time.
Fixes: faaf05b00aec ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
Fixes: 1af4a96025b3 ("KVM: x86/mmu: Yield in TDU MMU iter even if no SPTES changed")
Reported-by: Ignat Korchagin <ignat@cloudflare.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211214033528.123268-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-14 11:35:28 +08:00
|
|
|
return iter->yielded;
|
2020-10-15 02:26:55 +08:00
|
|
|
}
|
|
|
|
|
KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR
Disallow memslots and MMIO SPTEs whose gpa range would exceed the host's
MAXPHYADDR, i.e. don't create SPTEs for gfns that exceed host.MAXPHYADDR.
The TDP MMU bounds its zapping based on host.MAXPHYADDR, and so if the
guest, possibly with help from userspace, manages to coerce KVM into
creating a SPTE for an "impossible" gfn, KVM will leak the associated
shadow pages (page tables):
WARNING: CPU: 10 PID: 1122 at arch/x86/kvm/mmu/tdp_mmu.c:57
kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm]
Modules linked in: kvm_intel kvm irqbypass
CPU: 10 PID: 1122 Comm: set_memory_regi Tainted: G W 5.18.0-rc1+ #293
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm]
Call Trace:
<TASK>
kvm_arch_destroy_vm+0x130/0x1b0 [kvm]
kvm_destroy_vm+0x162/0x2d0 [kvm]
kvm_vm_release+0x1d/0x30 [kvm]
__fput+0x82/0x240
task_work_run+0x5b/0x90
exit_to_user_mode_prepare+0xd2/0xe0
syscall_exit_to_user_mode+0x1d/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xae
</TASK>
On bare metal, encountering an impossible gpa in the page fault path is
well and truly impossible, barring CPU bugs, as the CPU will signal #PF
during the gva=>gpa translation (or a similar failure when stuffing a
physical address into e.g. the VMCS/VMCB). But if KVM is running as a VM
itself, the MAXPHYADDR enumerated to KVM may not be the actual MAXPHYADDR
of the underlying hardware, in which case the hardware will not fault on
the illegal-from-KVM's-perspective gpa.
Alternatively, KVM could continue allowing the dodgy behavior and simply
zap the max possible range. But, for hosts with MAXPHYADDR < 52, that's
a (minor) waste of cycles, and more importantly, KVM can't reasonably
support impossible memslots when running on bare metal (or with an
accurate MAXPHYADDR as a VM). Note, limiting the overhead by checking if
KVM is running as a guest is not a safe option as the host isn't required
to announce itself to the guest in any way, e.g. doesn't need to set the
HYPERVISOR CPUID bit.
A second alternative to disallowing the memslot behavior would be to
disallow creating a VM with guest.MAXPHYADDR > host.MAXPHYADDR. That
restriction is undesirable as there are legitimate use cases for doing
so, e.g. using the highest host.MAXPHYADDR out of a pool of heterogeneous
systems so that VMs can be migrated between hosts with different
MAXPHYADDRs without running afoul of the allow_smaller_maxphyaddr mess.
Note that any guest.MAXPHYADDR is valid with shadow paging, and it is
even useful in order to test KVM with MAXPHYADDR=52 (i.e. without
any reserved physical address bits).
The now common kvm_mmu_max_gfn() is inclusive instead of exclusive.
The memslot and TDP MMU code want an exclusive value, but the name
implies the returned value is inclusive, and the MMIO path needs an
inclusive check.
Fixes: faaf05b00aec ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
Fixes: 524a1e4e381f ("KVM: x86/mmu: Don't leak non-leaf SPTEs when zapping all SPTEs")
Cc: stable@vger.kernel.org
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Cc: Ben Gardon <bgardon@google.com>
Cc: David Matlack <dmatlack@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220428233416.2446833-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29 07:34:16 +08:00
|
|
|
static inline gfn_t tdp_mmu_max_gfn_exclusive(void)
|
2022-02-26 08:15:33 +08:00
|
|
|
{
|
|
|
|
/*
|
KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR
Disallow memslots and MMIO SPTEs whose gpa range would exceed the host's
MAXPHYADDR, i.e. don't create SPTEs for gfns that exceed host.MAXPHYADDR.
The TDP MMU bounds its zapping based on host.MAXPHYADDR, and so if the
guest, possibly with help from userspace, manages to coerce KVM into
creating a SPTE for an "impossible" gfn, KVM will leak the associated
shadow pages (page tables):
WARNING: CPU: 10 PID: 1122 at arch/x86/kvm/mmu/tdp_mmu.c:57
kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm]
Modules linked in: kvm_intel kvm irqbypass
CPU: 10 PID: 1122 Comm: set_memory_regi Tainted: G W 5.18.0-rc1+ #293
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm]
Call Trace:
<TASK>
kvm_arch_destroy_vm+0x130/0x1b0 [kvm]
kvm_destroy_vm+0x162/0x2d0 [kvm]
kvm_vm_release+0x1d/0x30 [kvm]
__fput+0x82/0x240
task_work_run+0x5b/0x90
exit_to_user_mode_prepare+0xd2/0xe0
syscall_exit_to_user_mode+0x1d/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xae
</TASK>
On bare metal, encountering an impossible gpa in the page fault path is
well and truly impossible, barring CPU bugs, as the CPU will signal #PF
during the gva=>gpa translation (or a similar failure when stuffing a
physical address into e.g. the VMCS/VMCB). But if KVM is running as a VM
itself, the MAXPHYADDR enumerated to KVM may not be the actual MAXPHYADDR
of the underlying hardware, in which case the hardware will not fault on
the illegal-from-KVM's-perspective gpa.
Alternatively, KVM could continue allowing the dodgy behavior and simply
zap the max possible range. But, for hosts with MAXPHYADDR < 52, that's
a (minor) waste of cycles, and more importantly, KVM can't reasonably
support impossible memslots when running on bare metal (or with an
accurate MAXPHYADDR as a VM). Note, limiting the overhead by checking if
KVM is running as a guest is not a safe option as the host isn't required
to announce itself to the guest in any way, e.g. doesn't need to set the
HYPERVISOR CPUID bit.
A second alternative to disallowing the memslot behavior would be to
disallow creating a VM with guest.MAXPHYADDR > host.MAXPHYADDR. That
restriction is undesirable as there are legitimate use cases for doing
so, e.g. using the highest host.MAXPHYADDR out of a pool of heterogeneous
systems so that VMs can be migrated between hosts with different
MAXPHYADDRs without running afoul of the allow_smaller_maxphyaddr mess.
Note that any guest.MAXPHYADDR is valid with shadow paging, and it is
even useful in order to test KVM with MAXPHYADDR=52 (i.e. without
any reserved physical address bits).
The now common kvm_mmu_max_gfn() is inclusive instead of exclusive.
The memslot and TDP MMU code want an exclusive value, but the name
implies the returned value is inclusive, and the MMIO path needs an
inclusive check.
Fixes: faaf05b00aec ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
Fixes: 524a1e4e381f ("KVM: x86/mmu: Don't leak non-leaf SPTEs when zapping all SPTEs")
Cc: stable@vger.kernel.org
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Cc: Ben Gardon <bgardon@google.com>
Cc: David Matlack <dmatlack@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220428233416.2446833-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29 07:34:16 +08:00
|
|
|
* Bound TDP MMU walks at host.MAXPHYADDR. KVM disallows memslots with
|
|
|
|
* a gpa range that would exceed the max gfn, and KVM does not create
|
|
|
|
* MMIO SPTEs for "impossible" gfns, instead sending such accesses down
|
|
|
|
* the slow emulation path every time.
|
2022-02-26 08:15:33 +08:00
|
|
|
*/
|
KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR
Disallow memslots and MMIO SPTEs whose gpa range would exceed the host's
MAXPHYADDR, i.e. don't create SPTEs for gfns that exceed host.MAXPHYADDR.
The TDP MMU bounds its zapping based on host.MAXPHYADDR, and so if the
guest, possibly with help from userspace, manages to coerce KVM into
creating a SPTE for an "impossible" gfn, KVM will leak the associated
shadow pages (page tables):
WARNING: CPU: 10 PID: 1122 at arch/x86/kvm/mmu/tdp_mmu.c:57
kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm]
Modules linked in: kvm_intel kvm irqbypass
CPU: 10 PID: 1122 Comm: set_memory_regi Tainted: G W 5.18.0-rc1+ #293
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm]
Call Trace:
<TASK>
kvm_arch_destroy_vm+0x130/0x1b0 [kvm]
kvm_destroy_vm+0x162/0x2d0 [kvm]
kvm_vm_release+0x1d/0x30 [kvm]
__fput+0x82/0x240
task_work_run+0x5b/0x90
exit_to_user_mode_prepare+0xd2/0xe0
syscall_exit_to_user_mode+0x1d/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xae
</TASK>
On bare metal, encountering an impossible gpa in the page fault path is
well and truly impossible, barring CPU bugs, as the CPU will signal #PF
during the gva=>gpa translation (or a similar failure when stuffing a
physical address into e.g. the VMCS/VMCB). But if KVM is running as a VM
itself, the MAXPHYADDR enumerated to KVM may not be the actual MAXPHYADDR
of the underlying hardware, in which case the hardware will not fault on
the illegal-from-KVM's-perspective gpa.
Alternatively, KVM could continue allowing the dodgy behavior and simply
zap the max possible range. But, for hosts with MAXPHYADDR < 52, that's
a (minor) waste of cycles, and more importantly, KVM can't reasonably
support impossible memslots when running on bare metal (or with an
accurate MAXPHYADDR as a VM). Note, limiting the overhead by checking if
KVM is running as a guest is not a safe option as the host isn't required
to announce itself to the guest in any way, e.g. doesn't need to set the
HYPERVISOR CPUID bit.
A second alternative to disallowing the memslot behavior would be to
disallow creating a VM with guest.MAXPHYADDR > host.MAXPHYADDR. That
restriction is undesirable as there are legitimate use cases for doing
so, e.g. using the highest host.MAXPHYADDR out of a pool of heterogeneous
systems so that VMs can be migrated between hosts with different
MAXPHYADDRs without running afoul of the allow_smaller_maxphyaddr mess.
Note that any guest.MAXPHYADDR is valid with shadow paging, and it is
even useful in order to test KVM with MAXPHYADDR=52 (i.e. without
any reserved physical address bits).
The now common kvm_mmu_max_gfn() is inclusive instead of exclusive.
The memslot and TDP MMU code want an exclusive value, but the name
implies the returned value is inclusive, and the MMIO path needs an
inclusive check.
Fixes: faaf05b00aec ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
Fixes: 524a1e4e381f ("KVM: x86/mmu: Don't leak non-leaf SPTEs when zapping all SPTEs")
Cc: stable@vger.kernel.org
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Cc: Ben Gardon <bgardon@google.com>
Cc: David Matlack <dmatlack@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220428233416.2446833-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29 07:34:16 +08:00
|
|
|
return kvm_mmu_max_gfn() + 1;
|
2022-02-26 08:15:33 +08:00
|
|
|
}
|
|
|
|
|
KVM: x86/mmu: Zap roots in two passes to avoid inducing RCU stalls
When zapping a TDP MMU root, perform the zap in two passes to avoid
zapping an entire top-level SPTE while holding RCU, which can induce RCU
stalls. In the first pass, zap SPTEs at PG_LEVEL_1G, and then
zap top-level entries in the second pass.
With 4-level paging, zapping a PGD that is fully populated with 4kb leaf
SPTEs take up to ~7 or so seconds (time varies based on kernel config,
number of (v)CPUs, etc...). With 5-level paging, that time can balloon
well into hundreds of seconds.
Before remote TLB flushes were omitted, the problem was even worse as
waiting for all active vCPUs to respond to the IPI introduced significant
overhead for VMs with large numbers of vCPUs.
By zapping 1gb SPTEs (both shadow pages and hugepages) in the first pass,
the amount of work that is done without dropping RCU protection is
strictly bounded, with the worst case latency for a single operation
being less than 100ms.
Zapping at 1gb in the first pass is not arbitrary. First and foremost,
KVM relies on being able to zap 1gb shadow pages in a single shot when
when repacing a shadow page with a hugepage. Zapping a 1gb shadow page
that is fully populated with 4kb dirty SPTEs also triggers the worst case
latency due writing back the struct page accessed/dirty bits for each 4kb
page, i.e. the two-pass approach is guaranteed to work so long as KVM can
cleany zap a 1gb shadow page.
rcu: INFO: rcu_sched self-detected stall on CPU
rcu: 52-....: (20999 ticks this GP) idle=7be/1/0x4000000000000000
softirq=15759/15759 fqs=5058
(t=21016 jiffies g=66453 q=238577)
NMI backtrace for cpu 52
Call Trace:
...
mark_page_accessed+0x266/0x2f0
kvm_set_pfn_accessed+0x31/0x40
handle_removed_tdp_mmu_page+0x259/0x2e0
__handle_changed_spte+0x223/0x2c0
handle_removed_tdp_mmu_page+0x1c1/0x2e0
__handle_changed_spte+0x223/0x2c0
handle_removed_tdp_mmu_page+0x1c1/0x2e0
__handle_changed_spte+0x223/0x2c0
zap_gfn_range+0x141/0x3b0
kvm_tdp_mmu_zap_invalidated_roots+0xc8/0x130
kvm_mmu_zap_all_fast+0x121/0x190
kvm_mmu_invalidate_zap_pages_in_memslot+0xe/0x10
kvm_page_track_flush_slot+0x5c/0x80
kvm_arch_flush_shadow_memslot+0xe/0x10
kvm_set_memslot+0x172/0x4e0
__kvm_set_memory_region+0x337/0x590
kvm_vm_ioctl+0x49c/0xf80
Reported-by: David Matlack <dmatlack@google.com>
Cc: Ben Gardon <bgardon@google.com>
Cc: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220226001546.360188-22-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-26 08:15:39 +08:00
|
|
|
static void __tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
|
|
|
|
bool shared, int zap_level)
|
2022-02-26 08:15:33 +08:00
|
|
|
{
|
|
|
|
struct tdp_iter iter;
|
|
|
|
|
KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR
Disallow memslots and MMIO SPTEs whose gpa range would exceed the host's
MAXPHYADDR, i.e. don't create SPTEs for gfns that exceed host.MAXPHYADDR.
The TDP MMU bounds its zapping based on host.MAXPHYADDR, and so if the
guest, possibly with help from userspace, manages to coerce KVM into
creating a SPTE for an "impossible" gfn, KVM will leak the associated
shadow pages (page tables):
WARNING: CPU: 10 PID: 1122 at arch/x86/kvm/mmu/tdp_mmu.c:57
kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm]
Modules linked in: kvm_intel kvm irqbypass
CPU: 10 PID: 1122 Comm: set_memory_regi Tainted: G W 5.18.0-rc1+ #293
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm]
Call Trace:
<TASK>
kvm_arch_destroy_vm+0x130/0x1b0 [kvm]
kvm_destroy_vm+0x162/0x2d0 [kvm]
kvm_vm_release+0x1d/0x30 [kvm]
__fput+0x82/0x240
task_work_run+0x5b/0x90
exit_to_user_mode_prepare+0xd2/0xe0
syscall_exit_to_user_mode+0x1d/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xae
</TASK>
On bare metal, encountering an impossible gpa in the page fault path is
well and truly impossible, barring CPU bugs, as the CPU will signal #PF
during the gva=>gpa translation (or a similar failure when stuffing a
physical address into e.g. the VMCS/VMCB). But if KVM is running as a VM
itself, the MAXPHYADDR enumerated to KVM may not be the actual MAXPHYADDR
of the underlying hardware, in which case the hardware will not fault on
the illegal-from-KVM's-perspective gpa.
Alternatively, KVM could continue allowing the dodgy behavior and simply
zap the max possible range. But, for hosts with MAXPHYADDR < 52, that's
a (minor) waste of cycles, and more importantly, KVM can't reasonably
support impossible memslots when running on bare metal (or with an
accurate MAXPHYADDR as a VM). Note, limiting the overhead by checking if
KVM is running as a guest is not a safe option as the host isn't required
to announce itself to the guest in any way, e.g. doesn't need to set the
HYPERVISOR CPUID bit.
A second alternative to disallowing the memslot behavior would be to
disallow creating a VM with guest.MAXPHYADDR > host.MAXPHYADDR. That
restriction is undesirable as there are legitimate use cases for doing
so, e.g. using the highest host.MAXPHYADDR out of a pool of heterogeneous
systems so that VMs can be migrated between hosts with different
MAXPHYADDRs without running afoul of the allow_smaller_maxphyaddr mess.
Note that any guest.MAXPHYADDR is valid with shadow paging, and it is
even useful in order to test KVM with MAXPHYADDR=52 (i.e. without
any reserved physical address bits).
The now common kvm_mmu_max_gfn() is inclusive instead of exclusive.
The memslot and TDP MMU code want an exclusive value, but the name
implies the returned value is inclusive, and the MMIO path needs an
inclusive check.
Fixes: faaf05b00aec ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
Fixes: 524a1e4e381f ("KVM: x86/mmu: Don't leak non-leaf SPTEs when zapping all SPTEs")
Cc: stable@vger.kernel.org
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Cc: Ben Gardon <bgardon@google.com>
Cc: David Matlack <dmatlack@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220428233416.2446833-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29 07:34:16 +08:00
|
|
|
gfn_t end = tdp_mmu_max_gfn_exclusive();
|
2022-02-26 08:15:33 +08:00
|
|
|
gfn_t start = 0;
|
|
|
|
|
KVM: x86/mmu: Zap roots in two passes to avoid inducing RCU stalls
When zapping a TDP MMU root, perform the zap in two passes to avoid
zapping an entire top-level SPTE while holding RCU, which can induce RCU
stalls. In the first pass, zap SPTEs at PG_LEVEL_1G, and then
zap top-level entries in the second pass.
With 4-level paging, zapping a PGD that is fully populated with 4kb leaf
SPTEs take up to ~7 or so seconds (time varies based on kernel config,
number of (v)CPUs, etc...). With 5-level paging, that time can balloon
well into hundreds of seconds.
Before remote TLB flushes were omitted, the problem was even worse as
waiting for all active vCPUs to respond to the IPI introduced significant
overhead for VMs with large numbers of vCPUs.
By zapping 1gb SPTEs (both shadow pages and hugepages) in the first pass,
the amount of work that is done without dropping RCU protection is
strictly bounded, with the worst case latency for a single operation
being less than 100ms.
Zapping at 1gb in the first pass is not arbitrary. First and foremost,
KVM relies on being able to zap 1gb shadow pages in a single shot when
when repacing a shadow page with a hugepage. Zapping a 1gb shadow page
that is fully populated with 4kb dirty SPTEs also triggers the worst case
latency due writing back the struct page accessed/dirty bits for each 4kb
page, i.e. the two-pass approach is guaranteed to work so long as KVM can
cleany zap a 1gb shadow page.
rcu: INFO: rcu_sched self-detected stall on CPU
rcu: 52-....: (20999 ticks this GP) idle=7be/1/0x4000000000000000
softirq=15759/15759 fqs=5058
(t=21016 jiffies g=66453 q=238577)
NMI backtrace for cpu 52
Call Trace:
...
mark_page_accessed+0x266/0x2f0
kvm_set_pfn_accessed+0x31/0x40
handle_removed_tdp_mmu_page+0x259/0x2e0
__handle_changed_spte+0x223/0x2c0
handle_removed_tdp_mmu_page+0x1c1/0x2e0
__handle_changed_spte+0x223/0x2c0
handle_removed_tdp_mmu_page+0x1c1/0x2e0
__handle_changed_spte+0x223/0x2c0
zap_gfn_range+0x141/0x3b0
kvm_tdp_mmu_zap_invalidated_roots+0xc8/0x130
kvm_mmu_zap_all_fast+0x121/0x190
kvm_mmu_invalidate_zap_pages_in_memslot+0xe/0x10
kvm_page_track_flush_slot+0x5c/0x80
kvm_arch_flush_shadow_memslot+0xe/0x10
kvm_set_memslot+0x172/0x4e0
__kvm_set_memory_region+0x337/0x590
kvm_vm_ioctl+0x49c/0xf80
Reported-by: David Matlack <dmatlack@google.com>
Cc: Ben Gardon <bgardon@google.com>
Cc: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220226001546.360188-22-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-26 08:15:39 +08:00
|
|
|
for_each_tdp_pte_min_level(iter, root, zap_level, start, end) {
|
|
|
|
retry:
|
|
|
|
if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (!is_shadow_present_pte(iter.old_spte))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (iter.level > zap_level)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (!shared)
|
2023-03-22 06:00:19 +08:00
|
|
|
tdp_mmu_iter_set_spte(kvm, &iter, 0);
|
KVM: x86/mmu: Zap roots in two passes to avoid inducing RCU stalls
When zapping a TDP MMU root, perform the zap in two passes to avoid
zapping an entire top-level SPTE while holding RCU, which can induce RCU
stalls. In the first pass, zap SPTEs at PG_LEVEL_1G, and then
zap top-level entries in the second pass.
With 4-level paging, zapping a PGD that is fully populated with 4kb leaf
SPTEs take up to ~7 or so seconds (time varies based on kernel config,
number of (v)CPUs, etc...). With 5-level paging, that time can balloon
well into hundreds of seconds.
Before remote TLB flushes were omitted, the problem was even worse as
waiting for all active vCPUs to respond to the IPI introduced significant
overhead for VMs with large numbers of vCPUs.
By zapping 1gb SPTEs (both shadow pages and hugepages) in the first pass,
the amount of work that is done without dropping RCU protection is
strictly bounded, with the worst case latency for a single operation
being less than 100ms.
Zapping at 1gb in the first pass is not arbitrary. First and foremost,
KVM relies on being able to zap 1gb shadow pages in a single shot when
when repacing a shadow page with a hugepage. Zapping a 1gb shadow page
that is fully populated with 4kb dirty SPTEs also triggers the worst case
latency due writing back the struct page accessed/dirty bits for each 4kb
page, i.e. the two-pass approach is guaranteed to work so long as KVM can
cleany zap a 1gb shadow page.
rcu: INFO: rcu_sched self-detected stall on CPU
rcu: 52-....: (20999 ticks this GP) idle=7be/1/0x4000000000000000
softirq=15759/15759 fqs=5058
(t=21016 jiffies g=66453 q=238577)
NMI backtrace for cpu 52
Call Trace:
...
mark_page_accessed+0x266/0x2f0
kvm_set_pfn_accessed+0x31/0x40
handle_removed_tdp_mmu_page+0x259/0x2e0
__handle_changed_spte+0x223/0x2c0
handle_removed_tdp_mmu_page+0x1c1/0x2e0
__handle_changed_spte+0x223/0x2c0
handle_removed_tdp_mmu_page+0x1c1/0x2e0
__handle_changed_spte+0x223/0x2c0
zap_gfn_range+0x141/0x3b0
kvm_tdp_mmu_zap_invalidated_roots+0xc8/0x130
kvm_mmu_zap_all_fast+0x121/0x190
kvm_mmu_invalidate_zap_pages_in_memslot+0xe/0x10
kvm_page_track_flush_slot+0x5c/0x80
kvm_arch_flush_shadow_memslot+0xe/0x10
kvm_set_memslot+0x172/0x4e0
__kvm_set_memory_region+0x337/0x590
kvm_vm_ioctl+0x49c/0xf80
Reported-by: David Matlack <dmatlack@google.com>
Cc: Ben Gardon <bgardon@google.com>
Cc: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220226001546.360188-22-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-26 08:15:39 +08:00
|
|
|
else if (tdp_mmu_set_spte_atomic(kvm, &iter, 0))
|
|
|
|
goto retry;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
|
|
|
|
bool shared)
|
|
|
|
{
|
|
|
|
|
KVM: x86/mmu: Allow yielding when zapping GFNs for defunct TDP MMU root
Allow yielding when zapping SPTEs after the last reference to a valid
root is put. Because KVM must drop all SPTEs in response to relevant
mmu_notifier events, mark defunct roots invalid and reset their refcount
prior to zapping the root. Keeping the refcount elevated while the zap
is in-progress ensures the root is reachable via mmu_notifier until the
zap completes and the last reference to the invalid, defunct root is put.
Allowing kvm_tdp_mmu_put_root() to yield fixes soft lockup issues if the
root in being put has a massive paging structure, e.g. zapping a root
that is backed entirely by 4kb pages for a guest with 32tb of memory can
take hundreds of seconds to complete.
watchdog: BUG: soft lockup - CPU#49 stuck for 485s! [max_guest_memor:52368]
RIP: 0010:kvm_set_pfn_dirty+0x30/0x50 [kvm]
__handle_changed_spte+0x1b2/0x2f0 [kvm]
handle_removed_tdp_mmu_page+0x1a7/0x2b8 [kvm]
__handle_changed_spte+0x1f4/0x2f0 [kvm]
handle_removed_tdp_mmu_page+0x1a7/0x2b8 [kvm]
__handle_changed_spte+0x1f4/0x2f0 [kvm]
tdp_mmu_zap_root+0x307/0x4d0 [kvm]
kvm_tdp_mmu_put_root+0x7c/0xc0 [kvm]
kvm_mmu_free_roots+0x22d/0x350 [kvm]
kvm_mmu_reset_context+0x20/0x60 [kvm]
kvm_arch_vcpu_ioctl_set_sregs+0x5a/0xc0 [kvm]
kvm_vcpu_ioctl+0x5bd/0x710 [kvm]
__se_sys_ioctl+0x77/0xc0
__x64_sys_ioctl+0x1d/0x20
do_syscall_64+0x44/0xa0
entry_SYSCALL_64_after_hwframe+0x44/0xae
KVM currently doesn't put a root from a non-preemptible context, so other
than the mmu_notifier wrinkle, yielding when putting a root is safe.
Yield-unfriendly iteration uses for_each_tdp_mmu_root(), which doesn't
take a reference to each root (it requires mmu_lock be held for the
entire duration of the walk).
tdp_mmu_next_root() is used only by the yield-friendly iterator.
tdp_mmu_zap_root_work() is explicitly yield friendly.
kvm_mmu_free_roots() => mmu_free_root_page() is a much bigger fan-out,
but is still yield-friendly in all call sites, as all callers can be
traced back to some combination of vcpu_run(), kvm_destroy_vm(), and/or
kvm_create_vm().
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220226001546.360188-21-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-03-03 14:50:21 +08:00
|
|
|
/*
|
|
|
|
* The root must have an elevated refcount so that it's reachable via
|
|
|
|
* mmu_notifier callbacks, which allows this path to yield and drop
|
|
|
|
* mmu_lock. When handling an unmap/release mmu_notifier command, KVM
|
|
|
|
* must drop all references to relevant pages prior to completing the
|
|
|
|
* callback. Dropping mmu_lock with an unreachable root would result
|
|
|
|
* in zapping SPTEs after a relevant mmu_notifier callback completes
|
|
|
|
* and lead to use-after-free as zapping a SPTE triggers "writeback" of
|
|
|
|
* dirty accessed bits to the SPTE's associated struct page.
|
|
|
|
*/
|
|
|
|
WARN_ON_ONCE(!refcount_read(&root->tdp_mmu_root_count));
|
|
|
|
|
2022-02-26 08:15:33 +08:00
|
|
|
kvm_lockdep_assert_mmu_lock_held(kvm, shared);
|
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
|
|
|
|
/*
|
KVM: x86/mmu: Zap roots in two passes to avoid inducing RCU stalls
When zapping a TDP MMU root, perform the zap in two passes to avoid
zapping an entire top-level SPTE while holding RCU, which can induce RCU
stalls. In the first pass, zap SPTEs at PG_LEVEL_1G, and then
zap top-level entries in the second pass.
With 4-level paging, zapping a PGD that is fully populated with 4kb leaf
SPTEs take up to ~7 or so seconds (time varies based on kernel config,
number of (v)CPUs, etc...). With 5-level paging, that time can balloon
well into hundreds of seconds.
Before remote TLB flushes were omitted, the problem was even worse as
waiting for all active vCPUs to respond to the IPI introduced significant
overhead for VMs with large numbers of vCPUs.
By zapping 1gb SPTEs (both shadow pages and hugepages) in the first pass,
the amount of work that is done without dropping RCU protection is
strictly bounded, with the worst case latency for a single operation
being less than 100ms.
Zapping at 1gb in the first pass is not arbitrary. First and foremost,
KVM relies on being able to zap 1gb shadow pages in a single shot when
when repacing a shadow page with a hugepage. Zapping a 1gb shadow page
that is fully populated with 4kb dirty SPTEs also triggers the worst case
latency due writing back the struct page accessed/dirty bits for each 4kb
page, i.e. the two-pass approach is guaranteed to work so long as KVM can
cleany zap a 1gb shadow page.
rcu: INFO: rcu_sched self-detected stall on CPU
rcu: 52-....: (20999 ticks this GP) idle=7be/1/0x4000000000000000
softirq=15759/15759 fqs=5058
(t=21016 jiffies g=66453 q=238577)
NMI backtrace for cpu 52
Call Trace:
...
mark_page_accessed+0x266/0x2f0
kvm_set_pfn_accessed+0x31/0x40
handle_removed_tdp_mmu_page+0x259/0x2e0
__handle_changed_spte+0x223/0x2c0
handle_removed_tdp_mmu_page+0x1c1/0x2e0
__handle_changed_spte+0x223/0x2c0
handle_removed_tdp_mmu_page+0x1c1/0x2e0
__handle_changed_spte+0x223/0x2c0
zap_gfn_range+0x141/0x3b0
kvm_tdp_mmu_zap_invalidated_roots+0xc8/0x130
kvm_mmu_zap_all_fast+0x121/0x190
kvm_mmu_invalidate_zap_pages_in_memslot+0xe/0x10
kvm_page_track_flush_slot+0x5c/0x80
kvm_arch_flush_shadow_memslot+0xe/0x10
kvm_set_memslot+0x172/0x4e0
__kvm_set_memory_region+0x337/0x590
kvm_vm_ioctl+0x49c/0xf80
Reported-by: David Matlack <dmatlack@google.com>
Cc: Ben Gardon <bgardon@google.com>
Cc: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220226001546.360188-22-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-26 08:15:39 +08:00
|
|
|
* To avoid RCU stalls due to recursively removing huge swaths of SPs,
|
|
|
|
* split the zap into two passes. On the first pass, zap at the 1gb
|
|
|
|
* level, and then zap top-level SPs on the second pass. "1gb" is not
|
|
|
|
* arbitrary, as KVM must be able to zap a 1gb shadow page without
|
|
|
|
* inducing a stall to allow in-place replacement with a 1gb hugepage.
|
|
|
|
*
|
|
|
|
* Because zapping a SP recurses on its children, stepping down to
|
|
|
|
* PG_LEVEL_4K in the iterator itself is unnecessary.
|
2022-02-26 08:15:33 +08:00
|
|
|
*/
|
KVM: x86/mmu: Zap roots in two passes to avoid inducing RCU stalls
When zapping a TDP MMU root, perform the zap in two passes to avoid
zapping an entire top-level SPTE while holding RCU, which can induce RCU
stalls. In the first pass, zap SPTEs at PG_LEVEL_1G, and then
zap top-level entries in the second pass.
With 4-level paging, zapping a PGD that is fully populated with 4kb leaf
SPTEs take up to ~7 or so seconds (time varies based on kernel config,
number of (v)CPUs, etc...). With 5-level paging, that time can balloon
well into hundreds of seconds.
Before remote TLB flushes were omitted, the problem was even worse as
waiting for all active vCPUs to respond to the IPI introduced significant
overhead for VMs with large numbers of vCPUs.
By zapping 1gb SPTEs (both shadow pages and hugepages) in the first pass,
the amount of work that is done without dropping RCU protection is
strictly bounded, with the worst case latency for a single operation
being less than 100ms.
Zapping at 1gb in the first pass is not arbitrary. First and foremost,
KVM relies on being able to zap 1gb shadow pages in a single shot when
when repacing a shadow page with a hugepage. Zapping a 1gb shadow page
that is fully populated with 4kb dirty SPTEs also triggers the worst case
latency due writing back the struct page accessed/dirty bits for each 4kb
page, i.e. the two-pass approach is guaranteed to work so long as KVM can
cleany zap a 1gb shadow page.
rcu: INFO: rcu_sched self-detected stall on CPU
rcu: 52-....: (20999 ticks this GP) idle=7be/1/0x4000000000000000
softirq=15759/15759 fqs=5058
(t=21016 jiffies g=66453 q=238577)
NMI backtrace for cpu 52
Call Trace:
...
mark_page_accessed+0x266/0x2f0
kvm_set_pfn_accessed+0x31/0x40
handle_removed_tdp_mmu_page+0x259/0x2e0
__handle_changed_spte+0x223/0x2c0
handle_removed_tdp_mmu_page+0x1c1/0x2e0
__handle_changed_spte+0x223/0x2c0
handle_removed_tdp_mmu_page+0x1c1/0x2e0
__handle_changed_spte+0x223/0x2c0
zap_gfn_range+0x141/0x3b0
kvm_tdp_mmu_zap_invalidated_roots+0xc8/0x130
kvm_mmu_zap_all_fast+0x121/0x190
kvm_mmu_invalidate_zap_pages_in_memslot+0xe/0x10
kvm_page_track_flush_slot+0x5c/0x80
kvm_arch_flush_shadow_memslot+0xe/0x10
kvm_set_memslot+0x172/0x4e0
__kvm_set_memory_region+0x337/0x590
kvm_vm_ioctl+0x49c/0xf80
Reported-by: David Matlack <dmatlack@google.com>
Cc: Ben Gardon <bgardon@google.com>
Cc: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220226001546.360188-22-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-26 08:15:39 +08:00
|
|
|
__tdp_mmu_zap_root(kvm, root, shared, PG_LEVEL_1G);
|
|
|
|
__tdp_mmu_zap_root(kvm, root, shared, root->role.level);
|
2022-02-26 08:15:33 +08:00
|
|
|
|
|
|
|
rcu_read_unlock();
|
|
|
|
}
|
|
|
|
|
2022-02-26 08:15:31 +08:00
|
|
|
bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
|
|
|
|
{
|
|
|
|
u64 old_spte;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This helper intentionally doesn't allow zapping a root shadow page,
|
|
|
|
* which doesn't have a parent page table and thus no associated entry.
|
|
|
|
*/
|
|
|
|
if (WARN_ON_ONCE(!sp->ptep))
|
|
|
|
return false;
|
|
|
|
|
|
|
|
old_spte = kvm_tdp_mmu_read_spte(sp->ptep);
|
2022-02-26 08:15:37 +08:00
|
|
|
if (WARN_ON_ONCE(!is_shadow_present_pte(old_spte)))
|
2022-02-26 08:15:31 +08:00
|
|
|
return false;
|
|
|
|
|
2023-03-22 06:00:19 +08:00
|
|
|
tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte, 0,
|
|
|
|
sp->gfn, sp->role.level + 1);
|
2022-02-26 08:15:31 +08:00
|
|
|
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
2020-10-15 02:26:47 +08:00
|
|
|
/*
|
2020-10-15 02:26:52 +08:00
|
|
|
* If can_yield is true, will release the MMU lock and reschedule if the
|
|
|
|
* scheduler needs the CPU or there is contention on the MMU lock. If this
|
|
|
|
* function cannot yield, it will not release the MMU lock or reschedule and
|
|
|
|
* the caller must ensure it does not supply too large a GFN range, or the
|
2021-04-02 07:37:32 +08:00
|
|
|
* operation can cause a soft lockup.
|
2020-10-15 02:26:47 +08:00
|
|
|
*/
|
KVM: x86/mmu: Zap only TDP MMU leafs in zap range and mmu_notifier unmap
Re-introduce zapping only leaf SPTEs in kvm_zap_gfn_range() and
kvm_tdp_mmu_unmap_gfn_range(), this time without losing a pending TLB
flush when processing multiple roots (including nested TDP shadow roots).
Dropping the TLB flush resulted in random crashes when running Hyper-V
Server 2019 in a guest with KSM enabled in the host (or any source of
mmu_notifier invalidations, KSM is just the easiest to force).
This effectively revert commits 873dd122172f8cce329113cfb0dfe3d2344d80c0
and fcb93eb6d09dd302cbef22bd95a5858af75e4156, and thus restores commit
cf3e26427c08ad9015956293ab389004ac6a338e, plus this delta on top:
bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end,
struct kvm_mmu_page *root;
for_each_tdp_mmu_root_yield_safe(kvm, root, as_id)
- flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, false);
+ flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, flush);
return flush;
}
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20220325230348.2587437-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-03-26 07:03:48 +08:00
|
|
|
static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
|
|
|
|
gfn_t start, gfn_t end, bool can_yield, bool flush)
|
2020-10-15 02:26:47 +08:00
|
|
|
{
|
|
|
|
struct tdp_iter iter;
|
|
|
|
|
KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR
Disallow memslots and MMIO SPTEs whose gpa range would exceed the host's
MAXPHYADDR, i.e. don't create SPTEs for gfns that exceed host.MAXPHYADDR.
The TDP MMU bounds its zapping based on host.MAXPHYADDR, and so if the
guest, possibly with help from userspace, manages to coerce KVM into
creating a SPTE for an "impossible" gfn, KVM will leak the associated
shadow pages (page tables):
WARNING: CPU: 10 PID: 1122 at arch/x86/kvm/mmu/tdp_mmu.c:57
kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm]
Modules linked in: kvm_intel kvm irqbypass
CPU: 10 PID: 1122 Comm: set_memory_regi Tainted: G W 5.18.0-rc1+ #293
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm]
Call Trace:
<TASK>
kvm_arch_destroy_vm+0x130/0x1b0 [kvm]
kvm_destroy_vm+0x162/0x2d0 [kvm]
kvm_vm_release+0x1d/0x30 [kvm]
__fput+0x82/0x240
task_work_run+0x5b/0x90
exit_to_user_mode_prepare+0xd2/0xe0
syscall_exit_to_user_mode+0x1d/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xae
</TASK>
On bare metal, encountering an impossible gpa in the page fault path is
well and truly impossible, barring CPU bugs, as the CPU will signal #PF
during the gva=>gpa translation (or a similar failure when stuffing a
physical address into e.g. the VMCS/VMCB). But if KVM is running as a VM
itself, the MAXPHYADDR enumerated to KVM may not be the actual MAXPHYADDR
of the underlying hardware, in which case the hardware will not fault on
the illegal-from-KVM's-perspective gpa.
Alternatively, KVM could continue allowing the dodgy behavior and simply
zap the max possible range. But, for hosts with MAXPHYADDR < 52, that's
a (minor) waste of cycles, and more importantly, KVM can't reasonably
support impossible memslots when running on bare metal (or with an
accurate MAXPHYADDR as a VM). Note, limiting the overhead by checking if
KVM is running as a guest is not a safe option as the host isn't required
to announce itself to the guest in any way, e.g. doesn't need to set the
HYPERVISOR CPUID bit.
A second alternative to disallowing the memslot behavior would be to
disallow creating a VM with guest.MAXPHYADDR > host.MAXPHYADDR. That
restriction is undesirable as there are legitimate use cases for doing
so, e.g. using the highest host.MAXPHYADDR out of a pool of heterogeneous
systems so that VMs can be migrated between hosts with different
MAXPHYADDRs without running afoul of the allow_smaller_maxphyaddr mess.
Note that any guest.MAXPHYADDR is valid with shadow paging, and it is
even useful in order to test KVM with MAXPHYADDR=52 (i.e. without
any reserved physical address bits).
The now common kvm_mmu_max_gfn() is inclusive instead of exclusive.
The memslot and TDP MMU code want an exclusive value, but the name
implies the returned value is inclusive, and the MMIO path needs an
inclusive check.
Fixes: faaf05b00aec ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
Fixes: 524a1e4e381f ("KVM: x86/mmu: Don't leak non-leaf SPTEs when zapping all SPTEs")
Cc: stable@vger.kernel.org
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Cc: Ben Gardon <bgardon@google.com>
Cc: David Matlack <dmatlack@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220428233416.2446833-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29 07:34:16 +08:00
|
|
|
end = min(end, tdp_mmu_max_gfn_exclusive());
|
2021-08-13 02:14:13 +08:00
|
|
|
|
2022-02-26 08:15:34 +08:00
|
|
|
lockdep_assert_held_write(&kvm->mmu_lock);
|
2021-04-02 07:37:32 +08:00
|
|
|
|
2021-02-03 02:57:23 +08:00
|
|
|
rcu_read_lock();
|
|
|
|
|
KVM: x86/mmu: Zap only TDP MMU leafs in zap range and mmu_notifier unmap
Re-introduce zapping only leaf SPTEs in kvm_zap_gfn_range() and
kvm_tdp_mmu_unmap_gfn_range(), this time without losing a pending TLB
flush when processing multiple roots (including nested TDP shadow roots).
Dropping the TLB flush resulted in random crashes when running Hyper-V
Server 2019 in a guest with KSM enabled in the host (or any source of
mmu_notifier invalidations, KSM is just the easiest to force).
This effectively revert commits 873dd122172f8cce329113cfb0dfe3d2344d80c0
and fcb93eb6d09dd302cbef22bd95a5858af75e4156, and thus restores commit
cf3e26427c08ad9015956293ab389004ac6a338e, plus this delta on top:
bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end,
struct kvm_mmu_page *root;
for_each_tdp_mmu_root_yield_safe(kvm, root, as_id)
- flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, false);
+ flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, flush);
return flush;
}
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20220325230348.2587437-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-03-26 07:03:48 +08:00
|
|
|
for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end) {
|
2021-02-03 02:57:20 +08:00
|
|
|
if (can_yield &&
|
2022-02-26 08:15:34 +08:00
|
|
|
tdp_mmu_iter_cond_resched(kvm, &iter, flush, false)) {
|
2021-03-26 04:01:17 +08:00
|
|
|
flush = false;
|
2021-02-03 02:57:20 +08:00
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
KVM: x86/mmu: Zap only TDP MMU leafs in zap range and mmu_notifier unmap
Re-introduce zapping only leaf SPTEs in kvm_zap_gfn_range() and
kvm_tdp_mmu_unmap_gfn_range(), this time without losing a pending TLB
flush when processing multiple roots (including nested TDP shadow roots).
Dropping the TLB flush resulted in random crashes when running Hyper-V
Server 2019 in a guest with KSM enabled in the host (or any source of
mmu_notifier invalidations, KSM is just the easiest to force).
This effectively revert commits 873dd122172f8cce329113cfb0dfe3d2344d80c0
and fcb93eb6d09dd302cbef22bd95a5858af75e4156, and thus restores commit
cf3e26427c08ad9015956293ab389004ac6a338e, plus this delta on top:
bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end,
struct kvm_mmu_page *root;
for_each_tdp_mmu_root_yield_safe(kvm, root, as_id)
- flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, false);
+ flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, flush);
return flush;
}
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20220325230348.2587437-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-03-26 07:03:48 +08:00
|
|
|
if (!is_shadow_present_pte(iter.old_spte) ||
|
2020-10-15 02:26:47 +08:00
|
|
|
!is_last_spte(iter.old_spte, iter.level))
|
|
|
|
continue;
|
|
|
|
|
2023-03-22 06:00:19 +08:00
|
|
|
tdp_mmu_iter_set_spte(kvm, &iter, 0);
|
2022-02-26 08:15:34 +08:00
|
|
|
flush = true;
|
2020-10-15 02:26:47 +08:00
|
|
|
}
|
2021-02-03 02:57:23 +08:00
|
|
|
|
2022-03-21 17:05:08 +08:00
|
|
|
rcu_read_unlock();
|
|
|
|
|
KVM: x86/mmu: Zap only TDP MMU leafs in zap range and mmu_notifier unmap
Re-introduce zapping only leaf SPTEs in kvm_zap_gfn_range() and
kvm_tdp_mmu_unmap_gfn_range(), this time without losing a pending TLB
flush when processing multiple roots (including nested TDP shadow roots).
Dropping the TLB flush resulted in random crashes when running Hyper-V
Server 2019 in a guest with KSM enabled in the host (or any source of
mmu_notifier invalidations, KSM is just the easiest to force).
This effectively revert commits 873dd122172f8cce329113cfb0dfe3d2344d80c0
and fcb93eb6d09dd302cbef22bd95a5858af75e4156, and thus restores commit
cf3e26427c08ad9015956293ab389004ac6a338e, plus this delta on top:
bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end,
struct kvm_mmu_page *root;
for_each_tdp_mmu_root_yield_safe(kvm, root, as_id)
- flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, false);
+ flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, flush);
return flush;
}
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20220325230348.2587437-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-03-26 07:03:48 +08:00
|
|
|
/*
|
|
|
|
* Because this flow zaps _only_ leaf SPTEs, the caller doesn't need
|
|
|
|
* to provide RCU protection as no 'struct kvm_mmu_page' will be freed.
|
|
|
|
*/
|
|
|
|
return flush;
|
2020-10-15 02:26:47 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2022-07-28 11:04:52 +08:00
|
|
|
* Zap leaf SPTEs for the range of gfns, [start, end), for all roots. Returns
|
|
|
|
* true if a TLB flush is needed before releasing the MMU lock, i.e. if one or
|
|
|
|
* more SPTEs were zapped since the MMU lock was last acquired.
|
2020-10-15 02:26:47 +08:00
|
|
|
*/
|
KVM: x86/mmu: Zap only TDP MMU leafs in zap range and mmu_notifier unmap
Re-introduce zapping only leaf SPTEs in kvm_zap_gfn_range() and
kvm_tdp_mmu_unmap_gfn_range(), this time without losing a pending TLB
flush when processing multiple roots (including nested TDP shadow roots).
Dropping the TLB flush resulted in random crashes when running Hyper-V
Server 2019 in a guest with KSM enabled in the host (or any source of
mmu_notifier invalidations, KSM is just the easiest to force).
This effectively revert commits 873dd122172f8cce329113cfb0dfe3d2344d80c0
and fcb93eb6d09dd302cbef22bd95a5858af75e4156, and thus restores commit
cf3e26427c08ad9015956293ab389004ac6a338e, plus this delta on top:
bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end,
struct kvm_mmu_page *root;
for_each_tdp_mmu_root_yield_safe(kvm, root, as_id)
- flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, false);
+ flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, flush);
return flush;
}
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20220325230348.2587437-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-03-26 07:03:48 +08:00
|
|
|
bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end,
|
|
|
|
bool can_yield, bool flush)
|
2020-10-15 02:26:47 +08:00
|
|
|
{
|
|
|
|
struct kvm_mmu_page *root;
|
|
|
|
|
2022-03-02 21:51:05 +08:00
|
|
|
for_each_tdp_mmu_root_yield_safe(kvm, root, as_id)
|
KVM: x86/mmu: Zap only TDP MMU leafs in zap range and mmu_notifier unmap
Re-introduce zapping only leaf SPTEs in kvm_zap_gfn_range() and
kvm_tdp_mmu_unmap_gfn_range(), this time without losing a pending TLB
flush when processing multiple roots (including nested TDP shadow roots).
Dropping the TLB flush resulted in random crashes when running Hyper-V
Server 2019 in a guest with KSM enabled in the host (or any source of
mmu_notifier invalidations, KSM is just the easiest to force).
This effectively revert commits 873dd122172f8cce329113cfb0dfe3d2344d80c0
and fcb93eb6d09dd302cbef22bd95a5858af75e4156, and thus restores commit
cf3e26427c08ad9015956293ab389004ac6a338e, plus this delta on top:
bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end,
struct kvm_mmu_page *root;
for_each_tdp_mmu_root_yield_safe(kvm, root, as_id)
- flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, false);
+ flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, flush);
return flush;
}
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20220325230348.2587437-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-03-26 07:03:48 +08:00
|
|
|
flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, flush);
|
2020-10-15 02:26:47 +08:00
|
|
|
|
|
|
|
return flush;
|
|
|
|
}
|
|
|
|
|
|
|
|
void kvm_tdp_mmu_zap_all(struct kvm *kvm)
|
|
|
|
{
|
2022-02-26 08:15:33 +08:00
|
|
|
struct kvm_mmu_page *root;
|
2021-03-26 10:19:44 +08:00
|
|
|
int i;
|
|
|
|
|
2022-02-26 08:15:32 +08:00
|
|
|
/*
|
2022-03-03 01:02:07 +08:00
|
|
|
* Zap all roots, including invalid roots, as all SPTEs must be dropped
|
|
|
|
* before returning to the caller. Zap directly even if the root is
|
|
|
|
* also being zapped by a worker. Walking zapped top-level SPTEs isn't
|
|
|
|
* all that expensive and mmu_lock is already held, which means the
|
|
|
|
* worker has yielded, i.e. flushing the work instead of zapping here
|
|
|
|
* isn't guaranteed to be any faster.
|
|
|
|
*
|
2022-02-26 08:15:32 +08:00
|
|
|
* A TLB flush is unnecessary, KVM zaps everything if and only the VM
|
|
|
|
* is being destroyed or the userspace VMM has exited. In both cases,
|
|
|
|
* KVM_RUN is unreachable, i.e. no vCPUs will ever service the request.
|
|
|
|
*/
|
2022-02-26 08:15:33 +08:00
|
|
|
for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
|
|
|
|
for_each_tdp_mmu_root_yield_safe(kvm, root, i)
|
|
|
|
tdp_mmu_zap_root(kvm, root, false);
|
|
|
|
}
|
2020-10-15 02:26:47 +08:00
|
|
|
}
|
2020-10-15 02:26:50 +08:00
|
|
|
|
2021-04-02 07:37:36 +08:00
|
|
|
/*
|
2022-02-26 08:15:21 +08:00
|
|
|
* Zap all invalidated roots to ensure all SPTEs are dropped before the "fast
|
2022-03-03 01:02:07 +08:00
|
|
|
* zap" completes.
|
2021-04-02 07:37:36 +08:00
|
|
|
*/
|
|
|
|
void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
|
|
|
|
{
|
2022-03-03 01:02:07 +08:00
|
|
|
flush_workqueue(kvm->arch.tdp_mmu_zap_wq);
|
2020-10-15 02:26:47 +08:00
|
|
|
}
|
2020-10-15 02:26:50 +08:00
|
|
|
|
2021-04-02 07:37:35 +08:00
|
|
|
/*
|
2022-02-26 08:15:21 +08:00
|
|
|
* Mark each TDP MMU root as invalid to prevent vCPUs from reusing a root that
|
2022-03-03 01:02:07 +08:00
|
|
|
* is about to be zapped, e.g. in response to a memslots update. The actual
|
KVM: x86: Preserve TDP MMU roots until they are explicitly invalidated
Preserve TDP MMU roots until they are explicitly invalidated by gifting
the TDP MMU itself a reference to a root when it is allocated. Keeping a
reference in the TDP MMU fixes a flaw where the TDP MMU exhibits terrible
performance, and can potentially even soft-hang a vCPU, if a vCPU
frequently unloads its roots, e.g. when KVM is emulating SMI+RSM.
When KVM emulates something that invalidates _all_ TLB entries, e.g. SMI
and RSM, KVM unloads all of the vCPUs roots (KVM keeps a small per-vCPU
cache of previous roots). Unloading roots is a simple way to ensure KVM
flushes and synchronizes all roots for the vCPU, as KVM flushes and syncs
when allocating a "new" root (from the vCPU's perspective).
In the shadow MMU, KVM keeps track of all shadow pages, roots included, in
a per-VM hash table. Unloading a shadow MMU root just wipes it from the
per-vCPU cache; the root is still tracked in the per-VM hash table. When
KVM loads a "new" root for the vCPU, KVM will find the old, unloaded root
in the per-VM hash table.
Unlike the shadow MMU, the TDP MMU doesn't track "inactive" roots in a
per-VM structure, where "active" in this case means a root is either
in-use or cached as a previous root by at least one vCPU. When a TDP MMU
root becomes inactive, i.e. the last vCPU reference to the root is put,
KVM immediately frees the root (asterisk on "immediately" as the actual
freeing may be done by a worker, but for all intents and purposes the root
is gone).
The TDP MMU behavior is especially problematic for 1-vCPU setups, as
unloading all roots effectively frees all roots. The issue is mitigated
to some degree in multi-vCPU setups as a different vCPU usually holds a
reference to an unloaded root and thus keeps the root alive, allowing the
vCPU to reuse its old root after unloading (with a flush+sync).
The TDP MMU flaw has been known for some time, as until very recently,
KVM's handling of CR0.WP also triggered unloading of all roots. The
CR0.WP toggling scenario was eventually addressed by not unloading roots
when _only_ CR0.WP is toggled, but such an approach doesn't Just Work
for emulating SMM as KVM must emulate a full TLB flush on entry and exit
to/from SMM. Given that the shadow MMU plays nice with unloading roots
at will, teaching the TDP MMU to do the same is far less complex than
modifying KVM to track which roots need to be flushed before reuse.
Note, preserving all possible TDP MMU roots is not a concern with respect
to memory consumption. Now that the role for direct MMUs doesn't include
information about the guest, e.g. CR0.PG, CR0.WP, CR4.SMEP, etc., there
are _at most_ six possible roots (where "guest_mode" here means L2):
1. 4-level !SMM !guest_mode
2. 4-level SMM !guest_mode
3. 5-level !SMM !guest_mode
4. 5-level SMM !guest_mode
5. 4-level !SMM guest_mode
6. 5-level !SMM guest_mode
And because each vCPU can track 4 valid roots, a VM can already have all
6 root combinations live at any given time. Not to mention that, in
practice, no sane VMM will advertise different guest.MAXPHYADDR values
across vCPUs, i.e. KVM won't ever use both 4-level and 5-level roots for
a single VM. Furthermore, the vast majority of modern hypervisors will
utilize EPT/NPT when available, thus the guest_mode=%true cases are also
unlikely to be utilized.
Reported-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Link: https://lore.kernel.org/all/959c5bce-beb5-b463-7158-33fc4a4f910c@linux.microsoft.com
Link: https://lkml.kernel.org/r/20220209170020.1775368-1-pbonzini%40redhat.com
Link: https://lore.kernel.org/all/20230322013731.102955-1-minipli@grsecurity.net
Link: https://lore.kernel.org/all/000000000000a0bc2b05f9dd7fab@google.com
Link: https://lore.kernel.org/all/000000000000eca0b905fa0f7756@google.com
Cc: Ben Gardon <bgardon@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: stable@vger.kernel.org
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Link: https://lore.kernel.org/r/20230426220323.3079789-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-27 06:03:23 +08:00
|
|
|
* zapping is performed asynchronously. Using a separate workqueue makes it
|
|
|
|
* easy to ensure that the destruction is performed before the "fast zap"
|
|
|
|
* completes, without keeping a separate list of invalidated roots; the list is
|
|
|
|
* effectively the list of work items in the workqueue.
|
2021-04-02 07:37:36 +08:00
|
|
|
*
|
KVM: x86: Preserve TDP MMU roots until they are explicitly invalidated
Preserve TDP MMU roots until they are explicitly invalidated by gifting
the TDP MMU itself a reference to a root when it is allocated. Keeping a
reference in the TDP MMU fixes a flaw where the TDP MMU exhibits terrible
performance, and can potentially even soft-hang a vCPU, if a vCPU
frequently unloads its roots, e.g. when KVM is emulating SMI+RSM.
When KVM emulates something that invalidates _all_ TLB entries, e.g. SMI
and RSM, KVM unloads all of the vCPUs roots (KVM keeps a small per-vCPU
cache of previous roots). Unloading roots is a simple way to ensure KVM
flushes and synchronizes all roots for the vCPU, as KVM flushes and syncs
when allocating a "new" root (from the vCPU's perspective).
In the shadow MMU, KVM keeps track of all shadow pages, roots included, in
a per-VM hash table. Unloading a shadow MMU root just wipes it from the
per-vCPU cache; the root is still tracked in the per-VM hash table. When
KVM loads a "new" root for the vCPU, KVM will find the old, unloaded root
in the per-VM hash table.
Unlike the shadow MMU, the TDP MMU doesn't track "inactive" roots in a
per-VM structure, where "active" in this case means a root is either
in-use or cached as a previous root by at least one vCPU. When a TDP MMU
root becomes inactive, i.e. the last vCPU reference to the root is put,
KVM immediately frees the root (asterisk on "immediately" as the actual
freeing may be done by a worker, but for all intents and purposes the root
is gone).
The TDP MMU behavior is especially problematic for 1-vCPU setups, as
unloading all roots effectively frees all roots. The issue is mitigated
to some degree in multi-vCPU setups as a different vCPU usually holds a
reference to an unloaded root and thus keeps the root alive, allowing the
vCPU to reuse its old root after unloading (with a flush+sync).
The TDP MMU flaw has been known for some time, as until very recently,
KVM's handling of CR0.WP also triggered unloading of all roots. The
CR0.WP toggling scenario was eventually addressed by not unloading roots
when _only_ CR0.WP is toggled, but such an approach doesn't Just Work
for emulating SMM as KVM must emulate a full TLB flush on entry and exit
to/from SMM. Given that the shadow MMU plays nice with unloading roots
at will, teaching the TDP MMU to do the same is far less complex than
modifying KVM to track which roots need to be flushed before reuse.
Note, preserving all possible TDP MMU roots is not a concern with respect
to memory consumption. Now that the role for direct MMUs doesn't include
information about the guest, e.g. CR0.PG, CR0.WP, CR4.SMEP, etc., there
are _at most_ six possible roots (where "guest_mode" here means L2):
1. 4-level !SMM !guest_mode
2. 4-level SMM !guest_mode
3. 5-level !SMM !guest_mode
4. 5-level SMM !guest_mode
5. 4-level !SMM guest_mode
6. 5-level !SMM guest_mode
And because each vCPU can track 4 valid roots, a VM can already have all
6 root combinations live at any given time. Not to mention that, in
practice, no sane VMM will advertise different guest.MAXPHYADDR values
across vCPUs, i.e. KVM won't ever use both 4-level and 5-level roots for
a single VM. Furthermore, the vast majority of modern hypervisors will
utilize EPT/NPT when available, thus the guest_mode=%true cases are also
unlikely to be utilized.
Reported-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Link: https://lore.kernel.org/all/959c5bce-beb5-b463-7158-33fc4a4f910c@linux.microsoft.com
Link: https://lkml.kernel.org/r/20220209170020.1775368-1-pbonzini%40redhat.com
Link: https://lore.kernel.org/all/20230322013731.102955-1-minipli@grsecurity.net
Link: https://lore.kernel.org/all/000000000000a0bc2b05f9dd7fab@google.com
Link: https://lore.kernel.org/all/000000000000eca0b905fa0f7756@google.com
Cc: Ben Gardon <bgardon@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: stable@vger.kernel.org
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Link: https://lore.kernel.org/r/20230426220323.3079789-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-27 06:03:23 +08:00
|
|
|
* Note, the asynchronous worker is gifted the TDP MMU's reference.
|
|
|
|
* See kvm_tdp_mmu_get_vcpu_root_hpa().
|
2021-04-02 07:37:35 +08:00
|
|
|
*/
|
|
|
|
void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
|
|
|
|
{
|
|
|
|
struct kvm_mmu_page *root;
|
|
|
|
|
KVM: x86: Preserve TDP MMU roots until they are explicitly invalidated
Preserve TDP MMU roots until they are explicitly invalidated by gifting
the TDP MMU itself a reference to a root when it is allocated. Keeping a
reference in the TDP MMU fixes a flaw where the TDP MMU exhibits terrible
performance, and can potentially even soft-hang a vCPU, if a vCPU
frequently unloads its roots, e.g. when KVM is emulating SMI+RSM.
When KVM emulates something that invalidates _all_ TLB entries, e.g. SMI
and RSM, KVM unloads all of the vCPUs roots (KVM keeps a small per-vCPU
cache of previous roots). Unloading roots is a simple way to ensure KVM
flushes and synchronizes all roots for the vCPU, as KVM flushes and syncs
when allocating a "new" root (from the vCPU's perspective).
In the shadow MMU, KVM keeps track of all shadow pages, roots included, in
a per-VM hash table. Unloading a shadow MMU root just wipes it from the
per-vCPU cache; the root is still tracked in the per-VM hash table. When
KVM loads a "new" root for the vCPU, KVM will find the old, unloaded root
in the per-VM hash table.
Unlike the shadow MMU, the TDP MMU doesn't track "inactive" roots in a
per-VM structure, where "active" in this case means a root is either
in-use or cached as a previous root by at least one vCPU. When a TDP MMU
root becomes inactive, i.e. the last vCPU reference to the root is put,
KVM immediately frees the root (asterisk on "immediately" as the actual
freeing may be done by a worker, but for all intents and purposes the root
is gone).
The TDP MMU behavior is especially problematic for 1-vCPU setups, as
unloading all roots effectively frees all roots. The issue is mitigated
to some degree in multi-vCPU setups as a different vCPU usually holds a
reference to an unloaded root and thus keeps the root alive, allowing the
vCPU to reuse its old root after unloading (with a flush+sync).
The TDP MMU flaw has been known for some time, as until very recently,
KVM's handling of CR0.WP also triggered unloading of all roots. The
CR0.WP toggling scenario was eventually addressed by not unloading roots
when _only_ CR0.WP is toggled, but such an approach doesn't Just Work
for emulating SMM as KVM must emulate a full TLB flush on entry and exit
to/from SMM. Given that the shadow MMU plays nice with unloading roots
at will, teaching the TDP MMU to do the same is far less complex than
modifying KVM to track which roots need to be flushed before reuse.
Note, preserving all possible TDP MMU roots is not a concern with respect
to memory consumption. Now that the role for direct MMUs doesn't include
information about the guest, e.g. CR0.PG, CR0.WP, CR4.SMEP, etc., there
are _at most_ six possible roots (where "guest_mode" here means L2):
1. 4-level !SMM !guest_mode
2. 4-level SMM !guest_mode
3. 5-level !SMM !guest_mode
4. 5-level SMM !guest_mode
5. 4-level !SMM guest_mode
6. 5-level !SMM guest_mode
And because each vCPU can track 4 valid roots, a VM can already have all
6 root combinations live at any given time. Not to mention that, in
practice, no sane VMM will advertise different guest.MAXPHYADDR values
across vCPUs, i.e. KVM won't ever use both 4-level and 5-level roots for
a single VM. Furthermore, the vast majority of modern hypervisors will
utilize EPT/NPT when available, thus the guest_mode=%true cases are also
unlikely to be utilized.
Reported-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Link: https://lore.kernel.org/all/959c5bce-beb5-b463-7158-33fc4a4f910c@linux.microsoft.com
Link: https://lkml.kernel.org/r/20220209170020.1775368-1-pbonzini%40redhat.com
Link: https://lore.kernel.org/all/20230322013731.102955-1-minipli@grsecurity.net
Link: https://lore.kernel.org/all/000000000000a0bc2b05f9dd7fab@google.com
Link: https://lore.kernel.org/all/000000000000eca0b905fa0f7756@google.com
Cc: Ben Gardon <bgardon@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: stable@vger.kernel.org
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Link: https://lore.kernel.org/r/20230426220323.3079789-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-27 06:03:23 +08:00
|
|
|
/*
|
|
|
|
* mmu_lock must be held for write to ensure that a root doesn't become
|
|
|
|
* invalid while there are active readers (invalidating a root while
|
|
|
|
* there are active readers may or may not be problematic in practice,
|
|
|
|
* but it's uncharted territory and not supported).
|
|
|
|
*
|
|
|
|
* Waive the assertion if there are no users of @kvm, i.e. the VM is
|
|
|
|
* being destroyed after all references have been put, or if no vCPUs
|
|
|
|
* have been created (which means there are no roots), i.e. the VM is
|
|
|
|
* being destroyed in an error path of KVM_CREATE_VM.
|
|
|
|
*/
|
|
|
|
if (IS_ENABLED(CONFIG_PROVE_LOCKING) &&
|
|
|
|
refcount_read(&kvm->users_count) && kvm->created_vcpus)
|
|
|
|
lockdep_assert_held_write(&kvm->mmu_lock);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* As above, mmu_lock isn't held when destroying the VM! There can't
|
|
|
|
* be other references to @kvm, i.e. nothing else can invalidate roots
|
|
|
|
* or be consuming roots, but walking the list of roots does need to be
|
|
|
|
* guarded against roots being deleted by the asynchronous zap worker.
|
|
|
|
*/
|
|
|
|
rcu_read_lock();
|
|
|
|
|
|
|
|
list_for_each_entry_rcu(root, &kvm->arch.tdp_mmu_roots, link) {
|
|
|
|
if (!root->role.invalid) {
|
2021-04-02 07:37:36 +08:00
|
|
|
root->role.invalid = true;
|
2022-03-03 01:02:07 +08:00
|
|
|
tdp_mmu_schedule_zap_root(kvm, root);
|
|
|
|
}
|
2022-02-26 08:15:21 +08:00
|
|
|
}
|
KVM: x86: Preserve TDP MMU roots until they are explicitly invalidated
Preserve TDP MMU roots until they are explicitly invalidated by gifting
the TDP MMU itself a reference to a root when it is allocated. Keeping a
reference in the TDP MMU fixes a flaw where the TDP MMU exhibits terrible
performance, and can potentially even soft-hang a vCPU, if a vCPU
frequently unloads its roots, e.g. when KVM is emulating SMI+RSM.
When KVM emulates something that invalidates _all_ TLB entries, e.g. SMI
and RSM, KVM unloads all of the vCPUs roots (KVM keeps a small per-vCPU
cache of previous roots). Unloading roots is a simple way to ensure KVM
flushes and synchronizes all roots for the vCPU, as KVM flushes and syncs
when allocating a "new" root (from the vCPU's perspective).
In the shadow MMU, KVM keeps track of all shadow pages, roots included, in
a per-VM hash table. Unloading a shadow MMU root just wipes it from the
per-vCPU cache; the root is still tracked in the per-VM hash table. When
KVM loads a "new" root for the vCPU, KVM will find the old, unloaded root
in the per-VM hash table.
Unlike the shadow MMU, the TDP MMU doesn't track "inactive" roots in a
per-VM structure, where "active" in this case means a root is either
in-use or cached as a previous root by at least one vCPU. When a TDP MMU
root becomes inactive, i.e. the last vCPU reference to the root is put,
KVM immediately frees the root (asterisk on "immediately" as the actual
freeing may be done by a worker, but for all intents and purposes the root
is gone).
The TDP MMU behavior is especially problematic for 1-vCPU setups, as
unloading all roots effectively frees all roots. The issue is mitigated
to some degree in multi-vCPU setups as a different vCPU usually holds a
reference to an unloaded root and thus keeps the root alive, allowing the
vCPU to reuse its old root after unloading (with a flush+sync).
The TDP MMU flaw has been known for some time, as until very recently,
KVM's handling of CR0.WP also triggered unloading of all roots. The
CR0.WP toggling scenario was eventually addressed by not unloading roots
when _only_ CR0.WP is toggled, but such an approach doesn't Just Work
for emulating SMM as KVM must emulate a full TLB flush on entry and exit
to/from SMM. Given that the shadow MMU plays nice with unloading roots
at will, teaching the TDP MMU to do the same is far less complex than
modifying KVM to track which roots need to be flushed before reuse.
Note, preserving all possible TDP MMU roots is not a concern with respect
to memory consumption. Now that the role for direct MMUs doesn't include
information about the guest, e.g. CR0.PG, CR0.WP, CR4.SMEP, etc., there
are _at most_ six possible roots (where "guest_mode" here means L2):
1. 4-level !SMM !guest_mode
2. 4-level SMM !guest_mode
3. 5-level !SMM !guest_mode
4. 5-level SMM !guest_mode
5. 4-level !SMM guest_mode
6. 5-level !SMM guest_mode
And because each vCPU can track 4 valid roots, a VM can already have all
6 root combinations live at any given time. Not to mention that, in
practice, no sane VMM will advertise different guest.MAXPHYADDR values
across vCPUs, i.e. KVM won't ever use both 4-level and 5-level roots for
a single VM. Furthermore, the vast majority of modern hypervisors will
utilize EPT/NPT when available, thus the guest_mode=%true cases are also
unlikely to be utilized.
Reported-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Link: https://lore.kernel.org/all/959c5bce-beb5-b463-7158-33fc4a4f910c@linux.microsoft.com
Link: https://lkml.kernel.org/r/20220209170020.1775368-1-pbonzini%40redhat.com
Link: https://lore.kernel.org/all/20230322013731.102955-1-minipli@grsecurity.net
Link: https://lore.kernel.org/all/000000000000a0bc2b05f9dd7fab@google.com
Link: https://lore.kernel.org/all/000000000000eca0b905fa0f7756@google.com
Cc: Ben Gardon <bgardon@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: stable@vger.kernel.org
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Link: https://lore.kernel.org/r/20230426220323.3079789-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-27 06:03:23 +08:00
|
|
|
|
|
|
|
rcu_read_unlock();
|
2021-04-02 07:37:35 +08:00
|
|
|
}
|
|
|
|
|
2020-10-15 02:26:50 +08:00
|
|
|
/*
|
|
|
|
* Installs a last-level SPTE to handle a TDP page fault.
|
|
|
|
* (NPT/EPT violation/misconfiguration)
|
|
|
|
*/
|
2021-08-06 16:35:50 +08:00
|
|
|
static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
|
|
|
|
struct kvm_page_fault *fault,
|
|
|
|
struct tdp_iter *iter)
|
2020-10-15 02:26:50 +08:00
|
|
|
{
|
2021-11-04 00:18:33 +08:00
|
|
|
struct kvm_mmu_page *sp = sptep_to_sp(rcu_dereference(iter->sptep));
|
2020-10-15 02:26:50 +08:00
|
|
|
u64 new_spte;
|
2021-06-15 08:57:09 +08:00
|
|
|
int ret = RET_PF_FIXED;
|
2021-08-17 19:32:09 +08:00
|
|
|
bool wrprot = false;
|
2020-10-15 02:26:50 +08:00
|
|
|
|
2022-12-13 11:30:29 +08:00
|
|
|
if (WARN_ON_ONCE(sp->role.level != fault->goal_level))
|
|
|
|
return RET_PF_RETRY;
|
|
|
|
|
2021-09-24 17:05:26 +08:00
|
|
|
if (unlikely(!fault->slot))
|
2020-10-15 02:26:50 +08:00
|
|
|
new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
|
2021-02-03 02:57:26 +08:00
|
|
|
else
|
2021-08-17 20:46:45 +08:00
|
|
|
wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter->gfn,
|
2021-09-29 21:19:32 +08:00
|
|
|
fault->pfn, iter->old_spte, fault->prefetch, true,
|
2021-08-17 19:43:19 +08:00
|
|
|
fault->map_writable, &new_spte);
|
2020-10-15 02:26:50 +08:00
|
|
|
|
|
|
|
if (new_spte == iter->old_spte)
|
|
|
|
ret = RET_PF_SPURIOUS;
|
2022-01-20 07:07:25 +08:00
|
|
|
else if (tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte))
|
2021-02-03 02:57:26 +08:00
|
|
|
return RET_PF_RETRY;
|
2022-02-26 08:15:37 +08:00
|
|
|
else if (is_shadow_present_pte(iter->old_spte) &&
|
|
|
|
!is_last_spte(iter->old_spte, iter->level))
|
2022-10-10 20:19:14 +08:00
|
|
|
kvm_flush_remote_tlbs_gfn(vcpu->kvm, iter->gfn, iter->level);
|
2020-10-15 02:26:50 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If the page fault was caused by a write but the page is write
|
|
|
|
* protected, emulation is needed. If the emulation was skipped,
|
|
|
|
* the vCPU would have the same fault again.
|
|
|
|
*/
|
2021-08-17 19:32:09 +08:00
|
|
|
if (wrprot) {
|
2021-08-06 16:35:50 +08:00
|
|
|
if (fault->write)
|
2020-10-15 02:26:50 +08:00
|
|
|
ret = RET_PF_EMULATE;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* If a MMIO SPTE is installed, the MMIO will need to be emulated. */
|
2021-02-03 02:57:26 +08:00
|
|
|
if (unlikely(is_mmio_spte(new_spte))) {
|
2022-04-23 11:47:49 +08:00
|
|
|
vcpu->stat.pf_mmio_spte_created++;
|
2021-02-03 02:57:26 +08:00
|
|
|
trace_mark_mmio_spte(rcu_dereference(iter->sptep), iter->gfn,
|
|
|
|
new_spte);
|
2020-10-15 02:26:50 +08:00
|
|
|
ret = RET_PF_EMULATE;
|
2021-02-26 04:47:33 +08:00
|
|
|
} else {
|
2021-02-03 02:57:26 +08:00
|
|
|
trace_kvm_mmu_set_spte(iter->level, iter->gfn,
|
|
|
|
rcu_dereference(iter->sptep));
|
2021-02-26 04:47:33 +08:00
|
|
|
}
|
2020-10-15 02:26:50 +08:00
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2022-01-20 07:07:28 +08:00
|
|
|
/*
|
2022-01-20 07:07:37 +08:00
|
|
|
* tdp_mmu_link_sp - Replace the given spte with an spte pointing to the
|
|
|
|
* provided page table.
|
2022-01-20 07:07:28 +08:00
|
|
|
*
|
|
|
|
* @kvm: kvm instance
|
|
|
|
* @iter: a tdp_iter instance currently on the SPTE that should be set
|
|
|
|
* @sp: The new TDP page table to install.
|
2022-01-20 07:07:37 +08:00
|
|
|
* @shared: This operation is running under the MMU lock in read mode.
|
2022-01-20 07:07:28 +08:00
|
|
|
*
|
|
|
|
* Returns: 0 if the new page table was installed. Non-0 if the page table
|
|
|
|
* could not be installed (e.g. the atomic compare-exchange failed).
|
|
|
|
*/
|
2022-01-20 07:07:37 +08:00
|
|
|
static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
|
2022-10-20 00:56:14 +08:00
|
|
|
struct kvm_mmu_page *sp, bool shared)
|
2022-01-20 07:07:28 +08:00
|
|
|
{
|
KVM: x86/mmu: Don't attempt fast page fault just because EPT is in use
Check for A/D bits being disabled instead of the access tracking mask
being non-zero when deciding whether or not to attempt to fix a page
fault vian the fast path. Originally, the access tracking mask was
non-zero if and only if A/D bits were disabled by _KVM_ (including not
being supported by hardware), but that hasn't been true since nVMX was
fixed to honor EPTP12's A/D enabling, i.e. since KVM allowed L1 to cause
KVM to not use A/D bits while running L2 despite KVM using them while
running L1.
In other words, don't attempt the fast path just because EPT is enabled.
Note, attempting the fast path for all !PRESENT faults can "fix" a very,
_VERY_ tiny percentage of faults out of mmu_lock by detecting that the
fault is spurious, i.e. has been fixed by a different vCPU, but again the
odds of that happening are vanishingly small. E.g. booting an 8-vCPU VM
gets less than 10 successes out of 30k+ faults, and that's likely one of
the more favorable scenarios. Disabling dirty logging can likely lead to
a rash of collisions between vCPUs for some workloads that operate on a
common set of pages, but penalizing _all_ !PRESENT faults for that one
case is unlikely to be a net positive, not to mention that that problem
is best solved by not zapping in the first place.
The number of spurious faults does scale with the number of vCPUs, e.g. a
255-vCPU VM using TDP "jumps" to ~60 spurious faults detected in the fast
path (again out of 30k), but that's all of 0.2% of faults. Using legacy
shadow paging does get more spurious faults, and a few more detected out
of mmu_lock, but the percentage goes _down_ to 0.08% (and that's ignoring
faults that are reflected into the guest), i.e. the extra detections are
purely due to the sheer number of faults observed.
On the other hand, getting a "negative" in the fast path takes in the
neighborhood of 150-250 cycles. So while it is tempting to keep/extend
the current behavior, such a change needs to come with hard numbers
showing that it's actually a win in the grand scheme, or any scheme for
that matter.
Fixes: 995f00a61958 ("x86: kvm: mmu: use ept a/d in vmcs02 iff used in vmcs12")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-23 11:47:44 +08:00
|
|
|
u64 spte = make_nonleaf_spte(sp->spt, !kvm_ad_enabled());
|
2022-01-20 07:07:37 +08:00
|
|
|
int ret = 0;
|
2022-01-20 07:07:28 +08:00
|
|
|
|
2022-01-20 07:07:37 +08:00
|
|
|
if (shared) {
|
|
|
|
ret = tdp_mmu_set_spte_atomic(kvm, iter, spte);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
} else {
|
2023-03-22 06:00:19 +08:00
|
|
|
tdp_mmu_iter_set_spte(kvm, iter, spte);
|
2022-01-20 07:07:37 +08:00
|
|
|
}
|
2022-01-20 07:07:28 +08:00
|
|
|
|
2022-08-23 08:46:37 +08:00
|
|
|
tdp_account_mmu_page(kvm, sp);
|
2022-01-20 07:07:28 +08:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
KVM: x86/mmu: Split huge pages mapped by the TDP MMU on fault
Now that the TDP MMU has a mechanism to split huge pages, use it in the
fault path when a huge page needs to be replaced with a mapping at a
lower level.
This change reduces the negative performance impact of NX HugePages.
Prior to this change if a vCPU executed from a huge page and NX
HugePages was enabled, the vCPU would take a fault, zap the huge page,
and mapping the faulting address at 4KiB with execute permissions
enabled. The rest of the memory would be left *unmapped* and have to be
faulted back in by the guest upon access (read, write, or execute). If
guest is backed by 1GiB, a single execute instruction can zap an entire
GiB of its physical address space.
For example, it can take a VM longer to execute from its memory than to
populate that memory in the first place:
$ ./execute_perf_test -s anonymous_hugetlb_1gb -v96
Populating memory : 2.748378795s
Executing from memory : 2.899670885s
With this change, such faults split the huge page instead of zapping it,
which avoids the non-present faults on the rest of the huge page:
$ ./execute_perf_test -s anonymous_hugetlb_1gb -v96
Populating memory : 2.729544474s
Executing from memory : 0.111965688s <---
This change also reduces the performance impact of dirty logging when
eager_page_split=N. eager_page_split=N (abbreviated "eps=N" below) can
be desirable for read-heavy workloads, as it avoids allocating memory to
split huge pages that are never written and avoids increasing the TLB
miss cost on reads of those pages.
| Config: ept=Y, tdp_mmu=Y, 5% writes |
| Iteration 1 dirty memory time |
| --------------------------------------------- |
vCPU Count | eps=N (Before) | eps=N (After) | eps=Y |
------------ | -------------- | ------------- | ------------ |
2 | 0.332305091s | 0.019615027s | 0.006108211s |
4 | 0.353096020s | 0.019452131s | 0.006214670s |
8 | 0.453938562s | 0.019748246s | 0.006610997s |
16 | 0.719095024s | 0.019972171s | 0.007757889s |
32 | 1.698727124s | 0.021361615s | 0.012274432s |
64 | 2.630673582s | 0.031122014s | 0.016994683s |
96 | 3.016535213s | 0.062608739s | 0.044760838s |
Eager page splitting remains beneficial for write-heavy workloads, but
the gap is now reduced.
| Config: ept=Y, tdp_mmu=Y, 100% writes |
| Iteration 1 dirty memory time |
| --------------------------------------------- |
vCPU Count | eps=N (Before) | eps=N (After) | eps=Y |
------------ | -------------- | ------------- | ------------ |
2 | 0.317710329s | 0.296204596s | 0.058689782s |
4 | 0.337102375s | 0.299841017s | 0.060343076s |
8 | 0.386025681s | 0.297274460s | 0.060399702s |
16 | 0.791462524s | 0.298942578s | 0.062508699s |
32 | 1.719646014s | 0.313101996s | 0.075984855s |
64 | 2.527973150s | 0.455779206s | 0.079789363s |
96 | 2.681123208s | 0.673778787s | 0.165386739s |
Further study is needed to determine if the remaining gap is acceptable
for customer workloads or if eager_page_split=N still requires a-priori
knowledge of the VM workload, especially when considering these costs
extrapolated out to large VMs with e.g. 416 vCPUs and 12TB RAM.
Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Message-Id: <20221109185905.486172-3-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-10 02:59:05 +08:00
|
|
|
static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
|
|
|
|
struct kvm_mmu_page *sp, bool shared);
|
|
|
|
|
2020-10-15 02:26:50 +08:00
|
|
|
/*
|
|
|
|
* Handle a TDP page fault (NPT/EPT violation/misconfiguration) by installing
|
|
|
|
* page tables and SPTEs to translate the faulting guest physical address.
|
|
|
|
*/
|
2021-08-06 16:35:50 +08:00
|
|
|
int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
|
2020-10-15 02:26:50 +08:00
|
|
|
{
|
|
|
|
struct kvm_mmu *mmu = vcpu->arch.mmu;
|
2022-10-20 00:56:14 +08:00
|
|
|
struct kvm *kvm = vcpu->kvm;
|
2020-10-15 02:26:50 +08:00
|
|
|
struct tdp_iter iter;
|
2020-10-15 02:26:51 +08:00
|
|
|
struct kvm_mmu_page *sp;
|
2022-11-18 00:05:51 +08:00
|
|
|
int ret = RET_PF_RETRY;
|
2020-10-15 02:26:50 +08:00
|
|
|
|
2021-08-07 21:21:53 +08:00
|
|
|
kvm_mmu_hugepage_adjust(vcpu, fault);
|
2020-10-15 02:26:50 +08:00
|
|
|
|
2021-08-06 16:35:50 +08:00
|
|
|
trace_kvm_mmu_spte_requested(fault);
|
2021-02-03 02:57:23 +08:00
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
|
2021-08-06 16:35:50 +08:00
|
|
|
tdp_mmu_for_each_pte(iter, mmu, fault->gfn, fault->gfn + 1) {
|
2022-11-18 00:05:51 +08:00
|
|
|
int r;
|
|
|
|
|
2021-08-07 21:21:53 +08:00
|
|
|
if (fault->nx_huge_page_workaround_enabled)
|
2021-08-06 16:35:50 +08:00
|
|
|
disallowed_hugepage_adjust(fault, iter.old_spte, iter.level);
|
2020-10-15 02:26:50 +08:00
|
|
|
|
KVM: x86/mmu: Split huge pages mapped by the TDP MMU on fault
Now that the TDP MMU has a mechanism to split huge pages, use it in the
fault path when a huge page needs to be replaced with a mapping at a
lower level.
This change reduces the negative performance impact of NX HugePages.
Prior to this change if a vCPU executed from a huge page and NX
HugePages was enabled, the vCPU would take a fault, zap the huge page,
and mapping the faulting address at 4KiB with execute permissions
enabled. The rest of the memory would be left *unmapped* and have to be
faulted back in by the guest upon access (read, write, or execute). If
guest is backed by 1GiB, a single execute instruction can zap an entire
GiB of its physical address space.
For example, it can take a VM longer to execute from its memory than to
populate that memory in the first place:
$ ./execute_perf_test -s anonymous_hugetlb_1gb -v96
Populating memory : 2.748378795s
Executing from memory : 2.899670885s
With this change, such faults split the huge page instead of zapping it,
which avoids the non-present faults on the rest of the huge page:
$ ./execute_perf_test -s anonymous_hugetlb_1gb -v96
Populating memory : 2.729544474s
Executing from memory : 0.111965688s <---
This change also reduces the performance impact of dirty logging when
eager_page_split=N. eager_page_split=N (abbreviated "eps=N" below) can
be desirable for read-heavy workloads, as it avoids allocating memory to
split huge pages that are never written and avoids increasing the TLB
miss cost on reads of those pages.
| Config: ept=Y, tdp_mmu=Y, 5% writes |
| Iteration 1 dirty memory time |
| --------------------------------------------- |
vCPU Count | eps=N (Before) | eps=N (After) | eps=Y |
------------ | -------------- | ------------- | ------------ |
2 | 0.332305091s | 0.019615027s | 0.006108211s |
4 | 0.353096020s | 0.019452131s | 0.006214670s |
8 | 0.453938562s | 0.019748246s | 0.006610997s |
16 | 0.719095024s | 0.019972171s | 0.007757889s |
32 | 1.698727124s | 0.021361615s | 0.012274432s |
64 | 2.630673582s | 0.031122014s | 0.016994683s |
96 | 3.016535213s | 0.062608739s | 0.044760838s |
Eager page splitting remains beneficial for write-heavy workloads, but
the gap is now reduced.
| Config: ept=Y, tdp_mmu=Y, 100% writes |
| Iteration 1 dirty memory time |
| --------------------------------------------- |
vCPU Count | eps=N (Before) | eps=N (After) | eps=Y |
------------ | -------------- | ------------- | ------------ |
2 | 0.317710329s | 0.296204596s | 0.058689782s |
4 | 0.337102375s | 0.299841017s | 0.060343076s |
8 | 0.386025681s | 0.297274460s | 0.060399702s |
16 | 0.791462524s | 0.298942578s | 0.062508699s |
32 | 1.719646014s | 0.313101996s | 0.075984855s |
64 | 2.527973150s | 0.455779206s | 0.079789363s |
96 | 2.681123208s | 0.673778787s | 0.165386739s |
Further study is needed to determine if the remaining gap is acceptable
for customer workloads or if eager_page_split=N still requires a-priori
knowledge of the VM workload, especially when considering these costs
extrapolated out to large VMs with e.g. 416 vCPUs and 12TB RAM.
Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Message-Id: <20221109185905.486172-3-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-10 02:59:05 +08:00
|
|
|
/*
|
|
|
|
* If SPTE has been frozen by another thread, just give up and
|
|
|
|
* retry, avoiding unnecessary page table allocation and free.
|
|
|
|
*/
|
|
|
|
if (is_removed_spte(iter.old_spte))
|
2022-11-18 00:05:51 +08:00
|
|
|
goto retry;
|
|
|
|
|
2022-12-13 11:30:26 +08:00
|
|
|
if (iter.level == fault->goal_level)
|
KVM: x86/mmu: Map TDP MMU leaf SPTE iff target level is reached
Map the leaf SPTE when handling a TDP MMU page fault if and only if the
target level is reached. A recent commit reworked the retry logic and
incorrectly assumed that walking SPTEs would never "fail", as the loop
either bails (retries) or installs parent SPs. However, the iterator
itself will bail early if it detects a frozen (REMOVED) SPTE when
stepping down. The TDP iterator also rereads the current SPTE before
stepping down specifically to avoid walking into a part of the tree that
is being removed, which means it's possible to terminate the loop without
the guts of the loop observing the frozen SPTE, e.g. if a different task
zaps a parent SPTE between the initial read and try_step_down()'s refresh.
Mapping a leaf SPTE at the wrong level results in all kinds of badness as
page table walkers interpret the SPTE as a page table, not a leaf, and
walk into the weeds.
------------[ cut here ]------------
WARNING: CPU: 1 PID: 1025 at arch/x86/kvm/mmu/tdp_mmu.c:1070 kvm_tdp_mmu_map+0x481/0x510
Modules linked in: kvm_intel
CPU: 1 PID: 1025 Comm: nx_huge_pages_t Tainted: G W 6.1.0-rc4+ #64
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:kvm_tdp_mmu_map+0x481/0x510
RSP: 0018:ffffc9000072fba8 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffffc9000072fcc0 RCX: 0000000000000027
RDX: 0000000000000027 RSI: 00000000ffffdfff RDI: ffff888277c5b4c8
RBP: ffff888107d45a10 R08: ffff888277c5b4c0 R09: ffffc9000072fa48
R10: 0000000000000001 R11: 0000000000000001 R12: ffffc9000073a0e0
R13: ffff88810fc54800 R14: ffff888107d1ae60 R15: ffff88810fc54f90
FS: 00007fba9f853740(0000) GS:ffff888277c40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000010aa7a003 CR4: 0000000000172ea0
Call Trace:
<TASK>
kvm_tdp_page_fault+0x10c/0x130
kvm_mmu_page_fault+0x103/0x680
vmx_handle_exit+0x132/0x5a0 [kvm_intel]
vcpu_enter_guest+0x60c/0x16f0
kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0
kvm_vcpu_ioctl+0x271/0x660
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x2b/0x50
entry_SYSCALL_64_after_hwframe+0x46/0xb0
</TASK>
---[ end trace 0000000000000000 ]---
Invalid SPTE change: cannot replace a present leaf
SPTE with another present leaf SPTE mapping a
different PFN!
as_id: 0 gfn: 100200 old_spte: 600000112400bf3 new_spte: 6000001126009f3 level: 2
------------[ cut here ]------------
kernel BUG at arch/x86/kvm/mmu/tdp_mmu.c:559!
invalid opcode: 0000 [#1] SMP
CPU: 1 PID: 1025 Comm: nx_huge_pages_t Tainted: G W 6.1.0-rc4+ #64
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:__handle_changed_spte.cold+0x95/0x9c
RSP: 0018:ffffc9000072faf8 EFLAGS: 00010246
RAX: 00000000000000c1 RBX: ffffc90000731000 RCX: 0000000000000027
RDX: 0000000000000000 RSI: 00000000ffffdfff RDI: ffff888277c5b4c8
RBP: 0600000112400bf3 R08: ffff888277c5b4c0 R09: ffffc9000072f9a0
R10: 0000000000000001 R11: 0000000000000001 R12: 06000001126009f3
R13: 0000000000000002 R14: 0000000012600901 R15: 0000000012400b01
FS: 00007fba9f853740(0000) GS:ffff888277c40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000010aa7a003 CR4: 0000000000172ea0
Call Trace:
<TASK>
kvm_tdp_mmu_map+0x3b0/0x510
kvm_tdp_page_fault+0x10c/0x130
kvm_mmu_page_fault+0x103/0x680
vmx_handle_exit+0x132/0x5a0 [kvm_intel]
vcpu_enter_guest+0x60c/0x16f0
kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0
kvm_vcpu_ioctl+0x271/0x660
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x2b/0x50
entry_SYSCALL_64_after_hwframe+0x46/0xb0
</TASK>
Modules linked in: kvm_intel
---[ end trace 0000000000000000 ]---
Fixes: 63d28a25e04c ("KVM: x86/mmu: simplify kvm_tdp_mmu_map flow when guest has to retry")
Cc: Robert Hoo <robert.hu@linux.intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221213033030.83345-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-13 11:30:27 +08:00
|
|
|
goto map_target_level;
|
2022-12-13 11:30:26 +08:00
|
|
|
|
2022-11-18 00:05:51 +08:00
|
|
|
/* Step down into the lower level page table if it exists. */
|
|
|
|
if (is_shadow_present_pte(iter.old_spte) &&
|
|
|
|
!is_large_pte(iter.old_spte))
|
|
|
|
continue;
|
2020-10-15 02:26:50 +08:00
|
|
|
|
KVM: x86/mmu: Split huge pages mapped by the TDP MMU on fault
Now that the TDP MMU has a mechanism to split huge pages, use it in the
fault path when a huge page needs to be replaced with a mapping at a
lower level.
This change reduces the negative performance impact of NX HugePages.
Prior to this change if a vCPU executed from a huge page and NX
HugePages was enabled, the vCPU would take a fault, zap the huge page,
and mapping the faulting address at 4KiB with execute permissions
enabled. The rest of the memory would be left *unmapped* and have to be
faulted back in by the guest upon access (read, write, or execute). If
guest is backed by 1GiB, a single execute instruction can zap an entire
GiB of its physical address space.
For example, it can take a VM longer to execute from its memory than to
populate that memory in the first place:
$ ./execute_perf_test -s anonymous_hugetlb_1gb -v96
Populating memory : 2.748378795s
Executing from memory : 2.899670885s
With this change, such faults split the huge page instead of zapping it,
which avoids the non-present faults on the rest of the huge page:
$ ./execute_perf_test -s anonymous_hugetlb_1gb -v96
Populating memory : 2.729544474s
Executing from memory : 0.111965688s <---
This change also reduces the performance impact of dirty logging when
eager_page_split=N. eager_page_split=N (abbreviated "eps=N" below) can
be desirable for read-heavy workloads, as it avoids allocating memory to
split huge pages that are never written and avoids increasing the TLB
miss cost on reads of those pages.
| Config: ept=Y, tdp_mmu=Y, 5% writes |
| Iteration 1 dirty memory time |
| --------------------------------------------- |
vCPU Count | eps=N (Before) | eps=N (After) | eps=Y |
------------ | -------------- | ------------- | ------------ |
2 | 0.332305091s | 0.019615027s | 0.006108211s |
4 | 0.353096020s | 0.019452131s | 0.006214670s |
8 | 0.453938562s | 0.019748246s | 0.006610997s |
16 | 0.719095024s | 0.019972171s | 0.007757889s |
32 | 1.698727124s | 0.021361615s | 0.012274432s |
64 | 2.630673582s | 0.031122014s | 0.016994683s |
96 | 3.016535213s | 0.062608739s | 0.044760838s |
Eager page splitting remains beneficial for write-heavy workloads, but
the gap is now reduced.
| Config: ept=Y, tdp_mmu=Y, 100% writes |
| Iteration 1 dirty memory time |
| --------------------------------------------- |
vCPU Count | eps=N (Before) | eps=N (After) | eps=Y |
------------ | -------------- | ------------- | ------------ |
2 | 0.317710329s | 0.296204596s | 0.058689782s |
4 | 0.337102375s | 0.299841017s | 0.060343076s |
8 | 0.386025681s | 0.297274460s | 0.060399702s |
16 | 0.791462524s | 0.298942578s | 0.062508699s |
32 | 1.719646014s | 0.313101996s | 0.075984855s |
64 | 2.527973150s | 0.455779206s | 0.079789363s |
96 | 2.681123208s | 0.673778787s | 0.165386739s |
Further study is needed to determine if the remaining gap is acceptable
for customer workloads or if eager_page_split=N still requires a-priori
knowledge of the VM workload, especially when considering these costs
extrapolated out to large VMs with e.g. 416 vCPUs and 12TB RAM.
Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Message-Id: <20221109185905.486172-3-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-10 02:59:05 +08:00
|
|
|
/*
|
|
|
|
* The SPTE is either non-present or points to a huge page that
|
|
|
|
* needs to be split.
|
|
|
|
*/
|
|
|
|
sp = tdp_mmu_alloc_sp(vcpu);
|
|
|
|
tdp_mmu_init_child_sp(sp, &iter);
|
2021-04-29 12:12:26 +08:00
|
|
|
|
KVM: x86/mmu: Split huge pages mapped by the TDP MMU on fault
Now that the TDP MMU has a mechanism to split huge pages, use it in the
fault path when a huge page needs to be replaced with a mapping at a
lower level.
This change reduces the negative performance impact of NX HugePages.
Prior to this change if a vCPU executed from a huge page and NX
HugePages was enabled, the vCPU would take a fault, zap the huge page,
and mapping the faulting address at 4KiB with execute permissions
enabled. The rest of the memory would be left *unmapped* and have to be
faulted back in by the guest upon access (read, write, or execute). If
guest is backed by 1GiB, a single execute instruction can zap an entire
GiB of its physical address space.
For example, it can take a VM longer to execute from its memory than to
populate that memory in the first place:
$ ./execute_perf_test -s anonymous_hugetlb_1gb -v96
Populating memory : 2.748378795s
Executing from memory : 2.899670885s
With this change, such faults split the huge page instead of zapping it,
which avoids the non-present faults on the rest of the huge page:
$ ./execute_perf_test -s anonymous_hugetlb_1gb -v96
Populating memory : 2.729544474s
Executing from memory : 0.111965688s <---
This change also reduces the performance impact of dirty logging when
eager_page_split=N. eager_page_split=N (abbreviated "eps=N" below) can
be desirable for read-heavy workloads, as it avoids allocating memory to
split huge pages that are never written and avoids increasing the TLB
miss cost on reads of those pages.
| Config: ept=Y, tdp_mmu=Y, 5% writes |
| Iteration 1 dirty memory time |
| --------------------------------------------- |
vCPU Count | eps=N (Before) | eps=N (After) | eps=Y |
------------ | -------------- | ------------- | ------------ |
2 | 0.332305091s | 0.019615027s | 0.006108211s |
4 | 0.353096020s | 0.019452131s | 0.006214670s |
8 | 0.453938562s | 0.019748246s | 0.006610997s |
16 | 0.719095024s | 0.019972171s | 0.007757889s |
32 | 1.698727124s | 0.021361615s | 0.012274432s |
64 | 2.630673582s | 0.031122014s | 0.016994683s |
96 | 3.016535213s | 0.062608739s | 0.044760838s |
Eager page splitting remains beneficial for write-heavy workloads, but
the gap is now reduced.
| Config: ept=Y, tdp_mmu=Y, 100% writes |
| Iteration 1 dirty memory time |
| --------------------------------------------- |
vCPU Count | eps=N (Before) | eps=N (After) | eps=Y |
------------ | -------------- | ------------- | ------------ |
2 | 0.317710329s | 0.296204596s | 0.058689782s |
4 | 0.337102375s | 0.299841017s | 0.060343076s |
8 | 0.386025681s | 0.297274460s | 0.060399702s |
16 | 0.791462524s | 0.298942578s | 0.062508699s |
32 | 1.719646014s | 0.313101996s | 0.075984855s |
64 | 2.527973150s | 0.455779206s | 0.079789363s |
96 | 2.681123208s | 0.673778787s | 0.165386739s |
Further study is needed to determine if the remaining gap is acceptable
for customer workloads or if eager_page_split=N still requires a-priori
knowledge of the VM workload, especially when considering these costs
extrapolated out to large VMs with e.g. 416 vCPUs and 12TB RAM.
Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Message-Id: <20221109185905.486172-3-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-10 02:59:05 +08:00
|
|
|
sp->nx_huge_page_disallowed = fault->huge_page_disallowed;
|
2022-01-20 07:07:35 +08:00
|
|
|
|
KVM: x86/mmu: Split huge pages mapped by the TDP MMU on fault
Now that the TDP MMU has a mechanism to split huge pages, use it in the
fault path when a huge page needs to be replaced with a mapping at a
lower level.
This change reduces the negative performance impact of NX HugePages.
Prior to this change if a vCPU executed from a huge page and NX
HugePages was enabled, the vCPU would take a fault, zap the huge page,
and mapping the faulting address at 4KiB with execute permissions
enabled. The rest of the memory would be left *unmapped* and have to be
faulted back in by the guest upon access (read, write, or execute). If
guest is backed by 1GiB, a single execute instruction can zap an entire
GiB of its physical address space.
For example, it can take a VM longer to execute from its memory than to
populate that memory in the first place:
$ ./execute_perf_test -s anonymous_hugetlb_1gb -v96
Populating memory : 2.748378795s
Executing from memory : 2.899670885s
With this change, such faults split the huge page instead of zapping it,
which avoids the non-present faults on the rest of the huge page:
$ ./execute_perf_test -s anonymous_hugetlb_1gb -v96
Populating memory : 2.729544474s
Executing from memory : 0.111965688s <---
This change also reduces the performance impact of dirty logging when
eager_page_split=N. eager_page_split=N (abbreviated "eps=N" below) can
be desirable for read-heavy workloads, as it avoids allocating memory to
split huge pages that are never written and avoids increasing the TLB
miss cost on reads of those pages.
| Config: ept=Y, tdp_mmu=Y, 5% writes |
| Iteration 1 dirty memory time |
| --------------------------------------------- |
vCPU Count | eps=N (Before) | eps=N (After) | eps=Y |
------------ | -------------- | ------------- | ------------ |
2 | 0.332305091s | 0.019615027s | 0.006108211s |
4 | 0.353096020s | 0.019452131s | 0.006214670s |
8 | 0.453938562s | 0.019748246s | 0.006610997s |
16 | 0.719095024s | 0.019972171s | 0.007757889s |
32 | 1.698727124s | 0.021361615s | 0.012274432s |
64 | 2.630673582s | 0.031122014s | 0.016994683s |
96 | 3.016535213s | 0.062608739s | 0.044760838s |
Eager page splitting remains beneficial for write-heavy workloads, but
the gap is now reduced.
| Config: ept=Y, tdp_mmu=Y, 100% writes |
| Iteration 1 dirty memory time |
| --------------------------------------------- |
vCPU Count | eps=N (Before) | eps=N (After) | eps=Y |
------------ | -------------- | ------------- | ------------ |
2 | 0.317710329s | 0.296204596s | 0.058689782s |
4 | 0.337102375s | 0.299841017s | 0.060343076s |
8 | 0.386025681s | 0.297274460s | 0.060399702s |
16 | 0.791462524s | 0.298942578s | 0.062508699s |
32 | 1.719646014s | 0.313101996s | 0.075984855s |
64 | 2.527973150s | 0.455779206s | 0.079789363s |
96 | 2.681123208s | 0.673778787s | 0.165386739s |
Further study is needed to determine if the remaining gap is acceptable
for customer workloads or if eager_page_split=N still requires a-priori
knowledge of the VM workload, especially when considering these costs
extrapolated out to large VMs with e.g. 416 vCPUs and 12TB RAM.
Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Message-Id: <20221109185905.486172-3-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-10 02:59:05 +08:00
|
|
|
if (is_shadow_present_pte(iter.old_spte))
|
2022-11-18 00:05:51 +08:00
|
|
|
r = tdp_mmu_split_huge_page(kvm, &iter, sp, true);
|
KVM: x86/mmu: Split huge pages mapped by the TDP MMU on fault
Now that the TDP MMU has a mechanism to split huge pages, use it in the
fault path when a huge page needs to be replaced with a mapping at a
lower level.
This change reduces the negative performance impact of NX HugePages.
Prior to this change if a vCPU executed from a huge page and NX
HugePages was enabled, the vCPU would take a fault, zap the huge page,
and mapping the faulting address at 4KiB with execute permissions
enabled. The rest of the memory would be left *unmapped* and have to be
faulted back in by the guest upon access (read, write, or execute). If
guest is backed by 1GiB, a single execute instruction can zap an entire
GiB of its physical address space.
For example, it can take a VM longer to execute from its memory than to
populate that memory in the first place:
$ ./execute_perf_test -s anonymous_hugetlb_1gb -v96
Populating memory : 2.748378795s
Executing from memory : 2.899670885s
With this change, such faults split the huge page instead of zapping it,
which avoids the non-present faults on the rest of the huge page:
$ ./execute_perf_test -s anonymous_hugetlb_1gb -v96
Populating memory : 2.729544474s
Executing from memory : 0.111965688s <---
This change also reduces the performance impact of dirty logging when
eager_page_split=N. eager_page_split=N (abbreviated "eps=N" below) can
be desirable for read-heavy workloads, as it avoids allocating memory to
split huge pages that are never written and avoids increasing the TLB
miss cost on reads of those pages.
| Config: ept=Y, tdp_mmu=Y, 5% writes |
| Iteration 1 dirty memory time |
| --------------------------------------------- |
vCPU Count | eps=N (Before) | eps=N (After) | eps=Y |
------------ | -------------- | ------------- | ------------ |
2 | 0.332305091s | 0.019615027s | 0.006108211s |
4 | 0.353096020s | 0.019452131s | 0.006214670s |
8 | 0.453938562s | 0.019748246s | 0.006610997s |
16 | 0.719095024s | 0.019972171s | 0.007757889s |
32 | 1.698727124s | 0.021361615s | 0.012274432s |
64 | 2.630673582s | 0.031122014s | 0.016994683s |
96 | 3.016535213s | 0.062608739s | 0.044760838s |
Eager page splitting remains beneficial for write-heavy workloads, but
the gap is now reduced.
| Config: ept=Y, tdp_mmu=Y, 100% writes |
| Iteration 1 dirty memory time |
| --------------------------------------------- |
vCPU Count | eps=N (Before) | eps=N (After) | eps=Y |
------------ | -------------- | ------------- | ------------ |
2 | 0.317710329s | 0.296204596s | 0.058689782s |
4 | 0.337102375s | 0.299841017s | 0.060343076s |
8 | 0.386025681s | 0.297274460s | 0.060399702s |
16 | 0.791462524s | 0.298942578s | 0.062508699s |
32 | 1.719646014s | 0.313101996s | 0.075984855s |
64 | 2.527973150s | 0.455779206s | 0.079789363s |
96 | 2.681123208s | 0.673778787s | 0.165386739s |
Further study is needed to determine if the remaining gap is acceptable
for customer workloads or if eager_page_split=N still requires a-priori
knowledge of the VM workload, especially when considering these costs
extrapolated out to large VMs with e.g. 416 vCPUs and 12TB RAM.
Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Message-Id: <20221109185905.486172-3-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-10 02:59:05 +08:00
|
|
|
else
|
2022-11-18 00:05:51 +08:00
|
|
|
r = tdp_mmu_link_sp(kvm, &iter, sp, true);
|
2022-10-20 00:56:14 +08:00
|
|
|
|
2022-11-18 00:05:51 +08:00
|
|
|
/*
|
KVM: x86/mmu: Map TDP MMU leaf SPTE iff target level is reached
Map the leaf SPTE when handling a TDP MMU page fault if and only if the
target level is reached. A recent commit reworked the retry logic and
incorrectly assumed that walking SPTEs would never "fail", as the loop
either bails (retries) or installs parent SPs. However, the iterator
itself will bail early if it detects a frozen (REMOVED) SPTE when
stepping down. The TDP iterator also rereads the current SPTE before
stepping down specifically to avoid walking into a part of the tree that
is being removed, which means it's possible to terminate the loop without
the guts of the loop observing the frozen SPTE, e.g. if a different task
zaps a parent SPTE between the initial read and try_step_down()'s refresh.
Mapping a leaf SPTE at the wrong level results in all kinds of badness as
page table walkers interpret the SPTE as a page table, not a leaf, and
walk into the weeds.
------------[ cut here ]------------
WARNING: CPU: 1 PID: 1025 at arch/x86/kvm/mmu/tdp_mmu.c:1070 kvm_tdp_mmu_map+0x481/0x510
Modules linked in: kvm_intel
CPU: 1 PID: 1025 Comm: nx_huge_pages_t Tainted: G W 6.1.0-rc4+ #64
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:kvm_tdp_mmu_map+0x481/0x510
RSP: 0018:ffffc9000072fba8 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffffc9000072fcc0 RCX: 0000000000000027
RDX: 0000000000000027 RSI: 00000000ffffdfff RDI: ffff888277c5b4c8
RBP: ffff888107d45a10 R08: ffff888277c5b4c0 R09: ffffc9000072fa48
R10: 0000000000000001 R11: 0000000000000001 R12: ffffc9000073a0e0
R13: ffff88810fc54800 R14: ffff888107d1ae60 R15: ffff88810fc54f90
FS: 00007fba9f853740(0000) GS:ffff888277c40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000010aa7a003 CR4: 0000000000172ea0
Call Trace:
<TASK>
kvm_tdp_page_fault+0x10c/0x130
kvm_mmu_page_fault+0x103/0x680
vmx_handle_exit+0x132/0x5a0 [kvm_intel]
vcpu_enter_guest+0x60c/0x16f0
kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0
kvm_vcpu_ioctl+0x271/0x660
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x2b/0x50
entry_SYSCALL_64_after_hwframe+0x46/0xb0
</TASK>
---[ end trace 0000000000000000 ]---
Invalid SPTE change: cannot replace a present leaf
SPTE with another present leaf SPTE mapping a
different PFN!
as_id: 0 gfn: 100200 old_spte: 600000112400bf3 new_spte: 6000001126009f3 level: 2
------------[ cut here ]------------
kernel BUG at arch/x86/kvm/mmu/tdp_mmu.c:559!
invalid opcode: 0000 [#1] SMP
CPU: 1 PID: 1025 Comm: nx_huge_pages_t Tainted: G W 6.1.0-rc4+ #64
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:__handle_changed_spte.cold+0x95/0x9c
RSP: 0018:ffffc9000072faf8 EFLAGS: 00010246
RAX: 00000000000000c1 RBX: ffffc90000731000 RCX: 0000000000000027
RDX: 0000000000000000 RSI: 00000000ffffdfff RDI: ffff888277c5b4c8
RBP: 0600000112400bf3 R08: ffff888277c5b4c0 R09: ffffc9000072f9a0
R10: 0000000000000001 R11: 0000000000000001 R12: 06000001126009f3
R13: 0000000000000002 R14: 0000000012600901 R15: 0000000012400b01
FS: 00007fba9f853740(0000) GS:ffff888277c40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000010aa7a003 CR4: 0000000000172ea0
Call Trace:
<TASK>
kvm_tdp_mmu_map+0x3b0/0x510
kvm_tdp_page_fault+0x10c/0x130
kvm_mmu_page_fault+0x103/0x680
vmx_handle_exit+0x132/0x5a0 [kvm_intel]
vcpu_enter_guest+0x60c/0x16f0
kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0
kvm_vcpu_ioctl+0x271/0x660
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x2b/0x50
entry_SYSCALL_64_after_hwframe+0x46/0xb0
</TASK>
Modules linked in: kvm_intel
---[ end trace 0000000000000000 ]---
Fixes: 63d28a25e04c ("KVM: x86/mmu: simplify kvm_tdp_mmu_map flow when guest has to retry")
Cc: Robert Hoo <robert.hu@linux.intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221213033030.83345-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-13 11:30:27 +08:00
|
|
|
* Force the guest to retry if installing an upper level SPTE
|
|
|
|
* failed, e.g. because a different task modified the SPTE.
|
2022-11-18 00:05:51 +08:00
|
|
|
*/
|
|
|
|
if (r) {
|
KVM: x86/mmu: Split huge pages mapped by the TDP MMU on fault
Now that the TDP MMU has a mechanism to split huge pages, use it in the
fault path when a huge page needs to be replaced with a mapping at a
lower level.
This change reduces the negative performance impact of NX HugePages.
Prior to this change if a vCPU executed from a huge page and NX
HugePages was enabled, the vCPU would take a fault, zap the huge page,
and mapping the faulting address at 4KiB with execute permissions
enabled. The rest of the memory would be left *unmapped* and have to be
faulted back in by the guest upon access (read, write, or execute). If
guest is backed by 1GiB, a single execute instruction can zap an entire
GiB of its physical address space.
For example, it can take a VM longer to execute from its memory than to
populate that memory in the first place:
$ ./execute_perf_test -s anonymous_hugetlb_1gb -v96
Populating memory : 2.748378795s
Executing from memory : 2.899670885s
With this change, such faults split the huge page instead of zapping it,
which avoids the non-present faults on the rest of the huge page:
$ ./execute_perf_test -s anonymous_hugetlb_1gb -v96
Populating memory : 2.729544474s
Executing from memory : 0.111965688s <---
This change also reduces the performance impact of dirty logging when
eager_page_split=N. eager_page_split=N (abbreviated "eps=N" below) can
be desirable for read-heavy workloads, as it avoids allocating memory to
split huge pages that are never written and avoids increasing the TLB
miss cost on reads of those pages.
| Config: ept=Y, tdp_mmu=Y, 5% writes |
| Iteration 1 dirty memory time |
| --------------------------------------------- |
vCPU Count | eps=N (Before) | eps=N (After) | eps=Y |
------------ | -------------- | ------------- | ------------ |
2 | 0.332305091s | 0.019615027s | 0.006108211s |
4 | 0.353096020s | 0.019452131s | 0.006214670s |
8 | 0.453938562s | 0.019748246s | 0.006610997s |
16 | 0.719095024s | 0.019972171s | 0.007757889s |
32 | 1.698727124s | 0.021361615s | 0.012274432s |
64 | 2.630673582s | 0.031122014s | 0.016994683s |
96 | 3.016535213s | 0.062608739s | 0.044760838s |
Eager page splitting remains beneficial for write-heavy workloads, but
the gap is now reduced.
| Config: ept=Y, tdp_mmu=Y, 100% writes |
| Iteration 1 dirty memory time |
| --------------------------------------------- |
vCPU Count | eps=N (Before) | eps=N (After) | eps=Y |
------------ | -------------- | ------------- | ------------ |
2 | 0.317710329s | 0.296204596s | 0.058689782s |
4 | 0.337102375s | 0.299841017s | 0.060343076s |
8 | 0.386025681s | 0.297274460s | 0.060399702s |
16 | 0.791462524s | 0.298942578s | 0.062508699s |
32 | 1.719646014s | 0.313101996s | 0.075984855s |
64 | 2.527973150s | 0.455779206s | 0.079789363s |
96 | 2.681123208s | 0.673778787s | 0.165386739s |
Further study is needed to determine if the remaining gap is acceptable
for customer workloads or if eager_page_split=N still requires a-priori
knowledge of the VM workload, especially when considering these costs
extrapolated out to large VMs with e.g. 416 vCPUs and 12TB RAM.
Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Message-Id: <20221109185905.486172-3-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-10 02:59:05 +08:00
|
|
|
tdp_mmu_free_sp(sp);
|
2022-11-18 00:05:51 +08:00
|
|
|
goto retry;
|
KVM: x86/mmu: Split huge pages mapped by the TDP MMU on fault
Now that the TDP MMU has a mechanism to split huge pages, use it in the
fault path when a huge page needs to be replaced with a mapping at a
lower level.
This change reduces the negative performance impact of NX HugePages.
Prior to this change if a vCPU executed from a huge page and NX
HugePages was enabled, the vCPU would take a fault, zap the huge page,
and mapping the faulting address at 4KiB with execute permissions
enabled. The rest of the memory would be left *unmapped* and have to be
faulted back in by the guest upon access (read, write, or execute). If
guest is backed by 1GiB, a single execute instruction can zap an entire
GiB of its physical address space.
For example, it can take a VM longer to execute from its memory than to
populate that memory in the first place:
$ ./execute_perf_test -s anonymous_hugetlb_1gb -v96
Populating memory : 2.748378795s
Executing from memory : 2.899670885s
With this change, such faults split the huge page instead of zapping it,
which avoids the non-present faults on the rest of the huge page:
$ ./execute_perf_test -s anonymous_hugetlb_1gb -v96
Populating memory : 2.729544474s
Executing from memory : 0.111965688s <---
This change also reduces the performance impact of dirty logging when
eager_page_split=N. eager_page_split=N (abbreviated "eps=N" below) can
be desirable for read-heavy workloads, as it avoids allocating memory to
split huge pages that are never written and avoids increasing the TLB
miss cost on reads of those pages.
| Config: ept=Y, tdp_mmu=Y, 5% writes |
| Iteration 1 dirty memory time |
| --------------------------------------------- |
vCPU Count | eps=N (Before) | eps=N (After) | eps=Y |
------------ | -------------- | ------------- | ------------ |
2 | 0.332305091s | 0.019615027s | 0.006108211s |
4 | 0.353096020s | 0.019452131s | 0.006214670s |
8 | 0.453938562s | 0.019748246s | 0.006610997s |
16 | 0.719095024s | 0.019972171s | 0.007757889s |
32 | 1.698727124s | 0.021361615s | 0.012274432s |
64 | 2.630673582s | 0.031122014s | 0.016994683s |
96 | 3.016535213s | 0.062608739s | 0.044760838s |
Eager page splitting remains beneficial for write-heavy workloads, but
the gap is now reduced.
| Config: ept=Y, tdp_mmu=Y, 100% writes |
| Iteration 1 dirty memory time |
| --------------------------------------------- |
vCPU Count | eps=N (Before) | eps=N (After) | eps=Y |
------------ | -------------- | ------------- | ------------ |
2 | 0.317710329s | 0.296204596s | 0.058689782s |
4 | 0.337102375s | 0.299841017s | 0.060343076s |
8 | 0.386025681s | 0.297274460s | 0.060399702s |
16 | 0.791462524s | 0.298942578s | 0.062508699s |
32 | 1.719646014s | 0.313101996s | 0.075984855s |
64 | 2.527973150s | 0.455779206s | 0.079789363s |
96 | 2.681123208s | 0.673778787s | 0.165386739s |
Further study is needed to determine if the remaining gap is acceptable
for customer workloads or if eager_page_split=N still requires a-priori
knowledge of the VM workload, especially when considering these costs
extrapolated out to large VMs with e.g. 416 vCPUs and 12TB RAM.
Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Message-Id: <20221109185905.486172-3-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-10 02:59:05 +08:00
|
|
|
}
|
2022-10-20 00:56:14 +08:00
|
|
|
|
KVM: x86/mmu: Split huge pages mapped by the TDP MMU on fault
Now that the TDP MMU has a mechanism to split huge pages, use it in the
fault path when a huge page needs to be replaced with a mapping at a
lower level.
This change reduces the negative performance impact of NX HugePages.
Prior to this change if a vCPU executed from a huge page and NX
HugePages was enabled, the vCPU would take a fault, zap the huge page,
and mapping the faulting address at 4KiB with execute permissions
enabled. The rest of the memory would be left *unmapped* and have to be
faulted back in by the guest upon access (read, write, or execute). If
guest is backed by 1GiB, a single execute instruction can zap an entire
GiB of its physical address space.
For example, it can take a VM longer to execute from its memory than to
populate that memory in the first place:
$ ./execute_perf_test -s anonymous_hugetlb_1gb -v96
Populating memory : 2.748378795s
Executing from memory : 2.899670885s
With this change, such faults split the huge page instead of zapping it,
which avoids the non-present faults on the rest of the huge page:
$ ./execute_perf_test -s anonymous_hugetlb_1gb -v96
Populating memory : 2.729544474s
Executing from memory : 0.111965688s <---
This change also reduces the performance impact of dirty logging when
eager_page_split=N. eager_page_split=N (abbreviated "eps=N" below) can
be desirable for read-heavy workloads, as it avoids allocating memory to
split huge pages that are never written and avoids increasing the TLB
miss cost on reads of those pages.
| Config: ept=Y, tdp_mmu=Y, 5% writes |
| Iteration 1 dirty memory time |
| --------------------------------------------- |
vCPU Count | eps=N (Before) | eps=N (After) | eps=Y |
------------ | -------------- | ------------- | ------------ |
2 | 0.332305091s | 0.019615027s | 0.006108211s |
4 | 0.353096020s | 0.019452131s | 0.006214670s |
8 | 0.453938562s | 0.019748246s | 0.006610997s |
16 | 0.719095024s | 0.019972171s | 0.007757889s |
32 | 1.698727124s | 0.021361615s | 0.012274432s |
64 | 2.630673582s | 0.031122014s | 0.016994683s |
96 | 3.016535213s | 0.062608739s | 0.044760838s |
Eager page splitting remains beneficial for write-heavy workloads, but
the gap is now reduced.
| Config: ept=Y, tdp_mmu=Y, 100% writes |
| Iteration 1 dirty memory time |
| --------------------------------------------- |
vCPU Count | eps=N (Before) | eps=N (After) | eps=Y |
------------ | -------------- | ------------- | ------------ |
2 | 0.317710329s | 0.296204596s | 0.058689782s |
4 | 0.337102375s | 0.299841017s | 0.060343076s |
8 | 0.386025681s | 0.297274460s | 0.060399702s |
16 | 0.791462524s | 0.298942578s | 0.062508699s |
32 | 1.719646014s | 0.313101996s | 0.075984855s |
64 | 2.527973150s | 0.455779206s | 0.079789363s |
96 | 2.681123208s | 0.673778787s | 0.165386739s |
Further study is needed to determine if the remaining gap is acceptable
for customer workloads or if eager_page_split=N still requires a-priori
knowledge of the VM workload, especially when considering these costs
extrapolated out to large VMs with e.g. 416 vCPUs and 12TB RAM.
Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Message-Id: <20221109185905.486172-3-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-10 02:59:05 +08:00
|
|
|
if (fault->huge_page_disallowed &&
|
|
|
|
fault->req_level >= iter.level) {
|
|
|
|
spin_lock(&kvm->arch.tdp_mmu_pages_lock);
|
2022-12-13 11:30:28 +08:00
|
|
|
if (sp->nx_huge_page_disallowed)
|
|
|
|
track_possible_nx_huge_page(kvm, sp);
|
KVM: x86/mmu: Split huge pages mapped by the TDP MMU on fault
Now that the TDP MMU has a mechanism to split huge pages, use it in the
fault path when a huge page needs to be replaced with a mapping at a
lower level.
This change reduces the negative performance impact of NX HugePages.
Prior to this change if a vCPU executed from a huge page and NX
HugePages was enabled, the vCPU would take a fault, zap the huge page,
and mapping the faulting address at 4KiB with execute permissions
enabled. The rest of the memory would be left *unmapped* and have to be
faulted back in by the guest upon access (read, write, or execute). If
guest is backed by 1GiB, a single execute instruction can zap an entire
GiB of its physical address space.
For example, it can take a VM longer to execute from its memory than to
populate that memory in the first place:
$ ./execute_perf_test -s anonymous_hugetlb_1gb -v96
Populating memory : 2.748378795s
Executing from memory : 2.899670885s
With this change, such faults split the huge page instead of zapping it,
which avoids the non-present faults on the rest of the huge page:
$ ./execute_perf_test -s anonymous_hugetlb_1gb -v96
Populating memory : 2.729544474s
Executing from memory : 0.111965688s <---
This change also reduces the performance impact of dirty logging when
eager_page_split=N. eager_page_split=N (abbreviated "eps=N" below) can
be desirable for read-heavy workloads, as it avoids allocating memory to
split huge pages that are never written and avoids increasing the TLB
miss cost on reads of those pages.
| Config: ept=Y, tdp_mmu=Y, 5% writes |
| Iteration 1 dirty memory time |
| --------------------------------------------- |
vCPU Count | eps=N (Before) | eps=N (After) | eps=Y |
------------ | -------------- | ------------- | ------------ |
2 | 0.332305091s | 0.019615027s | 0.006108211s |
4 | 0.353096020s | 0.019452131s | 0.006214670s |
8 | 0.453938562s | 0.019748246s | 0.006610997s |
16 | 0.719095024s | 0.019972171s | 0.007757889s |
32 | 1.698727124s | 0.021361615s | 0.012274432s |
64 | 2.630673582s | 0.031122014s | 0.016994683s |
96 | 3.016535213s | 0.062608739s | 0.044760838s |
Eager page splitting remains beneficial for write-heavy workloads, but
the gap is now reduced.
| Config: ept=Y, tdp_mmu=Y, 100% writes |
| Iteration 1 dirty memory time |
| --------------------------------------------- |
vCPU Count | eps=N (Before) | eps=N (After) | eps=Y |
------------ | -------------- | ------------- | ------------ |
2 | 0.317710329s | 0.296204596s | 0.058689782s |
4 | 0.337102375s | 0.299841017s | 0.060343076s |
8 | 0.386025681s | 0.297274460s | 0.060399702s |
16 | 0.791462524s | 0.298942578s | 0.062508699s |
32 | 1.719646014s | 0.313101996s | 0.075984855s |
64 | 2.527973150s | 0.455779206s | 0.079789363s |
96 | 2.681123208s | 0.673778787s | 0.165386739s |
Further study is needed to determine if the remaining gap is acceptable
for customer workloads or if eager_page_split=N still requires a-priori
knowledge of the VM workload, especially when considering these costs
extrapolated out to large VMs with e.g. 416 vCPUs and 12TB RAM.
Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Message-Id: <20221109185905.486172-3-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-10 02:59:05 +08:00
|
|
|
spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
|
2020-10-15 02:26:50 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
KVM: x86/mmu: Map TDP MMU leaf SPTE iff target level is reached
Map the leaf SPTE when handling a TDP MMU page fault if and only if the
target level is reached. A recent commit reworked the retry logic and
incorrectly assumed that walking SPTEs would never "fail", as the loop
either bails (retries) or installs parent SPs. However, the iterator
itself will bail early if it detects a frozen (REMOVED) SPTE when
stepping down. The TDP iterator also rereads the current SPTE before
stepping down specifically to avoid walking into a part of the tree that
is being removed, which means it's possible to terminate the loop without
the guts of the loop observing the frozen SPTE, e.g. if a different task
zaps a parent SPTE between the initial read and try_step_down()'s refresh.
Mapping a leaf SPTE at the wrong level results in all kinds of badness as
page table walkers interpret the SPTE as a page table, not a leaf, and
walk into the weeds.
------------[ cut here ]------------
WARNING: CPU: 1 PID: 1025 at arch/x86/kvm/mmu/tdp_mmu.c:1070 kvm_tdp_mmu_map+0x481/0x510
Modules linked in: kvm_intel
CPU: 1 PID: 1025 Comm: nx_huge_pages_t Tainted: G W 6.1.0-rc4+ #64
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:kvm_tdp_mmu_map+0x481/0x510
RSP: 0018:ffffc9000072fba8 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffffc9000072fcc0 RCX: 0000000000000027
RDX: 0000000000000027 RSI: 00000000ffffdfff RDI: ffff888277c5b4c8
RBP: ffff888107d45a10 R08: ffff888277c5b4c0 R09: ffffc9000072fa48
R10: 0000000000000001 R11: 0000000000000001 R12: ffffc9000073a0e0
R13: ffff88810fc54800 R14: ffff888107d1ae60 R15: ffff88810fc54f90
FS: 00007fba9f853740(0000) GS:ffff888277c40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000010aa7a003 CR4: 0000000000172ea0
Call Trace:
<TASK>
kvm_tdp_page_fault+0x10c/0x130
kvm_mmu_page_fault+0x103/0x680
vmx_handle_exit+0x132/0x5a0 [kvm_intel]
vcpu_enter_guest+0x60c/0x16f0
kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0
kvm_vcpu_ioctl+0x271/0x660
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x2b/0x50
entry_SYSCALL_64_after_hwframe+0x46/0xb0
</TASK>
---[ end trace 0000000000000000 ]---
Invalid SPTE change: cannot replace a present leaf
SPTE with another present leaf SPTE mapping a
different PFN!
as_id: 0 gfn: 100200 old_spte: 600000112400bf3 new_spte: 6000001126009f3 level: 2
------------[ cut here ]------------
kernel BUG at arch/x86/kvm/mmu/tdp_mmu.c:559!
invalid opcode: 0000 [#1] SMP
CPU: 1 PID: 1025 Comm: nx_huge_pages_t Tainted: G W 6.1.0-rc4+ #64
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:__handle_changed_spte.cold+0x95/0x9c
RSP: 0018:ffffc9000072faf8 EFLAGS: 00010246
RAX: 00000000000000c1 RBX: ffffc90000731000 RCX: 0000000000000027
RDX: 0000000000000000 RSI: 00000000ffffdfff RDI: ffff888277c5b4c8
RBP: 0600000112400bf3 R08: ffff888277c5b4c0 R09: ffffc9000072f9a0
R10: 0000000000000001 R11: 0000000000000001 R12: 06000001126009f3
R13: 0000000000000002 R14: 0000000012600901 R15: 0000000012400b01
FS: 00007fba9f853740(0000) GS:ffff888277c40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000010aa7a003 CR4: 0000000000172ea0
Call Trace:
<TASK>
kvm_tdp_mmu_map+0x3b0/0x510
kvm_tdp_page_fault+0x10c/0x130
kvm_mmu_page_fault+0x103/0x680
vmx_handle_exit+0x132/0x5a0 [kvm_intel]
vcpu_enter_guest+0x60c/0x16f0
kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0
kvm_vcpu_ioctl+0x271/0x660
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x2b/0x50
entry_SYSCALL_64_after_hwframe+0x46/0xb0
</TASK>
Modules linked in: kvm_intel
---[ end trace 0000000000000000 ]---
Fixes: 63d28a25e04c ("KVM: x86/mmu: simplify kvm_tdp_mmu_map flow when guest has to retry")
Cc: Robert Hoo <robert.hu@linux.intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221213033030.83345-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-13 11:30:27 +08:00
|
|
|
/*
|
|
|
|
* The walk aborted before reaching the target level, e.g. because the
|
|
|
|
* iterator detected an upper level SPTE was frozen during traversal.
|
|
|
|
*/
|
|
|
|
WARN_ON_ONCE(iter.level == fault->goal_level);
|
|
|
|
goto retry;
|
|
|
|
|
|
|
|
map_target_level:
|
2021-08-06 16:35:50 +08:00
|
|
|
ret = tdp_mmu_map_handle_target_level(vcpu, fault, &iter);
|
2020-10-15 02:26:50 +08:00
|
|
|
|
2022-11-18 00:05:51 +08:00
|
|
|
retry:
|
|
|
|
rcu_read_unlock();
|
2020-10-15 02:26:50 +08:00
|
|
|
return ret;
|
|
|
|
}
|
2020-10-15 02:26:52 +08:00
|
|
|
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 08:56:50 +08:00
|
|
|
bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
|
|
|
|
bool flush)
|
2020-10-15 02:26:52 +08:00
|
|
|
{
|
KVM: x86/mmu: Zap only TDP MMU leafs in zap range and mmu_notifier unmap
Re-introduce zapping only leaf SPTEs in kvm_zap_gfn_range() and
kvm_tdp_mmu_unmap_gfn_range(), this time without losing a pending TLB
flush when processing multiple roots (including nested TDP shadow roots).
Dropping the TLB flush resulted in random crashes when running Hyper-V
Server 2019 in a guest with KSM enabled in the host (or any source of
mmu_notifier invalidations, KSM is just the easiest to force).
This effectively revert commits 873dd122172f8cce329113cfb0dfe3d2344d80c0
and fcb93eb6d09dd302cbef22bd95a5858af75e4156, and thus restores commit
cf3e26427c08ad9015956293ab389004ac6a338e, plus this delta on top:
bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end,
struct kvm_mmu_page *root;
for_each_tdp_mmu_root_yield_safe(kvm, root, as_id)
- flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, false);
+ flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, flush);
return flush;
}
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20220325230348.2587437-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-03-26 07:03:48 +08:00
|
|
|
return kvm_tdp_mmu_zap_leafs(kvm, range->slot->as_id, range->start,
|
|
|
|
range->end, range->may_block, flush);
|
2020-10-15 02:26:52 +08:00
|
|
|
}
|
|
|
|
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 08:56:50 +08:00
|
|
|
typedef bool (*tdp_handler_t)(struct kvm *kvm, struct tdp_iter *iter,
|
|
|
|
struct kvm_gfn_range *range);
|
2020-10-15 02:26:52 +08:00
|
|
|
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 08:56:50 +08:00
|
|
|
static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,
|
|
|
|
struct kvm_gfn_range *range,
|
|
|
|
tdp_handler_t handler)
|
2020-10-15 02:26:52 +08:00
|
|
|
{
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 08:56:50 +08:00
|
|
|
struct kvm_mmu_page *root;
|
|
|
|
struct tdp_iter iter;
|
|
|
|
bool ret = false;
|
|
|
|
|
2021-04-02 08:56:58 +08:00
|
|
|
/*
|
|
|
|
* Don't support rescheduling, none of the MMU notifiers that funnel
|
|
|
|
* into this helper allow blocking; it'd be dead, wasteful code.
|
|
|
|
*/
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 08:56:50 +08:00
|
|
|
for_each_tdp_mmu_root(kvm, root, range->slot->as_id) {
|
2022-02-26 08:15:27 +08:00
|
|
|
rcu_read_lock();
|
|
|
|
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 08:56:50 +08:00
|
|
|
tdp_root_for_each_leaf_pte(iter, root, range->start, range->end)
|
|
|
|
ret |= handler(kvm, &iter, range);
|
|
|
|
|
2022-02-26 08:15:27 +08:00
|
|
|
rcu_read_unlock();
|
|
|
|
}
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 08:56:50 +08:00
|
|
|
|
|
|
|
return ret;
|
2020-10-15 02:26:52 +08:00
|
|
|
}
|
2020-10-15 02:26:53 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Mark the SPTEs range of GFNs [start, end) unaccessed and return non-zero
|
|
|
|
* if any of the GFNs in the range have been accessed.
|
2023-03-22 06:00:16 +08:00
|
|
|
*
|
|
|
|
* No need to mark the corresponding PFN as accessed as this call is coming
|
|
|
|
* from the clear_young() or clear_flush_young() notifier, which uses the
|
|
|
|
* return value to determine if the page has been accessed.
|
2020-10-15 02:26:53 +08:00
|
|
|
*/
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 08:56:50 +08:00
|
|
|
static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter,
|
|
|
|
struct kvm_gfn_range *range)
|
2020-10-15 02:26:53 +08:00
|
|
|
{
|
2023-03-22 06:00:16 +08:00
|
|
|
u64 new_spte;
|
2020-10-15 02:26:53 +08:00
|
|
|
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 08:56:50 +08:00
|
|
|
/* If we have a non-accessed entry we don't need to change the pte. */
|
|
|
|
if (!is_accessed_spte(iter->old_spte))
|
|
|
|
return false;
|
2021-02-03 02:57:23 +08:00
|
|
|
|
2023-03-22 06:00:16 +08:00
|
|
|
if (spte_ad_enabled(iter->old_spte)) {
|
|
|
|
iter->old_spte = tdp_mmu_clear_spte_bits(iter->sptep,
|
|
|
|
iter->old_spte,
|
|
|
|
shadow_accessed_mask,
|
|
|
|
iter->level);
|
|
|
|
new_spte = iter->old_spte & ~shadow_accessed_mask;
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 08:56:50 +08:00
|
|
|
} else {
|
2020-10-15 02:26:53 +08:00
|
|
|
/*
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 08:56:50 +08:00
|
|
|
* Capture the dirty status of the page, so that it doesn't get
|
|
|
|
* lost when the SPTE is marked for access tracking.
|
2020-10-15 02:26:53 +08:00
|
|
|
*/
|
2023-03-22 06:00:16 +08:00
|
|
|
if (is_writable_pte(iter->old_spte))
|
|
|
|
kvm_set_pfn_dirty(spte_to_pfn(iter->old_spte));
|
2020-10-15 02:26:53 +08:00
|
|
|
|
2023-03-22 06:00:16 +08:00
|
|
|
new_spte = mark_spte_for_access_track(iter->old_spte);
|
|
|
|
iter->old_spte = kvm_tdp_mmu_write_spte(iter->sptep,
|
|
|
|
iter->old_spte, new_spte,
|
|
|
|
iter->level);
|
2020-10-15 02:26:53 +08:00
|
|
|
}
|
|
|
|
|
2023-03-22 06:00:18 +08:00
|
|
|
trace_kvm_tdp_mmu_spte_changed(iter->as_id, iter->gfn, iter->level,
|
|
|
|
iter->old_spte, new_spte);
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 08:56:50 +08:00
|
|
|
return true;
|
2020-10-15 02:26:53 +08:00
|
|
|
}
|
|
|
|
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 08:56:50 +08:00
|
|
|
bool kvm_tdp_mmu_age_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
|
2020-10-15 02:26:53 +08:00
|
|
|
{
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 08:56:50 +08:00
|
|
|
return kvm_tdp_mmu_handle_gfn(kvm, range, age_gfn_range);
|
2020-10-15 02:26:53 +08:00
|
|
|
}
|
|
|
|
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 08:56:50 +08:00
|
|
|
static bool test_age_gfn(struct kvm *kvm, struct tdp_iter *iter,
|
|
|
|
struct kvm_gfn_range *range)
|
2020-10-15 02:26:53 +08:00
|
|
|
{
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 08:56:50 +08:00
|
|
|
return is_accessed_spte(iter->old_spte);
|
2020-10-15 02:26:53 +08:00
|
|
|
}
|
|
|
|
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 08:56:50 +08:00
|
|
|
bool kvm_tdp_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
|
2020-10-15 02:26:53 +08:00
|
|
|
{
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 08:56:50 +08:00
|
|
|
return kvm_tdp_mmu_handle_gfn(kvm, range, test_age_gfn);
|
2020-10-15 02:26:53 +08:00
|
|
|
}
|
2020-10-15 02:26:54 +08:00
|
|
|
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 08:56:50 +08:00
|
|
|
static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
|
|
|
|
struct kvm_gfn_range *range)
|
2020-10-15 02:26:54 +08:00
|
|
|
{
|
|
|
|
u64 new_spte;
|
2021-02-03 02:57:23 +08:00
|
|
|
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 08:56:50 +08:00
|
|
|
/* Huge pages aren't expected to be modified without first being zapped. */
|
KVM: x86/mmu: Convert "runtime" WARN_ON() assertions to WARN_ON_ONCE()
Convert all "runtime" assertions, i.e. assertions that can be triggered
while running vCPUs, from WARN_ON() to WARN_ON_ONCE(). Every WARN in the
MMU that is tied to running vCPUs, i.e. not contained to loading and
initializing KVM, is likely to fire _a lot_ when it does trigger. E.g. if
KVM ends up with a bug that causes a root to be invalidated before the
page fault handler is invoked, pretty much _every_ page fault VM-Exit
triggers the WARN.
If a WARN is triggered frequently, the resulting spam usually causes a lot
of damage of its own, e.g. consumes resources to log the WARN and pollutes
the kernel log, often to the point where other useful information can be
lost. In many case, the damage caused by the spam is actually worse than
the bug itself, e.g. KVM can almost always recover from an unexpectedly
invalid root.
On the flip side, warning every time is rarely helpful for debug and
triage, i.e. a single splat is usually sufficient to point a debugger in
the right direction, and automated testing, e.g. syzkaller, typically runs
with warn_on_panic=1, i.e. will never get past the first WARN anyways.
Lastly, when an assertions fails multiple times, the stack traces in KVM
are almost always identical, i.e. the full splat only needs to be captured
once. And _if_ there is value in captruing information about the failed
assert, a ratelimited printk() is sufficient and less likely to rack up a
large amount of collateral damage.
Link: https://lore.kernel.org/r/20230729004722.1056172-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-29 08:47:17 +08:00
|
|
|
WARN_ON_ONCE(pte_huge(range->arg.pte) || range->start + 1 != range->end);
|
2020-10-15 02:26:54 +08:00
|
|
|
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 08:56:50 +08:00
|
|
|
if (iter->level != PG_LEVEL_4K ||
|
|
|
|
!is_shadow_present_pte(iter->old_spte))
|
|
|
|
return false;
|
2020-10-15 02:26:54 +08:00
|
|
|
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 08:56:50 +08:00
|
|
|
/*
|
|
|
|
* Note, when changing a read-only SPTE, it's not strictly necessary to
|
|
|
|
* zero the SPTE before setting the new PFN, but doing so preserves the
|
|
|
|
* invariant that the PFN of a present * leaf SPTE can never change.
|
2023-03-22 06:00:21 +08:00
|
|
|
* See handle_changed_spte().
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 08:56:50 +08:00
|
|
|
*/
|
2023-03-22 06:00:19 +08:00
|
|
|
tdp_mmu_iter_set_spte(kvm, iter, 0);
|
2020-10-15 02:26:54 +08:00
|
|
|
|
2023-07-29 08:41:44 +08:00
|
|
|
if (!pte_write(range->arg.pte)) {
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 08:56:50 +08:00
|
|
|
new_spte = kvm_mmu_changed_pte_notifier_make_spte(iter->old_spte,
|
2023-07-29 08:41:44 +08:00
|
|
|
pte_pfn(range->arg.pte));
|
2020-10-15 02:26:54 +08:00
|
|
|
|
2023-03-22 06:00:19 +08:00
|
|
|
tdp_mmu_iter_set_spte(kvm, iter, new_spte);
|
2020-10-15 02:26:54 +08:00
|
|
|
}
|
|
|
|
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 08:56:50 +08:00
|
|
|
return true;
|
2020-10-15 02:26:54 +08:00
|
|
|
}
|
|
|
|
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 08:56:50 +08:00
|
|
|
/*
|
|
|
|
* Handle the changed_pte MMU notifier for the TDP MMU.
|
|
|
|
* data is a pointer to the new pte_t mapping the HVA specified by the MMU
|
|
|
|
* notifier.
|
|
|
|
* Returns non-zero if a flush is needed before releasing the MMU lock.
|
|
|
|
*/
|
|
|
|
bool kvm_tdp_mmu_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
|
2020-10-15 02:26:54 +08:00
|
|
|
{
|
2022-02-26 08:15:26 +08:00
|
|
|
/*
|
|
|
|
* No need to handle the remote TLB flush under RCU protection, the
|
|
|
|
* target SPTE _must_ be a leaf SPTE, i.e. cannot result in freeing a
|
2023-03-22 06:00:21 +08:00
|
|
|
* shadow page. See the WARN on pfn_changed in handle_changed_spte().
|
2022-02-26 08:15:26 +08:00
|
|
|
*/
|
|
|
|
return kvm_tdp_mmu_handle_gfn(kvm, range, set_spte_gfn);
|
2020-10-15 02:26:54 +08:00
|
|
|
}
|
|
|
|
|
2020-10-15 02:26:55 +08:00
|
|
|
/*
|
2021-05-27 00:32:27 +08:00
|
|
|
* Remove write access from all SPTEs at or above min_level that map GFNs
|
|
|
|
* [start, end). Returns true if an SPTE has been changed and the TLBs need to
|
|
|
|
* be flushed.
|
2020-10-15 02:26:55 +08:00
|
|
|
*/
|
|
|
|
static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
|
|
|
|
gfn_t start, gfn_t end, int min_level)
|
|
|
|
{
|
|
|
|
struct tdp_iter iter;
|
|
|
|
u64 new_spte;
|
|
|
|
bool spte_set = false;
|
|
|
|
|
2021-02-03 02:57:23 +08:00
|
|
|
rcu_read_lock();
|
|
|
|
|
2020-10-15 02:26:55 +08:00
|
|
|
BUG_ON(min_level > KVM_MAX_HUGEPAGE_LEVEL);
|
|
|
|
|
2022-01-20 07:07:32 +08:00
|
|
|
for_each_tdp_pte_min_level(iter, root, min_level, start, end) {
|
2021-04-02 07:37:34 +08:00
|
|
|
retry:
|
|
|
|
if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
|
2021-02-03 02:57:20 +08:00
|
|
|
continue;
|
|
|
|
|
2020-10-15 02:26:55 +08:00
|
|
|
if (!is_shadow_present_pte(iter.old_spte) ||
|
2021-02-03 02:57:21 +08:00
|
|
|
!is_last_spte(iter.old_spte, iter.level) ||
|
|
|
|
!(iter.old_spte & PT_WRITABLE_MASK))
|
2020-10-15 02:26:55 +08:00
|
|
|
continue;
|
|
|
|
|
|
|
|
new_spte = iter.old_spte & ~PT_WRITABLE_MASK;
|
|
|
|
|
2022-01-20 07:07:25 +08:00
|
|
|
if (tdp_mmu_set_spte_atomic(kvm, &iter, new_spte))
|
2021-04-02 07:37:34 +08:00
|
|
|
goto retry;
|
2022-01-20 07:07:24 +08:00
|
|
|
|
2020-10-15 02:26:55 +08:00
|
|
|
spte_set = true;
|
|
|
|
}
|
2021-02-03 02:57:23 +08:00
|
|
|
|
|
|
|
rcu_read_unlock();
|
2020-10-15 02:26:55 +08:00
|
|
|
return spte_set;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Remove write access from all the SPTEs mapping GFNs in the memslot. Will
|
|
|
|
* only affect leaf SPTEs down to min_level.
|
|
|
|
* Returns true if an SPTE has been changed and the TLBs need to be flushed.
|
|
|
|
*/
|
2021-07-13 10:33:38 +08:00
|
|
|
bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
|
|
|
|
const struct kvm_memory_slot *slot, int min_level)
|
2020-10-15 02:26:55 +08:00
|
|
|
{
|
|
|
|
struct kvm_mmu_page *root;
|
|
|
|
bool spte_set = false;
|
|
|
|
|
2021-04-02 07:37:34 +08:00
|
|
|
lockdep_assert_held_read(&kvm->mmu_lock);
|
2020-10-15 02:26:55 +08:00
|
|
|
|
2021-12-15 09:15:56 +08:00
|
|
|
for_each_valid_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, true)
|
2020-10-15 02:26:55 +08:00
|
|
|
spte_set |= wrprot_gfn_range(kvm, root, slot->base_gfn,
|
|
|
|
slot->base_gfn + slot->npages, min_level);
|
|
|
|
|
|
|
|
return spte_set;
|
|
|
|
}
|
|
|
|
|
KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is enabled
When dirty logging is enabled without initially-all-set, try to split
all huge pages in the memslot down to 4KB pages so that vCPUs do not
have to take expensive write-protection faults to split huge pages.
Eager page splitting is best-effort only. This commit only adds the
support for the TDP MMU, and even there splitting may fail due to out
of memory conditions. Failures to split a huge page is fine from a
correctness standpoint because KVM will always follow up splitting by
write-protecting any remaining huge pages.
Eager page splitting moves the cost of splitting huge pages off of the
vCPU threads and onto the thread enabling dirty logging on the memslot.
This is useful because:
1. Splitting on the vCPU thread interrupts vCPUs execution and is
disruptive to customers whereas splitting on VM ioctl threads can
run in parallel with vCPU execution.
2. Splitting all huge pages at once is more efficient because it does
not require performing VM-exit handling or walking the page table for
every 4KiB page in the memslot, and greatly reduces the amount of
contention on the mmu_lock.
For example, when running dirty_log_perf_test with 96 virtual CPUs, 1GiB
per vCPU, and 1GiB HugeTLB memory, the time it takes vCPUs to write to
all of their memory after dirty logging is enabled decreased by 95% from
2.94s to 0.14s.
Eager Page Splitting is over 100x more efficient than the current
implementation of splitting on fault under the read lock. For example,
taking the same workload as above, Eager Page Splitting reduced the CPU
required to split all huge pages from ~270 CPU-seconds ((2.94s - 0.14s)
* 96 vCPU threads) to only 1.55 CPU-seconds.
Eager page splitting does increase the amount of time it takes to enable
dirty logging since it has split all huge pages. For example, the time
it took to enable dirty logging in the 96GiB region of the
aforementioned test increased from 0.001s to 1.55s.
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-16-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-20 07:07:36 +08:00
|
|
|
static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
|
|
|
|
{
|
|
|
|
struct kvm_mmu_page *sp;
|
|
|
|
|
|
|
|
gfp |= __GFP_ZERO;
|
|
|
|
|
|
|
|
sp = kmem_cache_alloc(mmu_page_header_cache, gfp);
|
|
|
|
if (!sp)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
sp->spt = (void *)__get_free_page(gfp);
|
|
|
|
if (!sp->spt) {
|
|
|
|
kmem_cache_free(mmu_page_header_cache, sp);
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
return sp;
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
|
2022-01-20 07:07:37 +08:00
|
|
|
struct tdp_iter *iter,
|
|
|
|
bool shared)
|
KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is enabled
When dirty logging is enabled without initially-all-set, try to split
all huge pages in the memslot down to 4KB pages so that vCPUs do not
have to take expensive write-protection faults to split huge pages.
Eager page splitting is best-effort only. This commit only adds the
support for the TDP MMU, and even there splitting may fail due to out
of memory conditions. Failures to split a huge page is fine from a
correctness standpoint because KVM will always follow up splitting by
write-protecting any remaining huge pages.
Eager page splitting moves the cost of splitting huge pages off of the
vCPU threads and onto the thread enabling dirty logging on the memslot.
This is useful because:
1. Splitting on the vCPU thread interrupts vCPUs execution and is
disruptive to customers whereas splitting on VM ioctl threads can
run in parallel with vCPU execution.
2. Splitting all huge pages at once is more efficient because it does
not require performing VM-exit handling or walking the page table for
every 4KiB page in the memslot, and greatly reduces the amount of
contention on the mmu_lock.
For example, when running dirty_log_perf_test with 96 virtual CPUs, 1GiB
per vCPU, and 1GiB HugeTLB memory, the time it takes vCPUs to write to
all of their memory after dirty logging is enabled decreased by 95% from
2.94s to 0.14s.
Eager Page Splitting is over 100x more efficient than the current
implementation of splitting on fault under the read lock. For example,
taking the same workload as above, Eager Page Splitting reduced the CPU
required to split all huge pages from ~270 CPU-seconds ((2.94s - 0.14s)
* 96 vCPU threads) to only 1.55 CPU-seconds.
Eager page splitting does increase the amount of time it takes to enable
dirty logging since it has split all huge pages. For example, the time
it took to enable dirty logging in the 96GiB region of the
aforementioned test increased from 0.001s to 1.55s.
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-16-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-20 07:07:36 +08:00
|
|
|
{
|
|
|
|
struct kvm_mmu_page *sp;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Since we are allocating while under the MMU lock we have to be
|
|
|
|
* careful about GFP flags. Use GFP_NOWAIT to avoid blocking on direct
|
|
|
|
* reclaim and to avoid making any filesystem callbacks (which can end
|
|
|
|
* up invoking KVM MMU notifiers, resulting in a deadlock).
|
|
|
|
*
|
|
|
|
* If this allocation fails we drop the lock and retry with reclaim
|
|
|
|
* allowed.
|
|
|
|
*/
|
|
|
|
sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT);
|
|
|
|
if (sp)
|
|
|
|
return sp;
|
|
|
|
|
|
|
|
rcu_read_unlock();
|
2022-01-20 07:07:37 +08:00
|
|
|
|
|
|
|
if (shared)
|
|
|
|
read_unlock(&kvm->mmu_lock);
|
|
|
|
else
|
|
|
|
write_unlock(&kvm->mmu_lock);
|
KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is enabled
When dirty logging is enabled without initially-all-set, try to split
all huge pages in the memslot down to 4KB pages so that vCPUs do not
have to take expensive write-protection faults to split huge pages.
Eager page splitting is best-effort only. This commit only adds the
support for the TDP MMU, and even there splitting may fail due to out
of memory conditions. Failures to split a huge page is fine from a
correctness standpoint because KVM will always follow up splitting by
write-protecting any remaining huge pages.
Eager page splitting moves the cost of splitting huge pages off of the
vCPU threads and onto the thread enabling dirty logging on the memslot.
This is useful because:
1. Splitting on the vCPU thread interrupts vCPUs execution and is
disruptive to customers whereas splitting on VM ioctl threads can
run in parallel with vCPU execution.
2. Splitting all huge pages at once is more efficient because it does
not require performing VM-exit handling or walking the page table for
every 4KiB page in the memslot, and greatly reduces the amount of
contention on the mmu_lock.
For example, when running dirty_log_perf_test with 96 virtual CPUs, 1GiB
per vCPU, and 1GiB HugeTLB memory, the time it takes vCPUs to write to
all of their memory after dirty logging is enabled decreased by 95% from
2.94s to 0.14s.
Eager Page Splitting is over 100x more efficient than the current
implementation of splitting on fault under the read lock. For example,
taking the same workload as above, Eager Page Splitting reduced the CPU
required to split all huge pages from ~270 CPU-seconds ((2.94s - 0.14s)
* 96 vCPU threads) to only 1.55 CPU-seconds.
Eager page splitting does increase the amount of time it takes to enable
dirty logging since it has split all huge pages. For example, the time
it took to enable dirty logging in the 96GiB region of the
aforementioned test increased from 0.001s to 1.55s.
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-16-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-20 07:07:36 +08:00
|
|
|
|
|
|
|
iter->yielded = true;
|
|
|
|
sp = __tdp_mmu_alloc_sp_for_split(GFP_KERNEL_ACCOUNT);
|
|
|
|
|
2022-01-20 07:07:37 +08:00
|
|
|
if (shared)
|
|
|
|
read_lock(&kvm->mmu_lock);
|
|
|
|
else
|
|
|
|
write_lock(&kvm->mmu_lock);
|
|
|
|
|
KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is enabled
When dirty logging is enabled without initially-all-set, try to split
all huge pages in the memslot down to 4KB pages so that vCPUs do not
have to take expensive write-protection faults to split huge pages.
Eager page splitting is best-effort only. This commit only adds the
support for the TDP MMU, and even there splitting may fail due to out
of memory conditions. Failures to split a huge page is fine from a
correctness standpoint because KVM will always follow up splitting by
write-protecting any remaining huge pages.
Eager page splitting moves the cost of splitting huge pages off of the
vCPU threads and onto the thread enabling dirty logging on the memslot.
This is useful because:
1. Splitting on the vCPU thread interrupts vCPUs execution and is
disruptive to customers whereas splitting on VM ioctl threads can
run in parallel with vCPU execution.
2. Splitting all huge pages at once is more efficient because it does
not require performing VM-exit handling or walking the page table for
every 4KiB page in the memslot, and greatly reduces the amount of
contention on the mmu_lock.
For example, when running dirty_log_perf_test with 96 virtual CPUs, 1GiB
per vCPU, and 1GiB HugeTLB memory, the time it takes vCPUs to write to
all of their memory after dirty logging is enabled decreased by 95% from
2.94s to 0.14s.
Eager Page Splitting is over 100x more efficient than the current
implementation of splitting on fault under the read lock. For example,
taking the same workload as above, Eager Page Splitting reduced the CPU
required to split all huge pages from ~270 CPU-seconds ((2.94s - 0.14s)
* 96 vCPU threads) to only 1.55 CPU-seconds.
Eager page splitting does increase the amount of time it takes to enable
dirty logging since it has split all huge pages. For example, the time
it took to enable dirty logging in the 96GiB region of the
aforementioned test increased from 0.001s to 1.55s.
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-16-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-20 07:07:36 +08:00
|
|
|
rcu_read_lock();
|
|
|
|
|
|
|
|
return sp;
|
|
|
|
}
|
|
|
|
|
KVM: x86/mmu: Split huge pages mapped by the TDP MMU on fault
Now that the TDP MMU has a mechanism to split huge pages, use it in the
fault path when a huge page needs to be replaced with a mapping at a
lower level.
This change reduces the negative performance impact of NX HugePages.
Prior to this change if a vCPU executed from a huge page and NX
HugePages was enabled, the vCPU would take a fault, zap the huge page,
and mapping the faulting address at 4KiB with execute permissions
enabled. The rest of the memory would be left *unmapped* and have to be
faulted back in by the guest upon access (read, write, or execute). If
guest is backed by 1GiB, a single execute instruction can zap an entire
GiB of its physical address space.
For example, it can take a VM longer to execute from its memory than to
populate that memory in the first place:
$ ./execute_perf_test -s anonymous_hugetlb_1gb -v96
Populating memory : 2.748378795s
Executing from memory : 2.899670885s
With this change, such faults split the huge page instead of zapping it,
which avoids the non-present faults on the rest of the huge page:
$ ./execute_perf_test -s anonymous_hugetlb_1gb -v96
Populating memory : 2.729544474s
Executing from memory : 0.111965688s <---
This change also reduces the performance impact of dirty logging when
eager_page_split=N. eager_page_split=N (abbreviated "eps=N" below) can
be desirable for read-heavy workloads, as it avoids allocating memory to
split huge pages that are never written and avoids increasing the TLB
miss cost on reads of those pages.
| Config: ept=Y, tdp_mmu=Y, 5% writes |
| Iteration 1 dirty memory time |
| --------------------------------------------- |
vCPU Count | eps=N (Before) | eps=N (After) | eps=Y |
------------ | -------------- | ------------- | ------------ |
2 | 0.332305091s | 0.019615027s | 0.006108211s |
4 | 0.353096020s | 0.019452131s | 0.006214670s |
8 | 0.453938562s | 0.019748246s | 0.006610997s |
16 | 0.719095024s | 0.019972171s | 0.007757889s |
32 | 1.698727124s | 0.021361615s | 0.012274432s |
64 | 2.630673582s | 0.031122014s | 0.016994683s |
96 | 3.016535213s | 0.062608739s | 0.044760838s |
Eager page splitting remains beneficial for write-heavy workloads, but
the gap is now reduced.
| Config: ept=Y, tdp_mmu=Y, 100% writes |
| Iteration 1 dirty memory time |
| --------------------------------------------- |
vCPU Count | eps=N (Before) | eps=N (After) | eps=Y |
------------ | -------------- | ------------- | ------------ |
2 | 0.317710329s | 0.296204596s | 0.058689782s |
4 | 0.337102375s | 0.299841017s | 0.060343076s |
8 | 0.386025681s | 0.297274460s | 0.060399702s |
16 | 0.791462524s | 0.298942578s | 0.062508699s |
32 | 1.719646014s | 0.313101996s | 0.075984855s |
64 | 2.527973150s | 0.455779206s | 0.079789363s |
96 | 2.681123208s | 0.673778787s | 0.165386739s |
Further study is needed to determine if the remaining gap is acceptable
for customer workloads or if eager_page_split=N still requires a-priori
knowledge of the VM workload, especially when considering these costs
extrapolated out to large VMs with e.g. 416 vCPUs and 12TB RAM.
Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Message-Id: <20221109185905.486172-3-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-10 02:59:05 +08:00
|
|
|
/* Note, the caller is responsible for initializing @sp. */
|
2022-01-20 07:07:37 +08:00
|
|
|
static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
|
|
|
|
struct kvm_mmu_page *sp, bool shared)
|
KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is enabled
When dirty logging is enabled without initially-all-set, try to split
all huge pages in the memslot down to 4KB pages so that vCPUs do not
have to take expensive write-protection faults to split huge pages.
Eager page splitting is best-effort only. This commit only adds the
support for the TDP MMU, and even there splitting may fail due to out
of memory conditions. Failures to split a huge page is fine from a
correctness standpoint because KVM will always follow up splitting by
write-protecting any remaining huge pages.
Eager page splitting moves the cost of splitting huge pages off of the
vCPU threads and onto the thread enabling dirty logging on the memslot.
This is useful because:
1. Splitting on the vCPU thread interrupts vCPUs execution and is
disruptive to customers whereas splitting on VM ioctl threads can
run in parallel with vCPU execution.
2. Splitting all huge pages at once is more efficient because it does
not require performing VM-exit handling or walking the page table for
every 4KiB page in the memslot, and greatly reduces the amount of
contention on the mmu_lock.
For example, when running dirty_log_perf_test with 96 virtual CPUs, 1GiB
per vCPU, and 1GiB HugeTLB memory, the time it takes vCPUs to write to
all of their memory after dirty logging is enabled decreased by 95% from
2.94s to 0.14s.
Eager Page Splitting is over 100x more efficient than the current
implementation of splitting on fault under the read lock. For example,
taking the same workload as above, Eager Page Splitting reduced the CPU
required to split all huge pages from ~270 CPU-seconds ((2.94s - 0.14s)
* 96 vCPU threads) to only 1.55 CPU-seconds.
Eager page splitting does increase the amount of time it takes to enable
dirty logging since it has split all huge pages. For example, the time
it took to enable dirty logging in the 96GiB region of the
aforementioned test increased from 0.001s to 1.55s.
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-16-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-20 07:07:36 +08:00
|
|
|
{
|
|
|
|
const u64 huge_spte = iter->old_spte;
|
|
|
|
const int level = iter->level;
|
|
|
|
int ret, i;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* No need for atomics when writing to sp->spt since the page table has
|
|
|
|
* not been linked in yet and thus is not reachable from any other CPU.
|
|
|
|
*/
|
2022-06-15 07:33:25 +08:00
|
|
|
for (i = 0; i < SPTE_ENT_PER_PAGE; i++)
|
2022-06-23 03:27:05 +08:00
|
|
|
sp->spt[i] = make_huge_page_split_spte(kvm, huge_spte, sp->role, i);
|
KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is enabled
When dirty logging is enabled without initially-all-set, try to split
all huge pages in the memslot down to 4KB pages so that vCPUs do not
have to take expensive write-protection faults to split huge pages.
Eager page splitting is best-effort only. This commit only adds the
support for the TDP MMU, and even there splitting may fail due to out
of memory conditions. Failures to split a huge page is fine from a
correctness standpoint because KVM will always follow up splitting by
write-protecting any remaining huge pages.
Eager page splitting moves the cost of splitting huge pages off of the
vCPU threads and onto the thread enabling dirty logging on the memslot.
This is useful because:
1. Splitting on the vCPU thread interrupts vCPUs execution and is
disruptive to customers whereas splitting on VM ioctl threads can
run in parallel with vCPU execution.
2. Splitting all huge pages at once is more efficient because it does
not require performing VM-exit handling or walking the page table for
every 4KiB page in the memslot, and greatly reduces the amount of
contention on the mmu_lock.
For example, when running dirty_log_perf_test with 96 virtual CPUs, 1GiB
per vCPU, and 1GiB HugeTLB memory, the time it takes vCPUs to write to
all of their memory after dirty logging is enabled decreased by 95% from
2.94s to 0.14s.
Eager Page Splitting is over 100x more efficient than the current
implementation of splitting on fault under the read lock. For example,
taking the same workload as above, Eager Page Splitting reduced the CPU
required to split all huge pages from ~270 CPU-seconds ((2.94s - 0.14s)
* 96 vCPU threads) to only 1.55 CPU-seconds.
Eager page splitting does increase the amount of time it takes to enable
dirty logging since it has split all huge pages. For example, the time
it took to enable dirty logging in the 96GiB region of the
aforementioned test increased from 0.001s to 1.55s.
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-16-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-20 07:07:36 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Replace the huge spte with a pointer to the populated lower level
|
|
|
|
* page table. Since we are making this change without a TLB flush vCPUs
|
|
|
|
* will see a mix of the split mappings and the original huge mapping,
|
|
|
|
* depending on what's currently in their TLB. This is fine from a
|
|
|
|
* correctness standpoint since the translation will be the same either
|
|
|
|
* way.
|
|
|
|
*/
|
2022-10-20 00:56:14 +08:00
|
|
|
ret = tdp_mmu_link_sp(kvm, iter, sp, shared);
|
KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is enabled
When dirty logging is enabled without initially-all-set, try to split
all huge pages in the memslot down to 4KB pages so that vCPUs do not
have to take expensive write-protection faults to split huge pages.
Eager page splitting is best-effort only. This commit only adds the
support for the TDP MMU, and even there splitting may fail due to out
of memory conditions. Failures to split a huge page is fine from a
correctness standpoint because KVM will always follow up splitting by
write-protecting any remaining huge pages.
Eager page splitting moves the cost of splitting huge pages off of the
vCPU threads and onto the thread enabling dirty logging on the memslot.
This is useful because:
1. Splitting on the vCPU thread interrupts vCPUs execution and is
disruptive to customers whereas splitting on VM ioctl threads can
run in parallel with vCPU execution.
2. Splitting all huge pages at once is more efficient because it does
not require performing VM-exit handling or walking the page table for
every 4KiB page in the memslot, and greatly reduces the amount of
contention on the mmu_lock.
For example, when running dirty_log_perf_test with 96 virtual CPUs, 1GiB
per vCPU, and 1GiB HugeTLB memory, the time it takes vCPUs to write to
all of their memory after dirty logging is enabled decreased by 95% from
2.94s to 0.14s.
Eager Page Splitting is over 100x more efficient than the current
implementation of splitting on fault under the read lock. For example,
taking the same workload as above, Eager Page Splitting reduced the CPU
required to split all huge pages from ~270 CPU-seconds ((2.94s - 0.14s)
* 96 vCPU threads) to only 1.55 CPU-seconds.
Eager page splitting does increase the amount of time it takes to enable
dirty logging since it has split all huge pages. For example, the time
it took to enable dirty logging in the 96GiB region of the
aforementioned test increased from 0.001s to 1.55s.
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-16-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-20 07:07:36 +08:00
|
|
|
if (ret)
|
2022-01-20 07:07:38 +08:00
|
|
|
goto out;
|
KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is enabled
When dirty logging is enabled without initially-all-set, try to split
all huge pages in the memslot down to 4KB pages so that vCPUs do not
have to take expensive write-protection faults to split huge pages.
Eager page splitting is best-effort only. This commit only adds the
support for the TDP MMU, and even there splitting may fail due to out
of memory conditions. Failures to split a huge page is fine from a
correctness standpoint because KVM will always follow up splitting by
write-protecting any remaining huge pages.
Eager page splitting moves the cost of splitting huge pages off of the
vCPU threads and onto the thread enabling dirty logging on the memslot.
This is useful because:
1. Splitting on the vCPU thread interrupts vCPUs execution and is
disruptive to customers whereas splitting on VM ioctl threads can
run in parallel with vCPU execution.
2. Splitting all huge pages at once is more efficient because it does
not require performing VM-exit handling or walking the page table for
every 4KiB page in the memslot, and greatly reduces the amount of
contention on the mmu_lock.
For example, when running dirty_log_perf_test with 96 virtual CPUs, 1GiB
per vCPU, and 1GiB HugeTLB memory, the time it takes vCPUs to write to
all of their memory after dirty logging is enabled decreased by 95% from
2.94s to 0.14s.
Eager Page Splitting is over 100x more efficient than the current
implementation of splitting on fault under the read lock. For example,
taking the same workload as above, Eager Page Splitting reduced the CPU
required to split all huge pages from ~270 CPU-seconds ((2.94s - 0.14s)
* 96 vCPU threads) to only 1.55 CPU-seconds.
Eager page splitting does increase the amount of time it takes to enable
dirty logging since it has split all huge pages. For example, the time
it took to enable dirty logging in the 96GiB region of the
aforementioned test increased from 0.001s to 1.55s.
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-16-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-20 07:07:36 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* tdp_mmu_link_sp_atomic() will handle subtracting the huge page we
|
|
|
|
* are overwriting from the page stats. But we have to manually update
|
|
|
|
* the page stats with the new present child pages.
|
|
|
|
*/
|
2022-06-15 07:33:25 +08:00
|
|
|
kvm_update_page_stats(kvm, level - 1, SPTE_ENT_PER_PAGE);
|
KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is enabled
When dirty logging is enabled without initially-all-set, try to split
all huge pages in the memslot down to 4KB pages so that vCPUs do not
have to take expensive write-protection faults to split huge pages.
Eager page splitting is best-effort only. This commit only adds the
support for the TDP MMU, and even there splitting may fail due to out
of memory conditions. Failures to split a huge page is fine from a
correctness standpoint because KVM will always follow up splitting by
write-protecting any remaining huge pages.
Eager page splitting moves the cost of splitting huge pages off of the
vCPU threads and onto the thread enabling dirty logging on the memslot.
This is useful because:
1. Splitting on the vCPU thread interrupts vCPUs execution and is
disruptive to customers whereas splitting on VM ioctl threads can
run in parallel with vCPU execution.
2. Splitting all huge pages at once is more efficient because it does
not require performing VM-exit handling or walking the page table for
every 4KiB page in the memslot, and greatly reduces the amount of
contention on the mmu_lock.
For example, when running dirty_log_perf_test with 96 virtual CPUs, 1GiB
per vCPU, and 1GiB HugeTLB memory, the time it takes vCPUs to write to
all of their memory after dirty logging is enabled decreased by 95% from
2.94s to 0.14s.
Eager Page Splitting is over 100x more efficient than the current
implementation of splitting on fault under the read lock. For example,
taking the same workload as above, Eager Page Splitting reduced the CPU
required to split all huge pages from ~270 CPU-seconds ((2.94s - 0.14s)
* 96 vCPU threads) to only 1.55 CPU-seconds.
Eager page splitting does increase the amount of time it takes to enable
dirty logging since it has split all huge pages. For example, the time
it took to enable dirty logging in the 96GiB region of the
aforementioned test increased from 0.001s to 1.55s.
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-16-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-20 07:07:36 +08:00
|
|
|
|
2022-01-20 07:07:38 +08:00
|
|
|
out:
|
|
|
|
trace_kvm_mmu_split_huge_page(iter->gfn, huge_spte, level, ret);
|
|
|
|
return ret;
|
KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is enabled
When dirty logging is enabled without initially-all-set, try to split
all huge pages in the memslot down to 4KB pages so that vCPUs do not
have to take expensive write-protection faults to split huge pages.
Eager page splitting is best-effort only. This commit only adds the
support for the TDP MMU, and even there splitting may fail due to out
of memory conditions. Failures to split a huge page is fine from a
correctness standpoint because KVM will always follow up splitting by
write-protecting any remaining huge pages.
Eager page splitting moves the cost of splitting huge pages off of the
vCPU threads and onto the thread enabling dirty logging on the memslot.
This is useful because:
1. Splitting on the vCPU thread interrupts vCPUs execution and is
disruptive to customers whereas splitting on VM ioctl threads can
run in parallel with vCPU execution.
2. Splitting all huge pages at once is more efficient because it does
not require performing VM-exit handling or walking the page table for
every 4KiB page in the memslot, and greatly reduces the amount of
contention on the mmu_lock.
For example, when running dirty_log_perf_test with 96 virtual CPUs, 1GiB
per vCPU, and 1GiB HugeTLB memory, the time it takes vCPUs to write to
all of their memory after dirty logging is enabled decreased by 95% from
2.94s to 0.14s.
Eager Page Splitting is over 100x more efficient than the current
implementation of splitting on fault under the read lock. For example,
taking the same workload as above, Eager Page Splitting reduced the CPU
required to split all huge pages from ~270 CPU-seconds ((2.94s - 0.14s)
* 96 vCPU threads) to only 1.55 CPU-seconds.
Eager page splitting does increase the amount of time it takes to enable
dirty logging since it has split all huge pages. For example, the time
it took to enable dirty logging in the 96GiB region of the
aforementioned test increased from 0.001s to 1.55s.
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-16-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-20 07:07:36 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
|
|
|
|
struct kvm_mmu_page *root,
|
|
|
|
gfn_t start, gfn_t end,
|
2022-01-20 07:07:37 +08:00
|
|
|
int target_level, bool shared)
|
KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is enabled
When dirty logging is enabled without initially-all-set, try to split
all huge pages in the memslot down to 4KB pages so that vCPUs do not
have to take expensive write-protection faults to split huge pages.
Eager page splitting is best-effort only. This commit only adds the
support for the TDP MMU, and even there splitting may fail due to out
of memory conditions. Failures to split a huge page is fine from a
correctness standpoint because KVM will always follow up splitting by
write-protecting any remaining huge pages.
Eager page splitting moves the cost of splitting huge pages off of the
vCPU threads and onto the thread enabling dirty logging on the memslot.
This is useful because:
1. Splitting on the vCPU thread interrupts vCPUs execution and is
disruptive to customers whereas splitting on VM ioctl threads can
run in parallel with vCPU execution.
2. Splitting all huge pages at once is more efficient because it does
not require performing VM-exit handling or walking the page table for
every 4KiB page in the memslot, and greatly reduces the amount of
contention on the mmu_lock.
For example, when running dirty_log_perf_test with 96 virtual CPUs, 1GiB
per vCPU, and 1GiB HugeTLB memory, the time it takes vCPUs to write to
all of their memory after dirty logging is enabled decreased by 95% from
2.94s to 0.14s.
Eager Page Splitting is over 100x more efficient than the current
implementation of splitting on fault under the read lock. For example,
taking the same workload as above, Eager Page Splitting reduced the CPU
required to split all huge pages from ~270 CPU-seconds ((2.94s - 0.14s)
* 96 vCPU threads) to only 1.55 CPU-seconds.
Eager page splitting does increase the amount of time it takes to enable
dirty logging since it has split all huge pages. For example, the time
it took to enable dirty logging in the 96GiB region of the
aforementioned test increased from 0.001s to 1.55s.
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-16-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-20 07:07:36 +08:00
|
|
|
{
|
|
|
|
struct kvm_mmu_page *sp = NULL;
|
|
|
|
struct tdp_iter iter;
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Traverse the page table splitting all huge pages above the target
|
|
|
|
* level into one lower level. For example, if we encounter a 1GB page
|
|
|
|
* we split it into 512 2MB pages.
|
|
|
|
*
|
|
|
|
* Since the TDP iterator uses a pre-order traversal, we are guaranteed
|
|
|
|
* to visit an SPTE before ever visiting its children, which means we
|
|
|
|
* will correctly recursively split huge pages that are more than one
|
|
|
|
* level above the target level (e.g. splitting a 1GB to 512 2MB pages,
|
|
|
|
* and then splitting each of those to 512 4KB pages).
|
|
|
|
*/
|
|
|
|
for_each_tdp_pte_min_level(iter, root, target_level + 1, start, end) {
|
|
|
|
retry:
|
2022-01-20 07:07:37 +08:00
|
|
|
if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
|
KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is enabled
When dirty logging is enabled without initially-all-set, try to split
all huge pages in the memslot down to 4KB pages so that vCPUs do not
have to take expensive write-protection faults to split huge pages.
Eager page splitting is best-effort only. This commit only adds the
support for the TDP MMU, and even there splitting may fail due to out
of memory conditions. Failures to split a huge page is fine from a
correctness standpoint because KVM will always follow up splitting by
write-protecting any remaining huge pages.
Eager page splitting moves the cost of splitting huge pages off of the
vCPU threads and onto the thread enabling dirty logging on the memslot.
This is useful because:
1. Splitting on the vCPU thread interrupts vCPUs execution and is
disruptive to customers whereas splitting on VM ioctl threads can
run in parallel with vCPU execution.
2. Splitting all huge pages at once is more efficient because it does
not require performing VM-exit handling or walking the page table for
every 4KiB page in the memslot, and greatly reduces the amount of
contention on the mmu_lock.
For example, when running dirty_log_perf_test with 96 virtual CPUs, 1GiB
per vCPU, and 1GiB HugeTLB memory, the time it takes vCPUs to write to
all of their memory after dirty logging is enabled decreased by 95% from
2.94s to 0.14s.
Eager Page Splitting is over 100x more efficient than the current
implementation of splitting on fault under the read lock. For example,
taking the same workload as above, Eager Page Splitting reduced the CPU
required to split all huge pages from ~270 CPU-seconds ((2.94s - 0.14s)
* 96 vCPU threads) to only 1.55 CPU-seconds.
Eager page splitting does increase the amount of time it takes to enable
dirty logging since it has split all huge pages. For example, the time
it took to enable dirty logging in the 96GiB region of the
aforementioned test increased from 0.001s to 1.55s.
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-16-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-20 07:07:36 +08:00
|
|
|
continue;
|
|
|
|
|
|
|
|
if (!is_shadow_present_pte(iter.old_spte) || !is_large_pte(iter.old_spte))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (!sp) {
|
2022-01-20 07:07:37 +08:00
|
|
|
sp = tdp_mmu_alloc_sp_for_split(kvm, &iter, shared);
|
KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is enabled
When dirty logging is enabled without initially-all-set, try to split
all huge pages in the memslot down to 4KB pages so that vCPUs do not
have to take expensive write-protection faults to split huge pages.
Eager page splitting is best-effort only. This commit only adds the
support for the TDP MMU, and even there splitting may fail due to out
of memory conditions. Failures to split a huge page is fine from a
correctness standpoint because KVM will always follow up splitting by
write-protecting any remaining huge pages.
Eager page splitting moves the cost of splitting huge pages off of the
vCPU threads and onto the thread enabling dirty logging on the memslot.
This is useful because:
1. Splitting on the vCPU thread interrupts vCPUs execution and is
disruptive to customers whereas splitting on VM ioctl threads can
run in parallel with vCPU execution.
2. Splitting all huge pages at once is more efficient because it does
not require performing VM-exit handling or walking the page table for
every 4KiB page in the memslot, and greatly reduces the amount of
contention on the mmu_lock.
For example, when running dirty_log_perf_test with 96 virtual CPUs, 1GiB
per vCPU, and 1GiB HugeTLB memory, the time it takes vCPUs to write to
all of their memory after dirty logging is enabled decreased by 95% from
2.94s to 0.14s.
Eager Page Splitting is over 100x more efficient than the current
implementation of splitting on fault under the read lock. For example,
taking the same workload as above, Eager Page Splitting reduced the CPU
required to split all huge pages from ~270 CPU-seconds ((2.94s - 0.14s)
* 96 vCPU threads) to only 1.55 CPU-seconds.
Eager page splitting does increase the amount of time it takes to enable
dirty logging since it has split all huge pages. For example, the time
it took to enable dirty logging in the 96GiB region of the
aforementioned test increased from 0.001s to 1.55s.
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-16-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-20 07:07:36 +08:00
|
|
|
if (!sp) {
|
|
|
|
ret = -ENOMEM;
|
2022-01-20 07:07:38 +08:00
|
|
|
trace_kvm_mmu_split_huge_page(iter.gfn,
|
|
|
|
iter.old_spte,
|
|
|
|
iter.level, ret);
|
KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is enabled
When dirty logging is enabled without initially-all-set, try to split
all huge pages in the memslot down to 4KB pages so that vCPUs do not
have to take expensive write-protection faults to split huge pages.
Eager page splitting is best-effort only. This commit only adds the
support for the TDP MMU, and even there splitting may fail due to out
of memory conditions. Failures to split a huge page is fine from a
correctness standpoint because KVM will always follow up splitting by
write-protecting any remaining huge pages.
Eager page splitting moves the cost of splitting huge pages off of the
vCPU threads and onto the thread enabling dirty logging on the memslot.
This is useful because:
1. Splitting on the vCPU thread interrupts vCPUs execution and is
disruptive to customers whereas splitting on VM ioctl threads can
run in parallel with vCPU execution.
2. Splitting all huge pages at once is more efficient because it does
not require performing VM-exit handling or walking the page table for
every 4KiB page in the memslot, and greatly reduces the amount of
contention on the mmu_lock.
For example, when running dirty_log_perf_test with 96 virtual CPUs, 1GiB
per vCPU, and 1GiB HugeTLB memory, the time it takes vCPUs to write to
all of their memory after dirty logging is enabled decreased by 95% from
2.94s to 0.14s.
Eager Page Splitting is over 100x more efficient than the current
implementation of splitting on fault under the read lock. For example,
taking the same workload as above, Eager Page Splitting reduced the CPU
required to split all huge pages from ~270 CPU-seconds ((2.94s - 0.14s)
* 96 vCPU threads) to only 1.55 CPU-seconds.
Eager page splitting does increase the amount of time it takes to enable
dirty logging since it has split all huge pages. For example, the time
it took to enable dirty logging in the 96GiB region of the
aforementioned test increased from 0.001s to 1.55s.
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-16-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-20 07:07:36 +08:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (iter.yielded)
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
KVM: x86/mmu: Split huge pages mapped by the TDP MMU on fault
Now that the TDP MMU has a mechanism to split huge pages, use it in the
fault path when a huge page needs to be replaced with a mapping at a
lower level.
This change reduces the negative performance impact of NX HugePages.
Prior to this change if a vCPU executed from a huge page and NX
HugePages was enabled, the vCPU would take a fault, zap the huge page,
and mapping the faulting address at 4KiB with execute permissions
enabled. The rest of the memory would be left *unmapped* and have to be
faulted back in by the guest upon access (read, write, or execute). If
guest is backed by 1GiB, a single execute instruction can zap an entire
GiB of its physical address space.
For example, it can take a VM longer to execute from its memory than to
populate that memory in the first place:
$ ./execute_perf_test -s anonymous_hugetlb_1gb -v96
Populating memory : 2.748378795s
Executing from memory : 2.899670885s
With this change, such faults split the huge page instead of zapping it,
which avoids the non-present faults on the rest of the huge page:
$ ./execute_perf_test -s anonymous_hugetlb_1gb -v96
Populating memory : 2.729544474s
Executing from memory : 0.111965688s <---
This change also reduces the performance impact of dirty logging when
eager_page_split=N. eager_page_split=N (abbreviated "eps=N" below) can
be desirable for read-heavy workloads, as it avoids allocating memory to
split huge pages that are never written and avoids increasing the TLB
miss cost on reads of those pages.
| Config: ept=Y, tdp_mmu=Y, 5% writes |
| Iteration 1 dirty memory time |
| --------------------------------------------- |
vCPU Count | eps=N (Before) | eps=N (After) | eps=Y |
------------ | -------------- | ------------- | ------------ |
2 | 0.332305091s | 0.019615027s | 0.006108211s |
4 | 0.353096020s | 0.019452131s | 0.006214670s |
8 | 0.453938562s | 0.019748246s | 0.006610997s |
16 | 0.719095024s | 0.019972171s | 0.007757889s |
32 | 1.698727124s | 0.021361615s | 0.012274432s |
64 | 2.630673582s | 0.031122014s | 0.016994683s |
96 | 3.016535213s | 0.062608739s | 0.044760838s |
Eager page splitting remains beneficial for write-heavy workloads, but
the gap is now reduced.
| Config: ept=Y, tdp_mmu=Y, 100% writes |
| Iteration 1 dirty memory time |
| --------------------------------------------- |
vCPU Count | eps=N (Before) | eps=N (After) | eps=Y |
------------ | -------------- | ------------- | ------------ |
2 | 0.317710329s | 0.296204596s | 0.058689782s |
4 | 0.337102375s | 0.299841017s | 0.060343076s |
8 | 0.386025681s | 0.297274460s | 0.060399702s |
16 | 0.791462524s | 0.298942578s | 0.062508699s |
32 | 1.719646014s | 0.313101996s | 0.075984855s |
64 | 2.527973150s | 0.455779206s | 0.079789363s |
96 | 2.681123208s | 0.673778787s | 0.165386739s |
Further study is needed to determine if the remaining gap is acceptable
for customer workloads or if eager_page_split=N still requires a-priori
knowledge of the VM workload, especially when considering these costs
extrapolated out to large VMs with e.g. 416 vCPUs and 12TB RAM.
Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Message-Id: <20221109185905.486172-3-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-10 02:59:05 +08:00
|
|
|
tdp_mmu_init_child_sp(sp, &iter);
|
|
|
|
|
2022-01-20 07:07:37 +08:00
|
|
|
if (tdp_mmu_split_huge_page(kvm, &iter, sp, shared))
|
KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is enabled
When dirty logging is enabled without initially-all-set, try to split
all huge pages in the memslot down to 4KB pages so that vCPUs do not
have to take expensive write-protection faults to split huge pages.
Eager page splitting is best-effort only. This commit only adds the
support for the TDP MMU, and even there splitting may fail due to out
of memory conditions. Failures to split a huge page is fine from a
correctness standpoint because KVM will always follow up splitting by
write-protecting any remaining huge pages.
Eager page splitting moves the cost of splitting huge pages off of the
vCPU threads and onto the thread enabling dirty logging on the memslot.
This is useful because:
1. Splitting on the vCPU thread interrupts vCPUs execution and is
disruptive to customers whereas splitting on VM ioctl threads can
run in parallel with vCPU execution.
2. Splitting all huge pages at once is more efficient because it does
not require performing VM-exit handling or walking the page table for
every 4KiB page in the memslot, and greatly reduces the amount of
contention on the mmu_lock.
For example, when running dirty_log_perf_test with 96 virtual CPUs, 1GiB
per vCPU, and 1GiB HugeTLB memory, the time it takes vCPUs to write to
all of their memory after dirty logging is enabled decreased by 95% from
2.94s to 0.14s.
Eager Page Splitting is over 100x more efficient than the current
implementation of splitting on fault under the read lock. For example,
taking the same workload as above, Eager Page Splitting reduced the CPU
required to split all huge pages from ~270 CPU-seconds ((2.94s - 0.14s)
* 96 vCPU threads) to only 1.55 CPU-seconds.
Eager page splitting does increase the amount of time it takes to enable
dirty logging since it has split all huge pages. For example, the time
it took to enable dirty logging in the 96GiB region of the
aforementioned test increased from 0.001s to 1.55s.
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-16-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-20 07:07:36 +08:00
|
|
|
goto retry;
|
|
|
|
|
|
|
|
sp = NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
rcu_read_unlock();
|
|
|
|
|
|
|
|
/*
|
|
|
|
* It's possible to exit the loop having never used the last sp if, for
|
|
|
|
* example, a vCPU doing HugePage NX splitting wins the race and
|
|
|
|
* installs its own sp in place of the last sp we tried to split.
|
|
|
|
*/
|
|
|
|
if (sp)
|
|
|
|
tdp_mmu_free_sp(sp);
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2022-01-20 07:07:37 +08:00
|
|
|
|
KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is enabled
When dirty logging is enabled without initially-all-set, try to split
all huge pages in the memslot down to 4KB pages so that vCPUs do not
have to take expensive write-protection faults to split huge pages.
Eager page splitting is best-effort only. This commit only adds the
support for the TDP MMU, and even there splitting may fail due to out
of memory conditions. Failures to split a huge page is fine from a
correctness standpoint because KVM will always follow up splitting by
write-protecting any remaining huge pages.
Eager page splitting moves the cost of splitting huge pages off of the
vCPU threads and onto the thread enabling dirty logging on the memslot.
This is useful because:
1. Splitting on the vCPU thread interrupts vCPUs execution and is
disruptive to customers whereas splitting on VM ioctl threads can
run in parallel with vCPU execution.
2. Splitting all huge pages at once is more efficient because it does
not require performing VM-exit handling or walking the page table for
every 4KiB page in the memslot, and greatly reduces the amount of
contention on the mmu_lock.
For example, when running dirty_log_perf_test with 96 virtual CPUs, 1GiB
per vCPU, and 1GiB HugeTLB memory, the time it takes vCPUs to write to
all of their memory after dirty logging is enabled decreased by 95% from
2.94s to 0.14s.
Eager Page Splitting is over 100x more efficient than the current
implementation of splitting on fault under the read lock. For example,
taking the same workload as above, Eager Page Splitting reduced the CPU
required to split all huge pages from ~270 CPU-seconds ((2.94s - 0.14s)
* 96 vCPU threads) to only 1.55 CPU-seconds.
Eager page splitting does increase the amount of time it takes to enable
dirty logging since it has split all huge pages. For example, the time
it took to enable dirty logging in the 96GiB region of the
aforementioned test increased from 0.001s to 1.55s.
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-16-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-20 07:07:36 +08:00
|
|
|
/*
|
|
|
|
* Try to split all huge pages mapped by the TDP MMU down to the target level.
|
|
|
|
*/
|
|
|
|
void kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm,
|
|
|
|
const struct kvm_memory_slot *slot,
|
|
|
|
gfn_t start, gfn_t end,
|
2022-01-20 07:07:37 +08:00
|
|
|
int target_level, bool shared)
|
KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is enabled
When dirty logging is enabled without initially-all-set, try to split
all huge pages in the memslot down to 4KB pages so that vCPUs do not
have to take expensive write-protection faults to split huge pages.
Eager page splitting is best-effort only. This commit only adds the
support for the TDP MMU, and even there splitting may fail due to out
of memory conditions. Failures to split a huge page is fine from a
correctness standpoint because KVM will always follow up splitting by
write-protecting any remaining huge pages.
Eager page splitting moves the cost of splitting huge pages off of the
vCPU threads and onto the thread enabling dirty logging on the memslot.
This is useful because:
1. Splitting on the vCPU thread interrupts vCPUs execution and is
disruptive to customers whereas splitting on VM ioctl threads can
run in parallel with vCPU execution.
2. Splitting all huge pages at once is more efficient because it does
not require performing VM-exit handling or walking the page table for
every 4KiB page in the memslot, and greatly reduces the amount of
contention on the mmu_lock.
For example, when running dirty_log_perf_test with 96 virtual CPUs, 1GiB
per vCPU, and 1GiB HugeTLB memory, the time it takes vCPUs to write to
all of their memory after dirty logging is enabled decreased by 95% from
2.94s to 0.14s.
Eager Page Splitting is over 100x more efficient than the current
implementation of splitting on fault under the read lock. For example,
taking the same workload as above, Eager Page Splitting reduced the CPU
required to split all huge pages from ~270 CPU-seconds ((2.94s - 0.14s)
* 96 vCPU threads) to only 1.55 CPU-seconds.
Eager page splitting does increase the amount of time it takes to enable
dirty logging since it has split all huge pages. For example, the time
it took to enable dirty logging in the 96GiB region of the
aforementioned test increased from 0.001s to 1.55s.
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-16-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-20 07:07:36 +08:00
|
|
|
{
|
|
|
|
struct kvm_mmu_page *root;
|
|
|
|
int r = 0;
|
|
|
|
|
2022-01-20 07:07:37 +08:00
|
|
|
kvm_lockdep_assert_mmu_lock_held(kvm, shared);
|
KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is enabled
When dirty logging is enabled without initially-all-set, try to split
all huge pages in the memslot down to 4KB pages so that vCPUs do not
have to take expensive write-protection faults to split huge pages.
Eager page splitting is best-effort only. This commit only adds the
support for the TDP MMU, and even there splitting may fail due to out
of memory conditions. Failures to split a huge page is fine from a
correctness standpoint because KVM will always follow up splitting by
write-protecting any remaining huge pages.
Eager page splitting moves the cost of splitting huge pages off of the
vCPU threads and onto the thread enabling dirty logging on the memslot.
This is useful because:
1. Splitting on the vCPU thread interrupts vCPUs execution and is
disruptive to customers whereas splitting on VM ioctl threads can
run in parallel with vCPU execution.
2. Splitting all huge pages at once is more efficient because it does
not require performing VM-exit handling or walking the page table for
every 4KiB page in the memslot, and greatly reduces the amount of
contention on the mmu_lock.
For example, when running dirty_log_perf_test with 96 virtual CPUs, 1GiB
per vCPU, and 1GiB HugeTLB memory, the time it takes vCPUs to write to
all of their memory after dirty logging is enabled decreased by 95% from
2.94s to 0.14s.
Eager Page Splitting is over 100x more efficient than the current
implementation of splitting on fault under the read lock. For example,
taking the same workload as above, Eager Page Splitting reduced the CPU
required to split all huge pages from ~270 CPU-seconds ((2.94s - 0.14s)
* 96 vCPU threads) to only 1.55 CPU-seconds.
Eager page splitting does increase the amount of time it takes to enable
dirty logging since it has split all huge pages. For example, the time
it took to enable dirty logging in the 96GiB region of the
aforementioned test increased from 0.001s to 1.55s.
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-16-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-20 07:07:36 +08:00
|
|
|
|
2022-03-02 21:44:22 +08:00
|
|
|
for_each_valid_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, shared) {
|
2022-01-20 07:07:37 +08:00
|
|
|
r = tdp_mmu_split_huge_pages_root(kvm, root, start, end, target_level, shared);
|
KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is enabled
When dirty logging is enabled without initially-all-set, try to split
all huge pages in the memslot down to 4KB pages so that vCPUs do not
have to take expensive write-protection faults to split huge pages.
Eager page splitting is best-effort only. This commit only adds the
support for the TDP MMU, and even there splitting may fail due to out
of memory conditions. Failures to split a huge page is fine from a
correctness standpoint because KVM will always follow up splitting by
write-protecting any remaining huge pages.
Eager page splitting moves the cost of splitting huge pages off of the
vCPU threads and onto the thread enabling dirty logging on the memslot.
This is useful because:
1. Splitting on the vCPU thread interrupts vCPUs execution and is
disruptive to customers whereas splitting on VM ioctl threads can
run in parallel with vCPU execution.
2. Splitting all huge pages at once is more efficient because it does
not require performing VM-exit handling or walking the page table for
every 4KiB page in the memslot, and greatly reduces the amount of
contention on the mmu_lock.
For example, when running dirty_log_perf_test with 96 virtual CPUs, 1GiB
per vCPU, and 1GiB HugeTLB memory, the time it takes vCPUs to write to
all of their memory after dirty logging is enabled decreased by 95% from
2.94s to 0.14s.
Eager Page Splitting is over 100x more efficient than the current
implementation of splitting on fault under the read lock. For example,
taking the same workload as above, Eager Page Splitting reduced the CPU
required to split all huge pages from ~270 CPU-seconds ((2.94s - 0.14s)
* 96 vCPU threads) to only 1.55 CPU-seconds.
Eager page splitting does increase the amount of time it takes to enable
dirty logging since it has split all huge pages. For example, the time
it took to enable dirty logging in the 96GiB region of the
aforementioned test increased from 0.001s to 1.55s.
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-16-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-20 07:07:36 +08:00
|
|
|
if (r) {
|
2022-01-20 07:07:37 +08:00
|
|
|
kvm_tdp_mmu_put_root(kvm, root, shared);
|
KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is enabled
When dirty logging is enabled without initially-all-set, try to split
all huge pages in the memslot down to 4KB pages so that vCPUs do not
have to take expensive write-protection faults to split huge pages.
Eager page splitting is best-effort only. This commit only adds the
support for the TDP MMU, and even there splitting may fail due to out
of memory conditions. Failures to split a huge page is fine from a
correctness standpoint because KVM will always follow up splitting by
write-protecting any remaining huge pages.
Eager page splitting moves the cost of splitting huge pages off of the
vCPU threads and onto the thread enabling dirty logging on the memslot.
This is useful because:
1. Splitting on the vCPU thread interrupts vCPUs execution and is
disruptive to customers whereas splitting on VM ioctl threads can
run in parallel with vCPU execution.
2. Splitting all huge pages at once is more efficient because it does
not require performing VM-exit handling or walking the page table for
every 4KiB page in the memslot, and greatly reduces the amount of
contention on the mmu_lock.
For example, when running dirty_log_perf_test with 96 virtual CPUs, 1GiB
per vCPU, and 1GiB HugeTLB memory, the time it takes vCPUs to write to
all of their memory after dirty logging is enabled decreased by 95% from
2.94s to 0.14s.
Eager Page Splitting is over 100x more efficient than the current
implementation of splitting on fault under the read lock. For example,
taking the same workload as above, Eager Page Splitting reduced the CPU
required to split all huge pages from ~270 CPU-seconds ((2.94s - 0.14s)
* 96 vCPU threads) to only 1.55 CPU-seconds.
Eager page splitting does increase the amount of time it takes to enable
dirty logging since it has split all huge pages. For example, the time
it took to enable dirty logging in the 96GiB region of the
aforementioned test increased from 0.001s to 1.55s.
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-16-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-20 07:07:36 +08:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2020-10-15 02:26:55 +08:00
|
|
|
/*
|
|
|
|
* Clear the dirty status of all the SPTEs mapping GFNs in the memslot. If
|
|
|
|
* AD bits are enabled, this will involve clearing the dirty bit on each SPTE.
|
|
|
|
* If AD bits are not enabled, this will require clearing the writable bit on
|
|
|
|
* each SPTE. Returns true if an SPTE has been changed and the TLBs need to
|
|
|
|
* be flushed.
|
|
|
|
*/
|
|
|
|
static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
|
|
|
|
gfn_t start, gfn_t end)
|
|
|
|
{
|
2023-03-22 06:00:11 +08:00
|
|
|
u64 dbit = kvm_ad_enabled() ? shadow_dirty_mask : PT_WRITABLE_MASK;
|
2020-10-15 02:26:55 +08:00
|
|
|
struct tdp_iter iter;
|
|
|
|
bool spte_set = false;
|
|
|
|
|
2021-02-03 02:57:23 +08:00
|
|
|
rcu_read_lock();
|
|
|
|
|
2020-10-15 02:26:55 +08:00
|
|
|
tdp_root_for_each_leaf_pte(iter, root, start, end) {
|
2021-04-02 07:37:34 +08:00
|
|
|
retry:
|
|
|
|
if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
|
2021-02-03 02:57:20 +08:00
|
|
|
continue;
|
|
|
|
|
2022-02-26 08:15:20 +08:00
|
|
|
if (!is_shadow_present_pte(iter.old_spte))
|
|
|
|
continue;
|
|
|
|
|
2023-07-29 08:47:16 +08:00
|
|
|
KVM_MMU_WARN_ON(kvm_ad_enabled() &&
|
|
|
|
spte_ad_need_write_protect(iter.old_spte));
|
2023-03-22 06:00:10 +08:00
|
|
|
|
2023-03-22 06:00:11 +08:00
|
|
|
if (!(iter.old_spte & dbit))
|
|
|
|
continue;
|
2020-10-15 02:26:55 +08:00
|
|
|
|
2023-03-22 06:00:11 +08:00
|
|
|
if (tdp_mmu_set_spte_atomic(kvm, &iter, iter.old_spte & ~dbit))
|
2021-04-02 07:37:34 +08:00
|
|
|
goto retry;
|
2022-01-20 07:07:24 +08:00
|
|
|
|
2020-10-15 02:26:55 +08:00
|
|
|
spte_set = true;
|
|
|
|
}
|
2021-02-03 02:57:23 +08:00
|
|
|
|
|
|
|
rcu_read_unlock();
|
2020-10-15 02:26:55 +08:00
|
|
|
return spte_set;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Clear the dirty status of all the SPTEs mapping GFNs in the memslot. If
|
|
|
|
* AD bits are enabled, this will involve clearing the dirty bit on each SPTE.
|
|
|
|
* If AD bits are not enabled, this will require clearing the writable bit on
|
|
|
|
* each SPTE. Returns true if an SPTE has been changed and the TLBs need to
|
|
|
|
* be flushed.
|
|
|
|
*/
|
2021-07-13 10:33:38 +08:00
|
|
|
bool kvm_tdp_mmu_clear_dirty_slot(struct kvm *kvm,
|
|
|
|
const struct kvm_memory_slot *slot)
|
2020-10-15 02:26:55 +08:00
|
|
|
{
|
|
|
|
struct kvm_mmu_page *root;
|
|
|
|
bool spte_set = false;
|
|
|
|
|
2021-04-02 07:37:34 +08:00
|
|
|
lockdep_assert_held_read(&kvm->mmu_lock);
|
2020-10-15 02:26:55 +08:00
|
|
|
|
2021-12-15 09:15:56 +08:00
|
|
|
for_each_valid_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, true)
|
2020-10-15 02:26:55 +08:00
|
|
|
spte_set |= clear_dirty_gfn_range(kvm, root, slot->base_gfn,
|
|
|
|
slot->base_gfn + slot->npages);
|
|
|
|
|
|
|
|
return spte_set;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Clears the dirty status of all the 4k SPTEs mapping GFNs for which a bit is
|
|
|
|
* set in mask, starting at gfn. The given memslot is expected to contain all
|
|
|
|
* the GFNs represented by set bits in the mask. If AD bits are enabled,
|
|
|
|
* clearing the dirty status will involve clearing the dirty bit on each SPTE
|
|
|
|
* or, if AD bits are not enabled, clearing the writable bit on each SPTE.
|
|
|
|
*/
|
|
|
|
static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root,
|
|
|
|
gfn_t gfn, unsigned long mask, bool wrprot)
|
|
|
|
{
|
2023-03-22 06:00:11 +08:00
|
|
|
u64 dbit = (wrprot || !kvm_ad_enabled()) ? PT_WRITABLE_MASK :
|
|
|
|
shadow_dirty_mask;
|
2020-10-15 02:26:55 +08:00
|
|
|
struct tdp_iter iter;
|
|
|
|
|
2023-06-27 12:26:39 +08:00
|
|
|
lockdep_assert_held_write(&kvm->mmu_lock);
|
|
|
|
|
2021-02-03 02:57:23 +08:00
|
|
|
rcu_read_lock();
|
|
|
|
|
2020-10-15 02:26:55 +08:00
|
|
|
tdp_root_for_each_leaf_pte(iter, root, gfn + __ffs(mask),
|
|
|
|
gfn + BITS_PER_LONG) {
|
|
|
|
if (!mask)
|
|
|
|
break;
|
|
|
|
|
2023-07-29 08:47:16 +08:00
|
|
|
KVM_MMU_WARN_ON(kvm_ad_enabled() &&
|
|
|
|
spte_ad_need_write_protect(iter.old_spte));
|
2023-03-22 06:00:10 +08:00
|
|
|
|
2020-10-15 02:26:55 +08:00
|
|
|
if (iter.level > PG_LEVEL_4K ||
|
|
|
|
!(mask & (1UL << (iter.gfn - gfn))))
|
|
|
|
continue;
|
|
|
|
|
2021-02-03 02:57:22 +08:00
|
|
|
mask &= ~(1UL << (iter.gfn - gfn));
|
|
|
|
|
2023-03-22 06:00:11 +08:00
|
|
|
if (!(iter.old_spte & dbit))
|
|
|
|
continue;
|
2020-10-15 02:26:55 +08:00
|
|
|
|
2023-03-22 06:00:12 +08:00
|
|
|
iter.old_spte = tdp_mmu_clear_spte_bits(iter.sptep,
|
|
|
|
iter.old_spte, dbit,
|
|
|
|
iter.level);
|
|
|
|
|
2023-03-22 06:00:14 +08:00
|
|
|
trace_kvm_tdp_mmu_spte_changed(iter.as_id, iter.gfn, iter.level,
|
|
|
|
iter.old_spte,
|
|
|
|
iter.old_spte & ~dbit);
|
|
|
|
kvm_set_pfn_dirty(spte_to_pfn(iter.old_spte));
|
2020-10-15 02:26:55 +08:00
|
|
|
}
|
2021-02-03 02:57:23 +08:00
|
|
|
|
|
|
|
rcu_read_unlock();
|
2020-10-15 02:26:55 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Clears the dirty status of all the 4k SPTEs mapping GFNs for which a bit is
|
|
|
|
* set in mask, starting at gfn. The given memslot is expected to contain all
|
|
|
|
* the GFNs represented by set bits in the mask. If AD bits are enabled,
|
|
|
|
* clearing the dirty status will involve clearing the dirty bit on each SPTE
|
|
|
|
* or, if AD bits are not enabled, clearing the writable bit on each SPTE.
|
|
|
|
*/
|
|
|
|
void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
|
|
|
|
struct kvm_memory_slot *slot,
|
|
|
|
gfn_t gfn, unsigned long mask,
|
|
|
|
bool wrprot)
|
|
|
|
{
|
|
|
|
struct kvm_mmu_page *root;
|
|
|
|
|
2021-03-26 10:19:45 +08:00
|
|
|
for_each_tdp_mmu_root(kvm, root, slot->as_id)
|
2020-10-15 02:26:55 +08:00
|
|
|
clear_dirty_pt_masked(kvm, root, gfn, mask, wrprot);
|
|
|
|
}
|
|
|
|
|
2021-11-20 12:50:21 +08:00
|
|
|
static void zap_collapsible_spte_range(struct kvm *kvm,
|
2020-10-15 02:26:56 +08:00
|
|
|
struct kvm_mmu_page *root,
|
2021-11-20 12:50:21 +08:00
|
|
|
const struct kvm_memory_slot *slot)
|
2020-10-15 02:26:56 +08:00
|
|
|
{
|
2021-02-13 08:50:06 +08:00
|
|
|
gfn_t start = slot->base_gfn;
|
|
|
|
gfn_t end = start + slot->npages;
|
2020-10-15 02:26:56 +08:00
|
|
|
struct tdp_iter iter;
|
2022-05-26 07:09:04 +08:00
|
|
|
int max_mapping_level;
|
2020-10-15 02:26:56 +08:00
|
|
|
|
2021-02-03 02:57:23 +08:00
|
|
|
rcu_read_lock();
|
|
|
|
|
2022-07-16 07:21:06 +08:00
|
|
|
for_each_tdp_pte_min_level(iter, root, PG_LEVEL_2M, start, end) {
|
|
|
|
retry:
|
2021-11-20 12:50:21 +08:00
|
|
|
if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
|
2021-02-03 02:57:20 +08:00
|
|
|
continue;
|
|
|
|
|
2022-07-16 07:21:06 +08:00
|
|
|
if (iter.level > KVM_MAX_HUGEPAGE_LEVEL ||
|
|
|
|
!is_shadow_present_pte(iter.old_spte))
|
2020-10-15 02:26:56 +08:00
|
|
|
continue;
|
|
|
|
|
2022-05-26 07:09:04 +08:00
|
|
|
/*
|
2022-07-16 07:21:06 +08:00
|
|
|
* Don't zap leaf SPTEs, if a leaf SPTE could be replaced with
|
|
|
|
* a large page size, then its parent would have been zapped
|
|
|
|
* instead of stepping down.
|
2022-05-26 07:09:04 +08:00
|
|
|
*/
|
2022-07-16 07:21:06 +08:00
|
|
|
if (is_last_spte(iter.old_spte, iter.level))
|
2022-05-26 07:09:04 +08:00
|
|
|
continue;
|
|
|
|
|
|
|
|
/*
|
2022-07-16 07:21:06 +08:00
|
|
|
* If iter.gfn resides outside of the slot, i.e. the page for
|
|
|
|
* the current level overlaps but is not contained by the slot,
|
|
|
|
* then the SPTE can't be made huge. More importantly, trying
|
|
|
|
* to query that info from slot->arch.lpage_info will cause an
|
|
|
|
* out-of-bounds access.
|
2022-05-26 07:09:04 +08:00
|
|
|
*/
|
2022-07-16 07:21:06 +08:00
|
|
|
if (iter.gfn < start || iter.gfn >= end)
|
|
|
|
continue;
|
2022-05-26 07:09:04 +08:00
|
|
|
|
2022-07-16 07:21:06 +08:00
|
|
|
max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot,
|
|
|
|
iter.gfn, PG_LEVEL_NUM);
|
|
|
|
if (max_mapping_level < iter.level)
|
|
|
|
continue;
|
2022-05-26 07:09:04 +08:00
|
|
|
|
2022-07-16 07:21:06 +08:00
|
|
|
/* Note, a successful atomic zap also does a remote TLB flush. */
|
|
|
|
if (tdp_mmu_zap_spte_atomic(kvm, &iter))
|
|
|
|
goto retry;
|
2020-10-15 02:26:56 +08:00
|
|
|
}
|
|
|
|
|
2021-02-03 02:57:23 +08:00
|
|
|
rcu_read_unlock();
|
2020-10-15 02:26:56 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2022-07-16 07:21:06 +08:00
|
|
|
* Zap non-leaf SPTEs (and free their associated page tables) which could
|
|
|
|
* be replaced by huge pages, for GFNs within the slot.
|
2020-10-15 02:26:56 +08:00
|
|
|
*/
|
2021-11-20 12:50:21 +08:00
|
|
|
void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
|
|
|
|
const struct kvm_memory_slot *slot)
|
2020-10-15 02:26:56 +08:00
|
|
|
{
|
|
|
|
struct kvm_mmu_page *root;
|
|
|
|
|
2021-04-02 07:37:33 +08:00
|
|
|
lockdep_assert_held_read(&kvm->mmu_lock);
|
2020-10-15 02:26:56 +08:00
|
|
|
|
2021-12-15 09:15:56 +08:00
|
|
|
for_each_valid_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, true)
|
2021-11-20 12:50:21 +08:00
|
|
|
zap_collapsible_spte_range(kvm, root, slot);
|
2020-10-15 02:26:56 +08:00
|
|
|
}
|
2020-10-15 02:26:57 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Removes write access on the last level SPTE mapping this GFN and unsets the
|
2021-02-26 04:47:43 +08:00
|
|
|
* MMU-writable bit to ensure future writes continue to be intercepted.
|
2020-10-15 02:26:57 +08:00
|
|
|
* Returns true if an SPTE was set and a TLB flush is needed.
|
|
|
|
*/
|
|
|
|
static bool write_protect_gfn(struct kvm *kvm, struct kvm_mmu_page *root,
|
2021-04-29 11:41:14 +08:00
|
|
|
gfn_t gfn, int min_level)
|
2020-10-15 02:26:57 +08:00
|
|
|
{
|
|
|
|
struct tdp_iter iter;
|
|
|
|
u64 new_spte;
|
|
|
|
bool spte_set = false;
|
|
|
|
|
2021-04-29 11:41:14 +08:00
|
|
|
BUG_ON(min_level > KVM_MAX_HUGEPAGE_LEVEL);
|
|
|
|
|
2021-02-03 02:57:23 +08:00
|
|
|
rcu_read_lock();
|
|
|
|
|
2022-01-20 07:07:32 +08:00
|
|
|
for_each_tdp_pte_min_level(iter, root, min_level, gfn, gfn + 1) {
|
2021-04-29 11:41:14 +08:00
|
|
|
if (!is_shadow_present_pte(iter.old_spte) ||
|
|
|
|
!is_last_spte(iter.old_spte, iter.level))
|
|
|
|
continue;
|
|
|
|
|
2020-10-15 02:26:57 +08:00
|
|
|
new_spte = iter.old_spte &
|
2021-02-26 04:47:43 +08:00
|
|
|
~(PT_WRITABLE_MASK | shadow_mmu_writable_mask);
|
2020-10-15 02:26:57 +08:00
|
|
|
|
2022-01-14 07:30:17 +08:00
|
|
|
if (new_spte == iter.old_spte)
|
|
|
|
break;
|
|
|
|
|
2023-03-22 06:00:19 +08:00
|
|
|
tdp_mmu_iter_set_spte(kvm, &iter, new_spte);
|
2020-10-15 02:26:57 +08:00
|
|
|
spte_set = true;
|
|
|
|
}
|
|
|
|
|
2021-02-03 02:57:23 +08:00
|
|
|
rcu_read_unlock();
|
|
|
|
|
2020-10-15 02:26:57 +08:00
|
|
|
return spte_set;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Removes write access on the last level SPTE mapping this GFN and unsets the
|
2021-02-26 04:47:43 +08:00
|
|
|
* MMU-writable bit to ensure future writes continue to be intercepted.
|
2020-10-15 02:26:57 +08:00
|
|
|
* Returns true if an SPTE was set and a TLB flush is needed.
|
|
|
|
*/
|
|
|
|
bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
|
2021-04-29 11:41:14 +08:00
|
|
|
struct kvm_memory_slot *slot, gfn_t gfn,
|
|
|
|
int min_level)
|
2020-10-15 02:26:57 +08:00
|
|
|
{
|
|
|
|
struct kvm_mmu_page *root;
|
|
|
|
bool spte_set = false;
|
|
|
|
|
2021-02-03 02:57:24 +08:00
|
|
|
lockdep_assert_held_write(&kvm->mmu_lock);
|
2021-03-26 10:19:45 +08:00
|
|
|
for_each_tdp_mmu_root(kvm, root, slot->as_id)
|
2021-04-29 11:41:14 +08:00
|
|
|
spte_set |= write_protect_gfn(kvm, root, gfn, min_level);
|
2021-03-26 10:19:45 +08:00
|
|
|
|
2020-10-15 02:26:57 +08:00
|
|
|
return spte_set;
|
|
|
|
}
|
|
|
|
|
2020-10-15 02:26:58 +08:00
|
|
|
/*
|
|
|
|
* Return the level of the lowest level SPTE added to sptes.
|
|
|
|
* That SPTE may be non-present.
|
2021-07-14 06:09:54 +08:00
|
|
|
*
|
|
|
|
* Must be called between kvm_tdp_mmu_walk_lockless_{begin,end}.
|
2020-10-15 02:26:58 +08:00
|
|
|
*/
|
2020-12-18 08:31:37 +08:00
|
|
|
int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
|
|
|
|
int *root_level)
|
2020-10-15 02:26:58 +08:00
|
|
|
{
|
|
|
|
struct tdp_iter iter;
|
|
|
|
struct kvm_mmu *mmu = vcpu->arch.mmu;
|
|
|
|
gfn_t gfn = addr >> PAGE_SHIFT;
|
2020-12-18 08:31:36 +08:00
|
|
|
int leaf = -1;
|
2020-10-15 02:26:58 +08:00
|
|
|
|
2022-02-10 20:41:19 +08:00
|
|
|
*root_level = vcpu->arch.mmu->root_role.level;
|
2020-10-15 02:26:58 +08:00
|
|
|
|
|
|
|
tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
|
|
|
|
leaf = iter.level;
|
2020-12-18 08:31:38 +08:00
|
|
|
sptes[leaf] = iter.old_spte;
|
2020-10-15 02:26:58 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
return leaf;
|
|
|
|
}
|
2021-07-14 06:09:55 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Returns the last level spte pointer of the shadow page walk for the given
|
|
|
|
* gpa, and sets *spte to the spte value. This spte may be non-preset. If no
|
|
|
|
* walk could be performed, returns NULL and *spte does not contain valid data.
|
|
|
|
*
|
|
|
|
* Contract:
|
|
|
|
* - Must be called between kvm_tdp_mmu_walk_lockless_{begin,end}.
|
|
|
|
* - The returned sptep must not be used after kvm_tdp_mmu_walk_lockless_end.
|
|
|
|
*
|
|
|
|
* WARNING: This function is only intended to be called during fast_page_fault.
|
|
|
|
*/
|
|
|
|
u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, u64 addr,
|
|
|
|
u64 *spte)
|
|
|
|
{
|
|
|
|
struct tdp_iter iter;
|
|
|
|
struct kvm_mmu *mmu = vcpu->arch.mmu;
|
|
|
|
gfn_t gfn = addr >> PAGE_SHIFT;
|
|
|
|
tdp_ptep_t sptep = NULL;
|
|
|
|
|
|
|
|
tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
|
|
|
|
*spte = iter.old_spte;
|
|
|
|
sptep = iter.sptep;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Perform the rcu_dereference to get the raw spte pointer value since
|
|
|
|
* we are passing it up to fast_page_fault, which is shared with the
|
|
|
|
* legacy MMU and thus does not retain the TDP MMU-specific __rcu
|
|
|
|
* annotation.
|
|
|
|
*
|
|
|
|
* This is safe since fast_page_fault obeys the contracts of this
|
|
|
|
* function as well as all TDP MMU contracts around modifying SPTEs
|
|
|
|
* outside of mmu_lock.
|
|
|
|
*/
|
|
|
|
return rcu_dereference(sptep);
|
|
|
|
}
|