Commit Graph

1169552 Commits

Author SHA1 Message Date
Sean Christopherson
9ed3bf4112 KVM: x86/mmu: Move filling of Hyper-V's TLB range struct into Hyper-V code
Refactor Hyper-V's range-based TLB flushing API to take a gfn+nr_pages
pair instead of a struct, and bury said struct in Hyper-V specific code.

Passing along two params generates much better code for the common case
where KVM is _not_ running on Hyper-V, as forwarding the flush on to
Hyper-V's hv_flush_remote_tlbs_range() from kvm_flush_remote_tlbs_range()
becomes a tail call.

Cc: David Matlack <dmatlack@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20230405003133.419177-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-10 15:17:29 -07:00
Sean Christopherson
8a1300ff95 KVM: x86: Rename Hyper-V remote TLB hooks to match established scheme
Rename the Hyper-V hooks for TLB flushing to match the naming scheme used
by all the other TLB flushing hooks, e.g. in kvm_x86_ops, vendor code,
arch hooks from common code, etc.

Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20230405003133.419177-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-10 15:17:29 -07:00
Vipin Sharma
40fa907e5a KVM: x86/mmu: Merge all handle_changed_pte*() functions
Merge __handle_changed_pte() and handle_changed_spte_acc_track() into a
single function, handle_changed_pte(), as the two are always used
together.  Remove the existing handle_changed_pte(), as it's just a
wrapper that calls __handle_changed_pte() and
handle_changed_spte_acc_track().

Signed-off-by: Vipin Sharma <vipinsh@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
[sean: massage changelog]
Link: https://lore.kernel.org/r/20230321220021.2119033-14-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04 12:37:31 -07:00
Vipin Sharma
1f9973456e KVM: x86/mmu: Remove handle_changed_spte_dirty_log()
Remove handle_changed_spte_dirty_log() as there is no code flow which
sets 4KiB SPTE writable and hit this path. This function marks the page
dirty in a memslot only if new SPTE is 4KiB in size and writable.

Current users of handle_changed_spte_dirty_log() are:
1. set_spte_gfn() - Create only non writable SPTEs.
2. write_protect_gfn() - Change an SPTE to non writable.
3. zap leaf and roots APIs - Everything is 0.
4. handle_removed_pt() - Sets SPTEs to REMOVED_SPTE
5. tdp_mmu_link_sp() - Makes non leaf SPTEs.

There is also no path which creates a writable 4KiB without going
through make_spte() and this functions takes care of marking SPTE dirty
in the memslot if it is PT_WRITABLE.

Signed-off-by: Vipin Sharma <vipinsh@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
[sean: add blurb to __handle_changed_spte()'s comment]
Link: https://lore.kernel.org/r/20230321220021.2119033-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04 12:37:30 -07:00
Vipin Sharma
0b7cc2547d KVM: x86/mmu: Remove "record_acc_track" in __tdp_mmu_set_spte()
Remove bool parameter "record_acc_track" from __tdp_mmu_set_spte() and
refactor the code. This variable is always set to true by its caller.

Remove single and double underscore prefix from tdp_mmu_set_spte()
related APIs:
1. Change __tdp_mmu_set_spte() to tdp_mmu_set_spte()
2. Change _tdp_mmu_set_spte() to tdp_mmu_iter_set_spte()

Signed-off-by: Vipin Sharma <vipinsh@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20230321220021.2119033-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04 12:37:30 -07:00
Vipin Sharma
891f115960 KVM: x86/mmu: Bypass __handle_changed_spte() when aging TDP MMU SPTEs
Drop everything except the "tdp_mmu_spte_changed" tracepoint part of
__handle_changed_spte() when aging SPTEs in the TDP MMU, as clearing the
accessed status doesn't affect the SPTE's shadow-present status, whether
or not the SPTE is a leaf, or change the PFN.  I.e. none of the functional
updates handled by __handle_changed_spte() are relevant.

Losing __handle_changed_spte()'s sanity checks does mean that a bug could
theoretical go unnoticed, but that scenario is extremely unlikely, e.g.
would effectively require a misconfigured MMU or a locking bug elsewhere.

Link: https://lore.kernel.org/all/Y9HcHRBShQgjxsQb@google.com
Signed-off-by: Vipin Sharma <vipinsh@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
[sean: massage changelog]
Link: https://lore.kernel.org/r/20230321220021.2119033-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04 12:37:30 -07:00
Vipin Sharma
6141df067d KVM: x86/mmu: Drop unnecessary dirty log checks when aging TDP MMU SPTEs
Drop the unnecessary call to handle dirty log updates when aging TDP MMU
SPTEs, as neither clearing the Accessed bit nor marking a SPTE for access
tracking can _set_ the Writable bit, i.e. can't trigger marking a gfn
dirty in its memslot.  The access tracking path can _clear_ the Writable
bit, e.g. if the XCHG races with fast_page_fault() and writes the stale
value without the Writable bit set, but clearing the Writable bit outside
of mmu_lock is not allowed, i.e. access tracking can't spuriously set the
Writable bit.

Signed-off-by: Vipin Sharma <vipinsh@google.com>
[sean: split to separate patch, apply to dirty path, write changelog]
Link: https://lore.kernel.org/r/20230321220021.2119033-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04 12:37:30 -07:00
Vipin Sharma
7ee131e3a3 KVM: x86/mmu: Clear only A-bit (if enabled) when aging TDP MMU SPTEs
Use tdp_mmu_clear_spte_bits() when clearing the Accessed bit in TDP MMU
SPTEs so as to use an atomic-AND instead of XCHG to clear the A-bit.
Similar to the D-bit story, this will allow KVM to bypass
__handle_changed_spte() by ensuring only the A-bit is modified.

Link: https://lore.kernel.org/all/Y9HcHRBShQgjxsQb@google.com
Signed-off-by: Vipin Sharma <vipinsh@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
[sean: massage changelog]
Link: https://lore.kernel.org/r/20230321220021.2119033-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04 12:37:30 -07:00
Vipin Sharma
e73008705d KVM: x86/mmu: Remove "record_dirty_log" in __tdp_mmu_set_spte()
Remove bool parameter "record_dirty_log" from __tdp_mmu_set_spte() and
refactor the code as this variable is always set to true by its caller.

Signed-off-by: Vipin Sharma <vipinsh@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20230321220021.2119033-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04 12:37:30 -07:00
Vipin Sharma
1e0f42985f KVM: x86/mmu: Bypass __handle_changed_spte() when clearing TDP MMU dirty bits
Drop everything except marking the PFN dirty and the relevant tracepoint
parts of __handle_changed_spte() when clearing the dirty status of gfns in
the TDP MMU.  Clearing only the Dirty (or Writable) bit doesn't affect
the SPTEs shadow-present status, whether or not the SPTE is a leaf, or
change the SPTE's PFN.  I.e. other than marking the PFN dirty, none of the
functional updates handled by __handle_changed_spte() are relevant.

Losing __handle_changed_spte()'s sanity checks does mean that a bug could
theoretical go unnoticed, but that scenario is extremely unlikely, e.g.
would effectively require a misconfigured or a locking bug elsewhere.

Opportunistically remove a comment blurb from __handle_changed_spte()
about all modifications to TDP MMU SPTEs needing to invoke said function,
that "rule" hasn't been true since fast page fault support was added for
the TDP MMU (and perhaps even before).

Tested on a VM (160 vCPUs, 160 GB memory) and found that performance of
clear dirty log stage improved by ~40% in dirty_log_perf_test (with the
full optimization applied).

Before optimization:
--------------------
Iteration 1 clear dirty log time: 3.638543593s
Iteration 2 clear dirty log time: 3.145032742s
Iteration 3 clear dirty log time: 3.142340358s
Clear dirty log over 3 iterations took 9.925916693s. (Avg 3.308638897s/iteration)

After optimization:
-------------------
Iteration 1 clear dirty log time: 2.318988110s
Iteration 2 clear dirty log time: 1.794470164s
Iteration 3 clear dirty log time: 1.791668628s
Clear dirty log over 3 iterations took 5.905126902s. (Avg 1.968375634s/iteration)

Link: https://lore.kernel.org/all/Y9hXmz%2FnDOr1hQal@google.com
Signed-off-by: Vipin Sharma <vipinsh@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
[sean: split the switch to atomic-AND to a separate patch]
Link: https://lore.kernel.org/r/20230321220021.2119033-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04 12:37:30 -07:00
Vipin Sharma
cf05e8c732 KVM: x86/mmu: Drop access tracking checks when clearing TDP MMU dirty bits
Drop the unnecessary call to handle access-tracking changes when clearing
the dirty status of TDP MMU SPTEs.  Neither the Dirty bit nor the Writable
bit has any impact on the accessed state of a page, i.e. clearing only
the aforementioned bits doesn't make an accessed SPTE suddently not
accessed.

Signed-off-by: Vipin Sharma <vipinsh@google.com>
[sean: split to separate patch, write changelog]
Link: https://lore.kernel.org/r/20230321220021.2119033-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04 12:37:30 -07:00
Vipin Sharma
89c313f20c KVM: x86/mmu: Atomically clear SPTE dirty state in the clear-dirty-log flow
Optimize the clearing of dirty state in TDP MMU SPTEs by doing an
atomic-AND (on SPTEs that have volatile bits) instead of the full XCHG
that currently ends up being invoked (see kvm_tdp_mmu_write_spte()).
Clearing _only_ the bit in question will allow KVM to skip the many
irrelevant checks in __handle_changed_spte() by avoiding any collateral
damage due to the XCHG writing all SPTE bits, e.g. the XCHG could race
with fast_page_fault() setting the W-bit and the CPU setting the D-bit,
and thus incorrectly drop the CPU's D-bit update.

Link: https://lore.kernel.org/all/Y9hXmz%2FnDOr1hQal@google.com
Signed-off-by: Vipin Sharma <vipinsh@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
[sean: split the switch to atomic-AND to a separate patch]
Link: https://lore.kernel.org/r/20230321220021.2119033-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04 12:37:30 -07:00
Vipin Sharma
697c89bed9 KVM: x86/mmu: Consolidate Dirty vs. Writable clearing logic in TDP MMU
Deduplicate the guts of the TDP MMU's clearing of dirty status by
snapshotting whether to check+clear the Dirty bit vs. the Writable bit,
which is the only difference between the two flavors of dirty tracking.

Note, kvm_ad_enabled() is just a wrapper for shadow_accessed_mask, i.e.
is constant after kvm-{intel,amd}.ko is loaded.

Link: https://lore.kernel.org/all/Yz4Qi7cn7TWTWQjj@google.com
Signed-off-by: Vipin Sharma <vipinsh@google.com>
[sean: split to separate patch, apply to dirty log, write changelog]
Link: https://lore.kernel.org/r/20230321220021.2119033-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04 12:37:30 -07:00
Vipin Sharma
5982a53926 KVM: x86/mmu: Use kvm_ad_enabled() to determine if TDP MMU SPTEs need wrprot
Use the constant-after-module-load kvm_ad_enabled() to check if SPTEs in
the TDP MMU need to be write-protected when clearing accessed/dirty status
instead of manually checking every SPTE.  The per-SPTE A/D enabling is
specific to nested EPT MMUs, i.e. when KVM is using EPT A/D bits but L1 is
not, and so cannot happen in the TDP MMU (which is non-nested only).

Keep the original code as sanity checks buried under MMU_WARN_ON().
MMU_WARN_ON() is more or less useless at the moment, but there are plans
to change that.

Link: https://lore.kernel.org/all/Yz4Qi7cn7TWTWQjj@google.com
Signed-off-by: Vipin Sharma <vipinsh@google.com>
[sean: split to separate patch, apply to dirty path, write changelog]
Link: https://lore.kernel.org/r/20230321220021.2119033-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04 12:37:30 -07:00
Vipin Sharma
41e07665f1 KVM: x86/mmu: Add a helper function to check if an SPTE needs atomic write
Move conditions in kvm_tdp_mmu_write_spte() to check if an SPTE should
be written atomically or not to a separate function.

This new function, kvm_tdp_mmu_spte_need_atomic_write(),  will be used
in future commits to optimize clearing bits in SPTEs.

Signed-off-by: Vipin Sharma <vipinsh@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Link: https://lore.kernel.org/r/20230321220021.2119033-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04 12:37:30 -07:00
Mathias Krause
50f1399845 KVM: x86/mmu: Fix comment typo
Fix a small comment typo in make_spte().

Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Link: https://lore.kernel.org/r/20230322013731.102955-6-minipli@grsecurity.net
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-22 07:46:53 -07:00
Paolo Bonzini
2fdcc1b324 KVM: x86/mmu: Avoid indirect call for get_cr3
Most of the time, calls to get_guest_pgd result in calling
kvm_read_cr3 (the exception is only nested TDP).  Hardcode
the default instead of using the get_cr3 function, avoiding
a retpoline if they are enabled.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Link: https://lore.kernel.org/r/20230322013731.102955-2-minipli@grsecurity.net
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-22 07:46:42 -07:00
Sean Christopherson
f3d90f901d KVM: x86/mmu: Clean up mmu.c functions that put return type on separate line
Adjust a variety of functions in mmu.c to put the function return type on
the same line as the function declaration.  As stated in the Linus
specification:

  But the "on their own line" is complete garbage to begin with. That
  will NEVER be a kernel rule. We should never have a rule that assumes
  things are so long that they need to be on multiple lines.

  We don't put function return types on their own lines either, even if
  some other projects have that rule (just to get function names at the
  beginning of lines or some other odd reason).

Leave the functions generated by BUILD_MMU_ROLE_REGS_ACCESSOR() as-is,
that code is basically illegible no matter how it's formatted.

No functional change intended.

Link: https://lore.kernel.org/mm-commits/CAHk-=wjS-Jg7sGMwUPpDsjv392nDOOs0CtUtVkp=S6Q7JzFJRw@mail.gmail.com
Signed-off-by: Ben Gardon <bgardon@google.com>
Link: https://lore.kernel.org/r/20230202182809.1929122-4-bgardon@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-17 16:02:47 -07:00
Sean Christopherson
eddd9e8302 KVM: x86/mmu: Replace comment with an actual lockdep assertion on mmu_lock
Assert that mmu_lock is held for write in __walk_slot_rmaps() instead of
hoping the function comment will magically prevent introducing bugs.

Signed-off-by: Ben Gardon <bgardon@google.com>
Link: https://lore.kernel.org/r/20230202182809.1929122-3-bgardon@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-17 16:02:07 -07:00
Sean Christopherson
727ae37701 KVM: x86/mmu: Rename slot rmap walkers to add clarity and clean up code
Replace "slot_handle_level" with "walk_slot_rmaps" to better capture what
the helpers are doing, and to slightly shorten the function names so that
each function's return type and attributes can be placed on the same line
as the function declaration.

No functional change intended.

Link: https://lore.kernel.org/mm-commits/CAHk-=wjS-Jg7sGMwUPpDsjv392nDOOs0CtUtVkp=S6Q7JzFJRw@mail.gmail.com
Signed-off-by: Ben Gardon <bgardon@google.com>
Link: https://lore.kernel.org/r/20230202182809.1929122-2-bgardon@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-17 16:02:06 -07:00
David Matlack
9d4655da1a KVM: x86/mmu: Use gfn_t in kvm_flush_remote_tlbs_range()
Use gfn_t instead of u64 for kvm_flush_remote_tlbs_range()'s parameters,
since gfn_t is the standard type for GFNs throughout KVM.

Opportunistically rename pages to nr_pages to make its role even more
obvious.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20230126184025.2294823-6-dmatlack@google.com
[sean: convert pages to gfn_t too, and rename]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-17 15:36:20 -07:00
David Matlack
8c63e8c217 KVM: x86/mmu: Rename kvm_flush_remote_tlbs_with_address()
Rename kvm_flush_remote_tlbs_with_address() to
kvm_flush_remote_tlbs_range(). This name is shorter, which reduces the
number of callsites that need to be broken up across multiple lines, and
more readable since it conveys a range of memory is being flushed rather
than a single address.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20230126184025.2294823-5-dmatlack@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-17 15:16:12 -07:00
David Matlack
28e4b4597d KVM: x86/mmu: Collapse kvm_flush_remote_tlbs_with_{range,address}() together
Collapse kvm_flush_remote_tlbs_with_range() and
kvm_flush_remote_tlbs_with_address() into a single function. This
eliminates some lines of code and a useless NULL check on the range
struct.

Opportunistically switch from ENOTSUPP to EOPNOTSUPP to make checkpatch
happy.

Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20230126184025.2294823-4-dmatlack@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-17 15:08:41 -07:00
Lai Jiangshan
141705b783 KVM: x86/mmu: Track tail count in pte_list_desc to optimize guest fork()
Rework "struct pte_list_desc" and pte_list_{add|remove} to track the tail
count, i.e. number of PTEs in non-head descriptors, and to always keep all
tail descriptors full so that adding a new entry and counting the number
of entries is done in constant time instead of linear time.

No visible performace is changed in tests.  But pte_list_add() is no longer
shown in the perf result for the COWed pages even the guest forks millions
of tasks.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Link: https://lore.kernel.org/r/20230113122910.672417-1-jiangshanlai@gmail.com
[sean: reword shortlog, tweak changelog, add lots of comments, add BUG_ON()]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-16 19:07:37 -07:00
Lai Jiangshan
19ace7d6ca KVM: x86/mmu: Skip calling mmu->sync_spte() when the spte is 0
Sync the spte only when the spte is set and avoid the indirect branch.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Link: https://lore.kernel.org/r/20230216235321.735214-5-jiangshanlai@gmail.com
[sean: add wrapper instead of open coding each check]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-16 17:19:55 -07:00
Lai Jiangshan
91ca7672dc kvm: x86/mmu: Remove @no_dirty_log from FNAME(prefetch_gpte)
FNAME(prefetch_gpte) is always called with @no_dirty_log=true.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Link: https://lore.kernel.org/r/20230216235321.735214-4-jiangshanlai@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-16 17:19:55 -07:00
Lai Jiangshan
9fd4a4e3a3 KVM: x86/mmu: Remove FNAME(invlpg) and use FNAME(sync_spte) to update vTLB instead.
In hardware TLB, invalidating TLB entries means the translations are
removed from the TLB.

In KVM shadowed vTLB, the translations (combinations of shadow paging
and hardware TLB) are generally maintained as long as they remain "clean"
when the TLB of an address space (i.e. a PCID or all) is flushed with
the help of write-protections, sp->unsync, and kvm_sync_page(), where
"clean" in this context means that no updates to KVM's SPTEs are needed.

However, FNAME(invlpg) always zaps/removes the vTLB if the shadow page is
unsync, and thus triggers a remote flush even if the original vTLB entry
is clean, i.e. is usable as-is.

Besides this, FNAME(invlpg) is largely is a duplicate implementation of
FNAME(sync_spte) to invalidate a vTLB entry.

To address both issues, reuse FNAME(sync_spte) to share the code and
slightly modify the semantics, i.e. keep the vTLB entry if it's "clean"
and avoid remote TLB flush.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Link: https://lore.kernel.org/r/20230216235321.735214-3-jiangshanlai@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-16 17:19:54 -07:00
Lai Jiangshan
ed335278bd KVM: x86/mmu: Allow the roots to be invalid in FNAME(invlpg)
Don't assume the current root to be valid, just check it and remove
the WARN().

Also move the code to check if the root is valid into FNAME(invlpg)
to simplify the code.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Link: https://lore.kernel.org/r/20230216235321.735214-2-jiangshanlai@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-16 17:19:53 -07:00
Lai Jiangshan
2c86c444e2 KVM: x86/mmu: Use kvm_mmu_invalidate_addr() in nested_ept_invalidate_addr()
Use kvm_mmu_invalidate_addr() instead open calls to mmu->invlpg().

No functional change intended.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Link: https://lore.kernel.org/r/20230216235321.735214-1-jiangshanlai@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-16 17:19:53 -07:00
Lai Jiangshan
9ebc3f51da KVM: x86/mmu: Use kvm_mmu_invalidate_addr() in kvm_mmu_invpcid_gva()
Use kvm_mmu_invalidate_addr() instead open calls to mmu->invlpg().

No functional change intended.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Link: https://lore.kernel.org/r/20230216154115.710033-10-jiangshanlai@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-16 17:19:52 -07:00
Lai Jiangshan
cd42853e95 kvm: x86/mmu: Use KVM_MMU_ROOT_XXX for kvm_mmu_invalidate_addr()
The @root_hpa for kvm_mmu_invalidate_addr() is called with @mmu->root.hpa
or INVALID_PAGE where @mmu->root.hpa is to invalidate gva for the current
root (the same meaning as KVM_MMU_ROOT_CURRENT) and INVALID_PAGE is to
invalidate gva for all roots (the same meaning as KVM_MMU_ROOTS_ALL).

Change the argument type of kvm_mmu_invalidate_addr() and use
KVM_MMU_ROOT_XXX instead so that we can reuse the function for
kvm_mmu_invpcid_gva() and nested_ept_invalidate_addr() for invalidating
gva for different set of roots.

No fuctionalities changed.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Link: https://lore.kernel.org/r/20230216154115.710033-9-jiangshanlai@gmail.com
[sean: massage comment slightly]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-16 17:19:51 -07:00
Sean Christopherson
f94db0c8b9 KVM: x86/mmu: Sanity check input to kvm_mmu_free_roots()
Tweak KVM_MMU_ROOTS_ALL to precisely cover all current+previous root
flags, and add a sanity in kvm_mmu_free_roots() to verify that the set
of roots to free doesn't stray outside KVM_MMU_ROOTS_ALL.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Link: https://lore.kernel.org/r/20230216154115.710033-8-jiangshanlai@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-16 17:19:51 -07:00
Lai Jiangshan
e6722d9211 KVM: x86/mmu: Reduce the update to the spte in FNAME(sync_spte)
Sometimes when the guest updates its pagetable, it adds only new gptes
to it without changing any existed one, so there is no point to update
the sptes for these existed gptes.

Also when the sptes for these unchanged gptes are updated, the AD
bits are also removed since make_spte() is called with prefetch=true
which might result unneeded TLB flushing.

Just do nothing if the gpte's permissions are unchanged.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Link: https://lore.kernel.org/r/20230216154115.710033-7-jiangshanlai@gmail.com
[sean: expand comment to call out A/D bits]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-16 17:19:48 -07:00
Lai Jiangshan
c3c6c9fc5d KVM: x86/mmu: Move the code out of FNAME(sync_page)'s loop body into mmu.c
Rename mmu->sync_page to mmu->sync_spte and move the code out
of FNAME(sync_page)'s loop body into mmu.c.

No functionalities change intended.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Link: https://lore.kernel.org/r/20230216154115.710033-6-jiangshanlai@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-16 17:19:44 -07:00
Lai Jiangshan
8ef228c20c KVM: x86/mmu: Set mmu->sync_page as NULL for direct paging
mmu->sync_page for direct paging is never called.

And both mmu->sync_page and mm->invlpg only make sense in shadow paging.
Setting mmu->sync_page as NULL for direct paging makes it consistent
with mm->invlpg which is set NULL for the case.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Link: https://lore.kernel.org/r/20230216154115.710033-5-jiangshanlai@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-16 12:49:53 -07:00
Lai Jiangshan
51dddf6c49 KVM: x86/mmu: Check mmu->sync_page pointer in kvm_sync_page_check()
Assert that mmu->sync_page is non-NULL as part of the sanity checks
performed before attempting to sync a shadow page.  Explicitly checking
mmu->sync_page is all but guaranteed to be redundant with the existing
sanity check that the MMU is indirect, but the cost is negligible, and
the explicit check also serves as documentation.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Link: https://lore.kernel.org/r/20230216154115.710033-4-jiangshanlai@gmail.com
[sean: increase verbosity of changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-16 12:44:19 -07:00
Lai Jiangshan
90e444702a KVM: x86/mmu: Move the check in FNAME(sync_page) as kvm_sync_page_check()
Prepare to check mmu->sync_page pointer before calling it.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Link: https://lore.kernel.org/r/20230216154115.710033-3-jiangshanlai@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-16 12:42:15 -07:00
Lai Jiangshan
753b43c9d1 KVM: x86/mmu: Use 64-bit address to invalidate to fix a subtle bug
FNAME(invlpg)() and kvm_mmu_invalidate_gva() take a gva_t, i.e. unsigned
long, as the type of the address to invalidate.  On 32-bit kernels, the
upper 32 bits of the GPA will get dropped when an L2 GPA address is
invalidated in the shadowed nested TDP MMU.

Convert it to u64 to fix the problem.

Reported-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Link: https://lore.kernel.org/r/20230216154115.710033-2-jiangshanlai@gmail.com
[sean: tweak changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-16 12:41:05 -07:00
Thomas Huth
d8708b80fa KVM: Change return type of kvm_arch_vm_ioctl() to "int"
All kvm_arch_vm_ioctl() implementations now only deal with "int"
types as return values, so we can change the return type of these
functions to use "int" instead of "long".

Signed-off-by: Thomas Huth <thuth@redhat.com>
Acked-by: Anup Patel <anup@brainfault.org>
Message-Id: <20230208140105.655814-7-thuth@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-03-16 10:18:07 -04:00
Thomas Huth
f15ba52bfa KVM: Standardize on "int" return types instead of "long" in kvm_main.c
KVM functions use "long" return values for functions that are wired up
to "struct file_operations", but otherwise use "int" return values for
functions that can return 0/-errno in order to avoid unintentional
divergences between 32-bit and 64-bit kernels.
Some code still uses "long" in unnecessary spots, though, which can
cause a little bit of confusion and unnecessary size casts. Let's
change these spots to use "int" types, too.

Signed-off-by: Thomas Huth <thuth@redhat.com>
Message-Id: <20230208140105.655814-6-thuth@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-03-16 10:18:07 -04:00
Thomas Huth
2def950c63 KVM: arm64: Limit length in kvm_vm_ioctl_mte_copy_tags() to INT_MAX
In case of success, this function returns the amount of handled bytes.
However, this does not work for large values: The function is called
from kvm_arch_vm_ioctl() (which still returns a long), which in turn
is called from kvm_vm_ioctl() in virt/kvm/kvm_main.c. And that function
stores the return value in an "int r" variable. So the upper 32-bits
of the "long" return value are lost there.

KVM ioctl functions should only return "int" values, so let's limit
the amount of bytes that can be requested here to INT_MAX to avoid
the problem with the truncated return value. We can then also change
the return type of the function to "int" to make it clearer that it
is not possible to return a "long" here.

Fixes: f0376edb1d ("KVM: arm64: Add ioctl to fetch/store tags in a guest")
Signed-off-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Message-Id: <20230208140105.655814-5-thuth@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-03-16 10:18:06 -04:00
Thomas Huth
c5edd753a0 KVM: x86: Remove the KVM_GET_NR_MMU_PAGES ioctl
The KVM_GET_NR_MMU_PAGES ioctl is quite questionable on 64-bit hosts
since it fails to return the full 64 bits of the value that can be
set with the corresponding KVM_SET_NR_MMU_PAGES call. Its "long" return
value is truncated into an "int" in the kvm_arch_vm_ioctl() function.

Since this ioctl also never has been used by userspace applications
(QEMU, Google's internal VMM, kvmtool and CrosVM have been checked),
it's likely the best if we remove this badly designed ioctl before
anybody really tries to use it.

Signed-off-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230208140105.655814-4-thuth@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-03-16 10:18:06 -04:00
Thomas Huth
71fb165e23 KVM: s390: Use "int" as return type for kvm_s390_get/set_skeys()
These two functions only return normal integers, so it does not
make sense to declare the return type as "long" here.

Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Thomas Huth <thuth@redhat.com>
Message-Id: <20230208140105.655814-3-thuth@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-03-16 10:18:06 -04:00
Thomas Huth
67c48662e2 KVM: PPC: Standardize on "int" return types in the powerpc KVM code
Most functions that are related to kvm_arch_vm_ioctl() already use
"int" as return type to pass error values back to the caller. Some
outlier functions use "long" instead for no good reason (they do not
really require long values here). Let's standardize on "int" here to
avoid casting the values back and forth between the two types.

Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Thomas Huth <thuth@redhat.com>
Message-Id: <20230208140105.655814-2-thuth@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-03-16 10:18:05 -04:00
Emanuele Giuseppe Esposito
45cf86f261 kvm: x86: Advertise FLUSH_L1D to user space
FLUSH_L1D was already added in 11e34e64e4, but the feature is not
visible to userspace yet.

The bit definition:
CPUID.(EAX=7,ECX=0):EDX[bit 28]

If the feature is supported by the host, kvm should support it too so
that userspace can choose whether to expose it to the guest or not.
One disadvantage of not exposing it is that the guest will report
a non existing vulnerability in
/sys/devices/system/cpu/vulnerabilities/mmio_stale_data
because the mitigation is present only if the guest supports
(FLUSH_L1D and MD_CLEAR) or FB_CLEAR.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20230201132905.549148-4-eesposit@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-03-16 10:18:05 -04:00
Emanuele Giuseppe Esposito
723d5fb0ff kvm: svm: Add IA32_FLUSH_CMD guest support
Expose IA32_FLUSH_CMD to the guest if the guest CPUID enumerates
support for this MSR. As with IA32_PRED_CMD, permission for
unintercepted writes to this MSR will be granted to the guest after
the first non-zero write.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20230201132905.549148-3-eesposit@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-03-16 10:18:04 -04:00
Emanuele Giuseppe Esposito
a807b78ad0 kvm: vmx: Add IA32_FLUSH_CMD guest support
Expose IA32_FLUSH_CMD to the guest if the guest CPUID enumerates
support for this MSR. As with IA32_PRED_CMD, permission for
unintercepted writes to this MSR will be granted to the guest after
the first non-zero write.

Co-developed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20230201132905.549148-2-eesposit@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-03-16 10:17:55 -04:00
Sean Christopherson
fbc722aac1 KVM: VMX: Rename "KVM is using eVMCS" static key to match its wrapper
Rename enable_evmcs to __kvm_is_using_evmcs to match its wrapper, and to
avoid confusion with enabling eVMCS for nested virtualization, i.e. have
"enable eVMCS" be reserved for "enable eVMCS support for L1".

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230211003534.564198-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-03-14 10:28:58 -04:00
Sean Christopherson
19f10315fd KVM: VMX: Stub out enable_evmcs static key for CONFIG_HYPERV=n
Wrap enable_evmcs in a helper and stub it out when CONFIG_HYPERV=n in
order to eliminate the static branch nop placeholders.  clang-14 is clever
enough to elide the nop, but gcc-12 is not.  Stubbing out the key reduces
the size of kvm-intel.ko by ~7.5% (200KiB) when compiled with gcc-12
(there are a _lot_ of VMCS accesses throughout KVM).

Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230211003534.564198-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-03-14 10:28:57 -04:00
Sean Christopherson
68ac422149 KVM: nVMX: Move EVMCS1_SUPPORT_* macros to hyperv.c
Move the macros that define the set of VMCS controls that are supported
by eVMCS1 from hyperv.h to hyperv.c, i.e. make them "private".   The
macros should never be consumed directly by KVM at-large since the "final"
set of supported controls depends on guest CPUID.

No functional change intended.

Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230211003534.564198-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-03-14 10:28:57 -04:00