mirror of
https://mirrors.bfsu.edu.cn/git/linux.git
synced 2024-11-15 08:14:15 +08:00
3ccea4784f
9976 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
Ingo Molnar
|
5f2ca44ed2 |
Merge branch 'linus' into x86/urgent, to pick up dependent commit
We want to fix:
|
||
Sean Christopherson
|
8cb4a9a82b |
x86/cpufeatures: Add CPUID_LNX_5 to track recently added Linux-defined word
Add CPUID_LNX_5 to track cpufeatures' word 21, and add the appropriate
compile-time assert in KVM to prevent direct lookups on the features in
CPUID_LNX_5. KVM uses X86_FEATURE_* flags to manage guest CPUID, and so
must translate features that are scattered by Linux from the Linux-defined
bit to the hardware-defined bit, i.e. should never try to directly access
scattered features in guest CPUID.
Opportunistically add NR_CPUID_WORDS to enum cpuid_leafs, along with a
compile-time assert in KVM's CPUID infrastructure to ensure that future
additions update cpuid_leafs along with NCAPINTS.
No functional change intended.
Fixes:
|
||
Borislav Petkov (AMD)
|
0ecaefb303 |
x86/CPU/AMD: Track SNP host status with cc_platform_*()
The host SNP worthiness can determined later, after alternatives have
been patched, in snp_rmptable_init() depending on cmdline options like
iommu=pt which is incompatible with SNP, for example.
Which means that one cannot use X86_FEATURE_SEV_SNP and will need to
have a special flag for that control.
Use that newly added CC_ATTR_HOST_SEV_SNP in the appropriate places.
Move kdump_sev_callback() to its rightful place, while at it.
Fixes:
|
||
Borislav Petkov (AMD)
|
54f5f47b60 |
x86/kvm/Kconfig: Have KVM_AMD_SEV select ARCH_HAS_CC_PLATFORM
The functionality to load SEV-SNP guests by the host will soon rely on
cc_platform* helpers because the cpu_feature* API with the early
patching is insufficient when SNP support needs to be disabled late.
Therefore, pull that functionality in.
Fixes:
|
||
Paolo Bonzini
|
52b761b48f |
KVM/arm64 fixes for 6.9, part #1
- Ensure perf events programmed to count during guest execution are actually enabled before entering the guest in the nVHE configuration. - Restore out-of-range handler for stage-2 translation faults. - Several fixes to stage-2 TLB invalidations to avoid stale translations, possibly including partial walk caches. - Fix early handling of architectural VHE-only systems to ensure E2H is appropriately set. - Correct a format specifier warning in the arch_timer selftest. - Make the KVM banner message correctly handle all of the possible configurations. -----BEGIN PGP SIGNATURE----- iI0EABYIADUWIQSNXHjWXuzMZutrKNKivnWIJHzdFgUCZgtpWBccb2xpdmVyLnVw dG9uQGxpbnV4LmRldgAKCRCivnWIJHzdFoilAQCQk6kLIeuih5QOe50fK4XkNsyg PGcxw0a0BP8cfjtJsgEArwLlfHQOTE4tRWtXyEHvapJfe/bE1hjLmzUJx7BwLQ4= =6hNq -----END PGP SIGNATURE----- Merge tag 'kvmarm-fixes-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 6.9, part #1 - Ensure perf events programmed to count during guest execution are actually enabled before entering the guest in the nVHE configuration. - Restore out-of-range handler for stage-2 translation faults. - Several fixes to stage-2 TLB invalidations to avoid stale translations, possibly including partial walk caches. - Fix early handling of architectural VHE-only systems to ensure E2H is appropriately set. - Correct a format specifier warning in the arch_timer selftest. - Make the KVM banner message correctly handle all of the possible configurations. |
||
Linus Torvalds
|
1d35aae78f |
Kbuild updates for v6.9
- Generate a list of built DTB files (arch/*/boot/dts/dtbs-list) - Use more threads when building Debian packages in parallel - Fix warnings shown during the RPM kernel package uninstallation - Change OBJECT_FILES_NON_STANDARD_*.o etc. to take a relative path to Makefile - Support GCC's -fmin-function-alignment flag - Fix a null pointer dereference bug in modpost - Add the DTB support to the RPM package - Various fixes and cleanups in Kconfig -----BEGIN PGP SIGNATURE----- iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmX8HGIVHG1hc2FoaXJv eUBrZXJuZWwub3JnAAoJED2LAQed4NsGYfIQAIl/zEFoNVSHGR4TIvO7SIwkT4MM VAm0W6XRFaXfIGw8HL/MXe+U9jAyeQ9yL9uUVv8PqFTO+LzBbW1X1X97tlmrlQsC 7mdxbA1KJXwkwt4wH/8/EZQMwHr327vtVH4AilSm+gAaWMXaSKAye3ulKQQ2gevz vP6aOcfbHIWOPdxA53cLdSl9LOGrYNczKySHXKV9O39T81F+ko7wPpdkiMWw5LWG ISRCV8bdXli8j10Pmg8jlbevSKl4Z5FG2BVw/Cl8rQ5tBBoCzFsUPnnp9A29G8QP OqRhbwxtkSm67BMJAYdHnhjp/l0AOEbmetTGpna+R06hirOuXhR3vc6YXZxhQjff LmKaqfG5YchRALS1fNDsRUNIkQxVJade+tOUG+V4WbxHQKWX7Ghu5EDlt2/x7P0p +XLPE48HoNQLQOJ+pgIOkaEDl7WLfGhoEtEgprZBuEP2h39xcdbYJyF10ZAAR4UZ FF6J9lDHbf7v1uqD2YnAQJQ6jJ06CvN6/s6SdiJnCWSs5cYRW0fnYigSIuwAgGHZ c/QFECoGEflXGGuqZDl5iXiIjhWKzH2nADSVEs7maP47vapcMWb9gA7VBNoOr5M0 IXuFo1khChF4V2pxqlDj3H5TkDlFENYT/Wjh+vvjx8XplKCRKaSh+LaZ39hja61V dWH7BPecS44h4KXx =tFdl -----END PGP SIGNATURE----- Merge tag 'kbuild-v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild updates from Masahiro Yamada: - Generate a list of built DTB files (arch/*/boot/dts/dtbs-list) - Use more threads when building Debian packages in parallel - Fix warnings shown during the RPM kernel package uninstallation - Change OBJECT_FILES_NON_STANDARD_*.o etc. to take a relative path to Makefile - Support GCC's -fmin-function-alignment flag - Fix a null pointer dereference bug in modpost - Add the DTB support to the RPM package - Various fixes and cleanups in Kconfig * tag 'kbuild-v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (67 commits) kconfig: tests: test dependency after shuffling choices kconfig: tests: add a test for randconfig with dependent choices kconfig: tests: support KCONFIG_SEED for the randconfig runner kbuild: rpm-pkg: add dtb files in kernel rpm kconfig: remove unneeded menu_is_visible() call in conf_write_defconfig() kconfig: check prompt for choice while parsing kconfig: lxdialog: remove unused dialog colors kconfig: lxdialog: fix button color for blackbg theme modpost: fix null pointer dereference kbuild: remove GCC's default -Wpacked-bitfield-compat flag kbuild: unexport abs_srctree and abs_objtree kbuild: Move -Wenum-{compare-conditional,enum-conversion} into W=1 kconfig: remove named choice support kconfig: use linked list in get_symbol_str() to iterate over menus kconfig: link menus to a symbol kbuild: fix inconsistent indentation in top Makefile kbuild: Use -fmin-function-alignment when available alpha: merge two entries for CONFIG_ALPHA_GAMMA alpha: merge two entries for CONFIG_ALPHA_EV4 kbuild: change DTC_FLAGS_<basetarget>.o to take the path relative to $(obj) ... |
||
Paolo Bonzini
|
0d1756482e |
Fix a bug in KVM_SET_CPUID{2,} where KVM looks at the wrong CPUID entries (old
vs. new) and ultimately neglects to clear PV_UNHALT from vCPUs with HLT-exiting disabled. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmX4yVUACgkQOlYIJqCj N/0BpQ/9Flr0fL9150AUb+yZofb0JTbVRgSNfvY12hr9vIp88KY/ryOw8OzlJy0v veXD3IqSxkClTp+i2ocRJi1zBVo3ww7s6VwWJwY9SkDEfIYyqRWu+Es/mHNZ/0HM BvMcwwyGDtHdZi2BHnztbfLzhh+AQvYm57RKBGyjTx76kdaYiiHwvHRIlJgYTC6q w4YBvInIys8Fj5dGKp1I72UvA0F+db9QOC4vxW/x/OAEcbMi6mMkEzdr3ftK5U/q 8K4h1OvE3PfMXR3S0HDoqnGCenGX/93REhduOO36SfP5gupN0TzkgQwqIAWpqvER zQFdJ3+/6H07q83tlhpThggD7qgqQeg2a/DhFnj6AK5ima44zg+MrW3v14D42hY1 GbBXz9CLWsnzm0ieZqaOhJW1Gx57a9AoXr5YZ7NGQxJ2fEaG7zSAzLMKP28+6PDT 1OXlozPVAMYNL8xZmkA5+QIoBMRUQVaRhXmoW1wr7NqUqHcm6ILQl6DOIM4sGGXL TPMGjBkZwLVv0J5rtcSIIPoXChcB5V1DqqMyIuu+arAzoR8ulcETdqb6kJvyP1HT GQHtinqq/nc0cpaNhkmB4WkLg7fvMlvz5YNPQAEs+2ZZGTiwAo05jMv1Gpky3yI6 XXQf+bhT7ghJdTJy0QKmUGw3YCDjrYXzfYfPEwwewVqAbIlrjFM= =o7dM -----END PGP SIGNATURE----- Merge tag 'kvm-x86-pvunhalt-6.9' of https://github.com/kvm-x86/linux into HEAD Fix a bug in KVM_SET_CPUID{2,} where KVM looks at the wrong CPUID entries (old vs. new) and ultimately neglects to clear PV_UNHALT from vCPUs with HLT-exiting disabled. |
||
Paolo Bonzini
|
1d55934ed5 |
KVM SVM changes for 6.9:
- Add support for systems that are configured with SEV and SEV-ES+ enabled, but have all ASIDs assigned to SEV-ES+ guests, which effectively makes SEV unusuable. Cleanup ASID handling to make supporting this scenario less brittle/ugly. - Return -EINVAL instead of -EBUSY if userspace attempts to invoke KVM_SEV{,ES}_INIT on an SEV+ guest. The operation is simply invalid, and not related to resource contention in any way. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmXeMssACgkQOlYIJqCj N/3UQg/8D5J0N1jqE6cnPsN3OA733Q+fRkfJd6zLUn5qJ8jqssxeNUiRCCUYIP8b ijuUB1/SCphQoIlAmy73+lmLOs2AMtW5Qaephekv4YZlSlsqIbIq12LJ88PGv/Gd WO6zxeWnIPh1jLvaHA5bqEg6VC/vyl0enCXaw6o0ll3UubAQ5wcHaYoW0SM28bT3 mHJJBjElgvV9845y3sZkWYYP4AYAbrhNWVJLYgxZjByCYPHo5h0bffZKzniWxAZQ kANkotYJ2mMXAnagmuUvxOBxzSSVn7dYijR6u7eAx5PPodv9mptrFyY0XdGl0o8O MexEF4IQRpJN4JhFmC0Wm0Zw42TDq+CSBv2YqHEfnpgN7BYjIqiefx3+DdaQ3fwp czd+EVHHqDOklyCpBmOtZAtqSrSNAJn7OJk36Q/SCaEMbmgyE1nCNAZ7CubHpwET 9jGumcQ2gd+fcw8Ju8ehxD9su7tQun93gIZ5DGGcw3/x0P85V5eWvafjqv5lNnZ+ 5uwHFqt9Bir1Pdk59MyWpIH1YZ//Us3KYe+yApRwyjxMpiilrkYYowvQbu0/3BKo 0WcIDnTezYlF1EdHBruok/lgmIKm04FrlbxwAGFUFD0ClBSwZCr9K59gczX3v4sq giI4lWoHwRN79hM6QioeJcFDzSaxos9hppgcAw0+1fL8RsOPedA= =9jK/ -----END PGP SIGNATURE----- Merge tag 'kvm-x86-svm-6.9' of https://github.com/kvm-x86/linux into HEAD KVM SVM changes for 6.9: - Add support for systems that are configured with SEV and SEV-ES+ enabled, but have all ASIDs assigned to SEV-ES+ guests, which effectively makes SEV unusuable. Cleanup ASID handling to make supporting this scenario less brittle/ugly. - Return -EINVAL instead of -EBUSY if userspace attempts to invoke KVM_SEV{,ES}_INIT on an SEV+ guest. The operation is simply invalid, and not related to resource contention in any way. |
||
Linus Torvalds
|
4f712ee0cb |
S390:
* Changes to FPU handling came in via the main s390 pull request * Only deliver to the guest the SCLP events that userspace has requested. * More virtual vs physical address fixes (only a cleanup since virtual and physical address spaces are currently the same). * Fix selftests undefined behavior. x86: * Fix a restriction that the guest can't program a PMU event whose encoding matches an architectural event that isn't included in the guest CPUID. The enumeration of an architectural event only says that if a CPU supports an architectural event, then the event can be programmed *using the architectural encoding*. The enumeration does NOT say anything about the encoding when the CPU doesn't report support the event *in general*. It might support it, and it might support it using the same encoding that made it into the architectural PMU spec. * Fix a variety of bugs in KVM's emulation of RDPMC (more details on individual commits) and add a selftest to verify KVM correctly emulates RDMPC, counter availability, and a variety of other PMC-related behaviors that depend on guest CPUID and therefore are easier to validate with selftests than with custom guests (aka kvm-unit-tests). * Zero out PMU state on AMD if the virtual PMU is disabled, it does not cause any bug but it wastes time in various cases where KVM would check if a PMC event needs to be synthesized. * Optimize triggering of emulated events, with a nice ~10% performance improvement in VM-Exit microbenchmarks when a vPMU is exposed to the guest. * Tighten the check for "PMI in guest" to reduce false positives if an NMI arrives in the host while KVM is handling an IRQ VM-Exit. * Fix a bug where KVM would report stale/bogus exit qualification information when exiting to userspace with an internal error exit code. * Add a VMX flag in /proc/cpuinfo to report 5-level EPT support. * Rework TDP MMU root unload, free, and alloc to run with mmu_lock held for read, e.g. to avoid serializing vCPUs when userspace deletes a memslot. * Tear down TDP MMU page tables at 4KiB granularity (used to be 1GiB). KVM doesn't support yielding in the middle of processing a zap, and 1GiB granularity resulted in multi-millisecond lags that are quite impolite for CONFIG_PREEMPT kernels. * Allocate write-tracking metadata on-demand to avoid the memory overhead when a kernel is built with i915 virtualization support but the workloads use neither shadow paging nor i915 virtualization. * Explicitly initialize a variety of on-stack variables in the emulator that triggered KMSAN false positives. * Fix the debugregs ABI for 32-bit KVM. * Rework the "force immediate exit" code so that vendor code ultimately decides how and when to force the exit, which allowed some optimization for both Intel and AMD. * Fix a long-standing bug where kvm_has_noapic_vcpu could be left elevated if vCPU creation ultimately failed, causing extra unnecessary work. * Cleanup the logic for checking if the currently loaded vCPU is in-kernel. * Harden against underflowing the active mmu_notifier invalidation count, so that "bad" invalidations (usually due to bugs elsehwere in the kernel) are detected earlier and are less likely to hang the kernel. x86 Xen emulation: * Overlay pages can now be cached based on host virtual address, instead of guest physical addresses. This removes the need to reconfigure and invalidate the cache if the guest changes the gpa but the underlying host virtual address remains the same. * When possible, use a single host TSC value when computing the deadline for Xen timers in order to improve the accuracy of the timer emulation. * Inject pending upcall events when the vCPU software-enables its APIC to fix a bug where an upcall can be lost (and to follow Xen's behavior). * Fall back to the slow path instead of warning if "fast" IRQ delivery of Xen events fails, e.g. if the guest has aliased xAPIC IDs. RISC-V: * Support exception and interrupt handling in selftests * New self test for RISC-V architectural timer (Sstc extension) * New extension support (Ztso, Zacas) * Support userspace emulation of random number seed CSRs. ARM: * Infrastructure for building KVM's trap configuration based on the architectural features (or lack thereof) advertised in the VM's ID registers * Support for mapping vfio-pci BARs as Normal-NC (vaguely similar to x86's WC) at stage-2, improving the performance of interacting with assigned devices that can tolerate it * Conversion of KVM's representation of LPIs to an xarray, utilized to address serialization some of the serialization on the LPI injection path * Support for _architectural_ VHE-only systems, advertised through the absence of FEAT_E2H0 in the CPU's ID register * Miscellaneous cleanups, fixes, and spelling corrections to KVM and selftests LoongArch: * Set reserved bits as zero in CPUCFG. * Start SW timer only when vcpu is blocking. * Do not restart SW timer when it is expired. * Remove unnecessary CSR register saving during enter guest. * Misc cleanups and fixes as usual. Generic: * cleanup Kconfig by removing CONFIG_HAVE_KVM, which was basically always true on all architectures except MIPS (where Kconfig determines the available depending on CPU capabilities). It is replaced either by an architecture-dependent symbol for MIPS, and IS_ENABLED(CONFIG_KVM) everywhere else. * Factor common "select" statements in common code instead of requiring each architecture to specify it * Remove thoroughly obsolete APIs from the uapi headers. * Move architecture-dependent stuff to uapi/asm/kvm.h * Always flush the async page fault workqueue when a work item is being removed, especially during vCPU destruction, to ensure that there are no workers running in KVM code when all references to KVM-the-module are gone, i.e. to prevent a very unlikely use-after-free if kvm.ko is unloaded. * Grab a reference to the VM's mm_struct in the async #PF worker itself instead of gifting the worker a reference, so that there's no need to remember to *conditionally* clean up after the worker. Selftests: * Reduce boilerplate especially when utilize selftest TAP infrastructure. * Add basic smoke tests for SEV and SEV-ES, along with a pile of library support for handling private/encrypted/protected memory. * Fix benign bugs where tests neglect to close() guest_memfd files. -----BEGIN PGP SIGNATURE----- iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmX0iP8UHHBib256aW5p QHJlZGhhdC5jb20ACgkQv/vSX3jHroND7wf+JZoNvwZ+bmwWe/4jn/YwNoYi/C5z eypn8M1gsWEccpCpqPBwznVm9T29rF4uOlcMvqLEkHfTpaL1EKUUjP1lXPz/ileP 6a2RdOGxAhyTiFC9fjy+wkkjtLbn1kZf6YsS0hjphP9+w0chNbdn0w81dFVnXryd j7XYI8R/bFAthNsJOuZXSEjCfIHxvTTG74OrTf1B1FEBB+arPmrgUeJftMVhffQK Sowgg8L/Ii/x6fgV5NZQVSIyVf1rp8z7c6UaHT4Fwb0+RAMW8p9pYv9Qp1YkKp8y 5j0V9UzOHP7FRaYimZ5BtwQoqiZXYylQ+VuU/Y2f4X85cvlLzSqxaEMAPA== =mqOV -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm updates from Paolo Bonzini: "S390: - Changes to FPU handling came in via the main s390 pull request - Only deliver to the guest the SCLP events that userspace has requested - More virtual vs physical address fixes (only a cleanup since virtual and physical address spaces are currently the same) - Fix selftests undefined behavior x86: - Fix a restriction that the guest can't program a PMU event whose encoding matches an architectural event that isn't included in the guest CPUID. The enumeration of an architectural event only says that if a CPU supports an architectural event, then the event can be programmed *using the architectural encoding*. The enumeration does NOT say anything about the encoding when the CPU doesn't report support the event *in general*. It might support it, and it might support it using the same encoding that made it into the architectural PMU spec - Fix a variety of bugs in KVM's emulation of RDPMC (more details on individual commits) and add a selftest to verify KVM correctly emulates RDMPC, counter availability, and a variety of other PMC-related behaviors that depend on guest CPUID and therefore are easier to validate with selftests than with custom guests (aka kvm-unit-tests) - Zero out PMU state on AMD if the virtual PMU is disabled, it does not cause any bug but it wastes time in various cases where KVM would check if a PMC event needs to be synthesized - Optimize triggering of emulated events, with a nice ~10% performance improvement in VM-Exit microbenchmarks when a vPMU is exposed to the guest - Tighten the check for "PMI in guest" to reduce false positives if an NMI arrives in the host while KVM is handling an IRQ VM-Exit - Fix a bug where KVM would report stale/bogus exit qualification information when exiting to userspace with an internal error exit code - Add a VMX flag in /proc/cpuinfo to report 5-level EPT support - Rework TDP MMU root unload, free, and alloc to run with mmu_lock held for read, e.g. to avoid serializing vCPUs when userspace deletes a memslot - Tear down TDP MMU page tables at 4KiB granularity (used to be 1GiB). KVM doesn't support yielding in the middle of processing a zap, and 1GiB granularity resulted in multi-millisecond lags that are quite impolite for CONFIG_PREEMPT kernels - Allocate write-tracking metadata on-demand to avoid the memory overhead when a kernel is built with i915 virtualization support but the workloads use neither shadow paging nor i915 virtualization - Explicitly initialize a variety of on-stack variables in the emulator that triggered KMSAN false positives - Fix the debugregs ABI for 32-bit KVM - Rework the "force immediate exit" code so that vendor code ultimately decides how and when to force the exit, which allowed some optimization for both Intel and AMD - Fix a long-standing bug where kvm_has_noapic_vcpu could be left elevated if vCPU creation ultimately failed, causing extra unnecessary work - Cleanup the logic for checking if the currently loaded vCPU is in-kernel - Harden against underflowing the active mmu_notifier invalidation count, so that "bad" invalidations (usually due to bugs elsehwere in the kernel) are detected earlier and are less likely to hang the kernel x86 Xen emulation: - Overlay pages can now be cached based on host virtual address, instead of guest physical addresses. This removes the need to reconfigure and invalidate the cache if the guest changes the gpa but the underlying host virtual address remains the same - When possible, use a single host TSC value when computing the deadline for Xen timers in order to improve the accuracy of the timer emulation - Inject pending upcall events when the vCPU software-enables its APIC to fix a bug where an upcall can be lost (and to follow Xen's behavior) - Fall back to the slow path instead of warning if "fast" IRQ delivery of Xen events fails, e.g. if the guest has aliased xAPIC IDs RISC-V: - Support exception and interrupt handling in selftests - New self test for RISC-V architectural timer (Sstc extension) - New extension support (Ztso, Zacas) - Support userspace emulation of random number seed CSRs ARM: - Infrastructure for building KVM's trap configuration based on the architectural features (or lack thereof) advertised in the VM's ID registers - Support for mapping vfio-pci BARs as Normal-NC (vaguely similar to x86's WC) at stage-2, improving the performance of interacting with assigned devices that can tolerate it - Conversion of KVM's representation of LPIs to an xarray, utilized to address serialization some of the serialization on the LPI injection path - Support for _architectural_ VHE-only systems, advertised through the absence of FEAT_E2H0 in the CPU's ID register - Miscellaneous cleanups, fixes, and spelling corrections to KVM and selftests LoongArch: - Set reserved bits as zero in CPUCFG - Start SW timer only when vcpu is blocking - Do not restart SW timer when it is expired - Remove unnecessary CSR register saving during enter guest - Misc cleanups and fixes as usual Generic: - Clean up Kconfig by removing CONFIG_HAVE_KVM, which was basically always true on all architectures except MIPS (where Kconfig determines the available depending on CPU capabilities). It is replaced either by an architecture-dependent symbol for MIPS, and IS_ENABLED(CONFIG_KVM) everywhere else - Factor common "select" statements in common code instead of requiring each architecture to specify it - Remove thoroughly obsolete APIs from the uapi headers - Move architecture-dependent stuff to uapi/asm/kvm.h - Always flush the async page fault workqueue when a work item is being removed, especially during vCPU destruction, to ensure that there are no workers running in KVM code when all references to KVM-the-module are gone, i.e. to prevent a very unlikely use-after-free if kvm.ko is unloaded - Grab a reference to the VM's mm_struct in the async #PF worker itself instead of gifting the worker a reference, so that there's no need to remember to *conditionally* clean up after the worker Selftests: - Reduce boilerplate especially when utilize selftest TAP infrastructure - Add basic smoke tests for SEV and SEV-ES, along with a pile of library support for handling private/encrypted/protected memory - Fix benign bugs where tests neglect to close() guest_memfd files" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (246 commits) selftests: kvm: remove meaningless assignments in Makefiles KVM: riscv: selftests: Add Zacas extension to get-reg-list test RISC-V: KVM: Allow Zacas extension for Guest/VM KVM: riscv: selftests: Add Ztso extension to get-reg-list test RISC-V: KVM: Allow Ztso extension for Guest/VM RISC-V: KVM: Forward SEED CSR access to user space KVM: riscv: selftests: Add sstc timer test KVM: riscv: selftests: Change vcpu_has_ext to a common function KVM: riscv: selftests: Add guest helper to get vcpu id KVM: riscv: selftests: Add exception handling support LoongArch: KVM: Remove unnecessary CSR register saving during enter guest LoongArch: KVM: Do not restart SW timer when it is expired LoongArch: KVM: Start SW timer only when vcpu is blocking LoongArch: KVM: Set reserved bits as zero in CPUCFG KVM: selftests: Explicitly close guest_memfd files in some gmem tests KVM: x86/xen: fix recursive deadlock in timer injection KVM: pfncache: simplify locking and make more self-contained KVM: x86/xen: remove WARN_ON_ONCE() with false positives in evtchn delivery KVM: x86/xen: inject vCPU upcall vector when local APIC is enabled KVM: x86/xen: improve accuracy of Xen timers ... |
||
Linus Torvalds
|
902861e34c |
- Sumanth Korikkar has taught s390 to allocate hotplug-time page frames
from hotplugged memory rather than only from main memory. Series "implement "memmap on memory" feature on s390". - More folio conversions from Matthew Wilcox in the series "Convert memcontrol charge moving to use folios" "mm: convert mm counter to take a folio" - Chengming Zhou has optimized zswap's rbtree locking, providing significant reductions in system time and modest but measurable reductions in overall runtimes. The series is "mm/zswap: optimize the scalability of zswap rb-tree". - Chengming Zhou has also provided the series "mm/zswap: optimize zswap lru list" which provides measurable runtime benefits in some swap-intensive situations. - And Chengming Zhou further optimizes zswap in the series "mm/zswap: optimize for dynamic zswap_pools". Measured improvements are modest. - zswap cleanups and simplifications from Yosry Ahmed in the series "mm: zswap: simplify zswap_swapoff()". - In the series "Add DAX ABI for memmap_on_memory", Vishal Verma has contributed several DAX cleanups as well as adding a sysfs tunable to control the memmap_on_memory setting when the dax device is hotplugged as system memory. - Johannes Weiner has added the large series "mm: zswap: cleanups", which does that. - More DAMON work from SeongJae Park in the series "mm/damon: make DAMON debugfs interface deprecation unignorable" "selftests/damon: add more tests for core functionalities and corner cases" "Docs/mm/damon: misc readability improvements" "mm/damon: let DAMOS feeds and tame/auto-tune itself" - In the series "mm/mempolicy: weighted interleave mempolicy and sysfs extension" Rakie Kim has developed a new mempolicy interleaving policy wherein we allocate memory across nodes in a weighted fashion rather than uniformly. This is beneficial in heterogeneous memory environments appearing with CXL. - Christophe Leroy has contributed some cleanup and consolidation work against the ARM pagetable dumping code in the series "mm: ptdump: Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute". - Luis Chamberlain has added some additional xarray selftesting in the series "test_xarray: advanced API multi-index tests". - Muhammad Usama Anjum has reworked the selftest code to make its human-readable output conform to the TAP ("Test Anything Protocol") format. Amongst other things, this opens up the use of third-party tools to parse and process out selftesting results. - Ryan Roberts has added fork()-time PTE batching of THP ptes in the series "mm/memory: optimize fork() with PTE-mapped THP". Mainly targeted at arm64, this significantly speeds up fork() when the process has a large number of pte-mapped folios. - David Hildenbrand also gets in on the THP pte batching game in his series "mm/memory: optimize unmap/zap with PTE-mapped THP". It implements batching during munmap() and other pte teardown situations. The microbenchmark improvements are nice. - And in the series "Transparent Contiguous PTEs for User Mappings" Ryan Roberts further utilizes arm's pte's contiguous bit ("contpte mappings"). Kernel build times on arm64 improved nicely. Ryan's series "Address some contpte nits" provides some followup work. - In the series "mm/hugetlb: Restore the reservation" Breno Leitao has fixed an obscure hugetlb race which was causing unnecessary page faults. He has also added a reproducer under the selftest code. - In the series "selftests/mm: Output cleanups for the compaction test", Mark Brown did what the title claims. - Kinsey Ho has added the series "mm/mglru: code cleanup and refactoring". - Even more zswap material from Nhat Pham. The series "fix and extend zswap kselftests" does as claimed. - In the series "Introduce cpu_dcache_is_aliasing() to fix DAX regression" Mathieu Desnoyers has cleaned up and fixed rather a mess in our handling of DAX on archiecctures which have virtually aliasing data caches. The arm architecture is the main beneficiary. - Lokesh Gidra's series "per-vma locks in userfaultfd" provides dramatic improvements in worst-case mmap_lock hold times during certain userfaultfd operations. - Some page_owner enhancements and maintenance work from Oscar Salvador in his series "page_owner: print stacks and their outstanding allocations" "page_owner: Fixup and cleanup" - Uladzislau Rezki has contributed some vmalloc scalability improvements in his series "Mitigate a vmap lock contention". It realizes a 12x improvement for a certain microbenchmark. - Some kexec/crash cleanup work from Baoquan He in the series "Split crash out from kexec and clean up related config items". - Some zsmalloc maintenance work from Chengming Zhou in the series "mm/zsmalloc: fix and optimize objects/page migration" "mm/zsmalloc: some cleanup for get/set_zspage_mapping()" - Zi Yan has taught the MM to perform compaction on folios larger than order=0. This a step along the path to implementaton of the merging of large anonymous folios. The series is named "Enable >0 order folio memory compaction". - Christoph Hellwig has done quite a lot of cleanup work in the pagecache writeback code in his series "convert write_cache_pages() to an iterator". - Some modest hugetlb cleanups and speedups in Vishal Moola's series "Handle hugetlb faults under the VMA lock". - Zi Yan has changed the page splitting code so we can split huge pages into sizes other than order-0 to better utilize large folios. The series is named "Split a folio to any lower order folios". - David Hildenbrand has contributed the series "mm: remove total_mapcount()", a cleanup. - Matthew Wilcox has sought to improve the performance of bulk memory freeing in his series "Rearrange batched folio freeing". - Gang Li's series "hugetlb: parallelize hugetlb page init on boot" provides large improvements in bootup times on large machines which are configured to use large numbers of hugetlb pages. - Matthew Wilcox's series "PageFlags cleanups" does that. - Qi Zheng's series "minor fixes and supplement for ptdesc" does that also. S390 is affected. - Cleanups to our pagemap utility functions from Peter Xu in his series "mm/treewide: Replace pXd_large() with pXd_leaf()". - Nico Pache has fixed a few things with our hugepage selftests in his series "selftests/mm: Improve Hugepage Test Handling in MM Selftests". - Also, of course, many singleton patches to many things. Please see the individual changelogs for details. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZfJpPQAKCRDdBJ7gKXxA joxeAP9TrcMEuHnLmBlhIXkWbIR4+ki+pA3v+gNTlJiBhnfVSgD9G55t1aBaRplx TMNhHfyiHYDTx/GAV9NXW84tasJSDgA= =TG55 -----END PGP SIGNATURE----- Merge tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Sumanth Korikkar has taught s390 to allocate hotplug-time page frames from hotplugged memory rather than only from main memory. Series "implement "memmap on memory" feature on s390". - More folio conversions from Matthew Wilcox in the series "Convert memcontrol charge moving to use folios" "mm: convert mm counter to take a folio" - Chengming Zhou has optimized zswap's rbtree locking, providing significant reductions in system time and modest but measurable reductions in overall runtimes. The series is "mm/zswap: optimize the scalability of zswap rb-tree". - Chengming Zhou has also provided the series "mm/zswap: optimize zswap lru list" which provides measurable runtime benefits in some swap-intensive situations. - And Chengming Zhou further optimizes zswap in the series "mm/zswap: optimize for dynamic zswap_pools". Measured improvements are modest. - zswap cleanups and simplifications from Yosry Ahmed in the series "mm: zswap: simplify zswap_swapoff()". - In the series "Add DAX ABI for memmap_on_memory", Vishal Verma has contributed several DAX cleanups as well as adding a sysfs tunable to control the memmap_on_memory setting when the dax device is hotplugged as system memory. - Johannes Weiner has added the large series "mm: zswap: cleanups", which does that. - More DAMON work from SeongJae Park in the series "mm/damon: make DAMON debugfs interface deprecation unignorable" "selftests/damon: add more tests for core functionalities and corner cases" "Docs/mm/damon: misc readability improvements" "mm/damon: let DAMOS feeds and tame/auto-tune itself" - In the series "mm/mempolicy: weighted interleave mempolicy and sysfs extension" Rakie Kim has developed a new mempolicy interleaving policy wherein we allocate memory across nodes in a weighted fashion rather than uniformly. This is beneficial in heterogeneous memory environments appearing with CXL. - Christophe Leroy has contributed some cleanup and consolidation work against the ARM pagetable dumping code in the series "mm: ptdump: Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute". - Luis Chamberlain has added some additional xarray selftesting in the series "test_xarray: advanced API multi-index tests". - Muhammad Usama Anjum has reworked the selftest code to make its human-readable output conform to the TAP ("Test Anything Protocol") format. Amongst other things, this opens up the use of third-party tools to parse and process out selftesting results. - Ryan Roberts has added fork()-time PTE batching of THP ptes in the series "mm/memory: optimize fork() with PTE-mapped THP". Mainly targeted at arm64, this significantly speeds up fork() when the process has a large number of pte-mapped folios. - David Hildenbrand also gets in on the THP pte batching game in his series "mm/memory: optimize unmap/zap with PTE-mapped THP". It implements batching during munmap() and other pte teardown situations. The microbenchmark improvements are nice. - And in the series "Transparent Contiguous PTEs for User Mappings" Ryan Roberts further utilizes arm's pte's contiguous bit ("contpte mappings"). Kernel build times on arm64 improved nicely. Ryan's series "Address some contpte nits" provides some followup work. - In the series "mm/hugetlb: Restore the reservation" Breno Leitao has fixed an obscure hugetlb race which was causing unnecessary page faults. He has also added a reproducer under the selftest code. - In the series "selftests/mm: Output cleanups for the compaction test", Mark Brown did what the title claims. - Kinsey Ho has added the series "mm/mglru: code cleanup and refactoring". - Even more zswap material from Nhat Pham. The series "fix and extend zswap kselftests" does as claimed. - In the series "Introduce cpu_dcache_is_aliasing() to fix DAX regression" Mathieu Desnoyers has cleaned up and fixed rather a mess in our handling of DAX on archiecctures which have virtually aliasing data caches. The arm architecture is the main beneficiary. - Lokesh Gidra's series "per-vma locks in userfaultfd" provides dramatic improvements in worst-case mmap_lock hold times during certain userfaultfd operations. - Some page_owner enhancements and maintenance work from Oscar Salvador in his series "page_owner: print stacks and their outstanding allocations" "page_owner: Fixup and cleanup" - Uladzislau Rezki has contributed some vmalloc scalability improvements in his series "Mitigate a vmap lock contention". It realizes a 12x improvement for a certain microbenchmark. - Some kexec/crash cleanup work from Baoquan He in the series "Split crash out from kexec and clean up related config items". - Some zsmalloc maintenance work from Chengming Zhou in the series "mm/zsmalloc: fix and optimize objects/page migration" "mm/zsmalloc: some cleanup for get/set_zspage_mapping()" - Zi Yan has taught the MM to perform compaction on folios larger than order=0. This a step along the path to implementaton of the merging of large anonymous folios. The series is named "Enable >0 order folio memory compaction". - Christoph Hellwig has done quite a lot of cleanup work in the pagecache writeback code in his series "convert write_cache_pages() to an iterator". - Some modest hugetlb cleanups and speedups in Vishal Moola's series "Handle hugetlb faults under the VMA lock". - Zi Yan has changed the page splitting code so we can split huge pages into sizes other than order-0 to better utilize large folios. The series is named "Split a folio to any lower order folios". - David Hildenbrand has contributed the series "mm: remove total_mapcount()", a cleanup. - Matthew Wilcox has sought to improve the performance of bulk memory freeing in his series "Rearrange batched folio freeing". - Gang Li's series "hugetlb: parallelize hugetlb page init on boot" provides large improvements in bootup times on large machines which are configured to use large numbers of hugetlb pages. - Matthew Wilcox's series "PageFlags cleanups" does that. - Qi Zheng's series "minor fixes and supplement for ptdesc" does that also. S390 is affected. - Cleanups to our pagemap utility functions from Peter Xu in his series "mm/treewide: Replace pXd_large() with pXd_leaf()". - Nico Pache has fixed a few things with our hugepage selftests in his series "selftests/mm: Improve Hugepage Test Handling in MM Selftests". - Also, of course, many singleton patches to many things. Please see the individual changelogs for details. * tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (435 commits) mm/zswap: remove the memcpy if acomp is not sleepable crypto: introduce: acomp_is_async to expose if comp drivers might sleep memtest: use {READ,WRITE}_ONCE in memory scanning mm: prohibit the last subpage from reusing the entire large folio mm: recover pud_leaf() definitions in nopmd case selftests/mm: skip the hugetlb-madvise tests on unmet hugepage requirements selftests/mm: skip uffd hugetlb tests with insufficient hugepages selftests/mm: dont fail testsuite due to a lack of hugepages mm/huge_memory: skip invalid debugfs new_order input for folio split mm/huge_memory: check new folio order when split a folio mm, vmscan: retry kswapd's priority loop with cache_trim_mode off on failure mm: add an explicit smp_wmb() to UFFDIO_CONTINUE mm: fix list corruption in put_pages_list mm: remove folio from deferred split list before uncharging it filemap: avoid unnecessary major faults in filemap_fault() mm,page_owner: drop unnecessary check mm,page_owner: check for null stack_record before bumping its refcount mm: swap: fix race between free_swap_and_cache() and swapoff() mm/treewide: align up pXd_leaf() retval across archs mm/treewide: drop pXd_large() ... |
||
Linus Torvalds
|
216532e147 |
hardening updates for v6.9-rc1
- string.h and related header cleanups (Tanzir Hasan, Andy Shevchenko) - VMCI memcpy() usage and struct_size() cleanups (Vasiliy Kovalev, Harshit Mogalapalli) - selftests/powerpc: Fix load_unaligned_zeropad build failure (Michael Ellerman) - hardened Kconfig fragment updates (Marco Elver, Lukas Bulwahn) - Handle tail call optimization better in LKDTM (Douglas Anderson) - Use long form types in overflow.h (Andy Shevchenko) - Add flags param to string_get_size() (Andy Shevchenko) - Add Coccinelle script for potential struct_size() use (Jacob Keller) - Fix objtool corner case under KCFI (Josh Poimboeuf) - Drop 13 year old backward compat CAP_SYS_ADMIN check (Jingzi Meng) - Add str_plural() helper (Michal Wajdeczko, Kees Cook) - Ignore relocations in .notes section - Add comments to explain how __is_constexpr() works - Fix m68k stack alignment expectations in stackinit Kunit test - Convert string selftests to KUnit - Add KUnit tests for fortified string functions - Improve reporting during fortified string warnings - Allow non-type arg to type_max() and type_min() - Allow strscpy() to be called with only 2 arguments - Add binary mode to leaking_addresses scanner - Various small cleanups to leaking_addresses scanner - Adding wrapping_*() arithmetic helper - Annotate initial signed integer wrap-around in refcount_t - Add explicit UBSAN section to MAINTAINERS - Fix UBSAN self-test warnings - Simplify UBSAN build via removal of CONFIG_UBSAN_SANITIZE_ALL - Reintroduce UBSAN's signed overflow sanitizer -----BEGIN PGP SIGNATURE----- iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmXvm5kWHGtlZXNjb29r QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJiQqD/4mM6SWZpYHKlR1nEiqIyz7Hqr9 g4oguuw6HIVNJXLyeBI5Hd43CTeHPA0e++EETqhUAt7HhErxfYJY+JB221nRYmu+ zhhQ7N/xbTMV/Je7AR03kQjhiMm8LyEcM2X4BNrsAcoCieQzmO3g0zSp8ISzLUE0 PEEmf1lOzMe3gK2KOFCPt5Hiz9sGWyN6at+BQubY18tQGtjEXYAQNXkpD5qhGn4a EF693r/17wmc8hvSsjf4AGaWy1k8crG0WfpMCZsaqftjj0BbvOC60IDyx4eFjpcy tGyAJKETq161AkCdNweIh2Q107fG3tm0fcvw2dv8Wt1eQCko6M8dUGCBinQs/thh TexjJFS/XbSz+IvxLqgU+C5qkOP23E0M9m1dbIbOFxJAya/5n16WOBlGr3ae2Wdq /+t8wVSJw3vZiku5emWdFYP1VsdIHUjVa5QizFaaRhzLGRwhxVV49SP4IQC/5oM5 3MAgNOFTP6yRQn9Y9wP+SZs+SsfaIE7yfKa9zOi4S+Ve+LI2v4YFhh8NCRiLkeWZ R1dhp8Pgtuq76f/v0qUaWcuuVeGfJ37M31KOGIhi1sI/3sr7UMrngL8D1+F8UZMi zcLu+x4GtfUZCHl6znx1rNUBqE5S/5ndVhLpOqfCXKaQ+RAm7lkOJ3jXE2VhNkhp yVEmeSOLnlCaQjZvXQ== =OP+o -----END PGP SIGNATURE----- Merge tag 'hardening-v6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull hardening updates from Kees Cook: "As is pretty normal for this tree, there are changes all over the place, especially for small fixes, selftest improvements, and improved macro usability. Some header changes ended up landing via this tree as they depended on the string header cleanups. Also, a notable set of changes is the work for the reintroduction of the UBSAN signed integer overflow sanitizer so that we can continue to make improvements on the compiler side to make this sanitizer a more viable future security hardening option. Summary: - string.h and related header cleanups (Tanzir Hasan, Andy Shevchenko) - VMCI memcpy() usage and struct_size() cleanups (Vasiliy Kovalev, Harshit Mogalapalli) - selftests/powerpc: Fix load_unaligned_zeropad build failure (Michael Ellerman) - hardened Kconfig fragment updates (Marco Elver, Lukas Bulwahn) - Handle tail call optimization better in LKDTM (Douglas Anderson) - Use long form types in overflow.h (Andy Shevchenko) - Add flags param to string_get_size() (Andy Shevchenko) - Add Coccinelle script for potential struct_size() use (Jacob Keller) - Fix objtool corner case under KCFI (Josh Poimboeuf) - Drop 13 year old backward compat CAP_SYS_ADMIN check (Jingzi Meng) - Add str_plural() helper (Michal Wajdeczko, Kees Cook) - Ignore relocations in .notes section - Add comments to explain how __is_constexpr() works - Fix m68k stack alignment expectations in stackinit Kunit test - Convert string selftests to KUnit - Add KUnit tests for fortified string functions - Improve reporting during fortified string warnings - Allow non-type arg to type_max() and type_min() - Allow strscpy() to be called with only 2 arguments - Add binary mode to leaking_addresses scanner - Various small cleanups to leaking_addresses scanner - Adding wrapping_*() arithmetic helper - Annotate initial signed integer wrap-around in refcount_t - Add explicit UBSAN section to MAINTAINERS - Fix UBSAN self-test warnings - Simplify UBSAN build via removal of CONFIG_UBSAN_SANITIZE_ALL - Reintroduce UBSAN's signed overflow sanitizer" * tag 'hardening-v6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (51 commits) selftests/powerpc: Fix load_unaligned_zeropad build failure string: Convert helpers selftest to KUnit string: Convert selftest to KUnit sh: Fix build with CONFIG_UBSAN=y compiler.h: Explain how __is_constexpr() works overflow: Allow non-type arg to type_max() and type_min() VMCI: Fix possible memcpy() run-time warning in vmci_datagram_invoke_guest_handler() lib/string_helpers: Add flags param to string_get_size() x86, relocs: Ignore relocations in .notes section objtool: Fix UNWIND_HINT_{SAVE,RESTORE} across basic blocks overflow: Use POD in check_shl_overflow() lib: stackinit: Adjust target string to 8 bytes for m68k sparc: vdso: Disable UBSAN instrumentation kernel.h: Move lib/cmdline.c prototypes to string.h leaking_addresses: Provide mechanism to scan binary files leaking_addresses: Ignore input device status lines leaking_addresses: Use File::Temp for /tmp files MAINTAINERS: Update LEAKING_ADDRESSES details fortify: Improve buffer overflow reporting fortify: Add KUnit tests for runtime overflows ... |
||
Linus Torvalds
|
0e33cf955f |
* Mitigate RFDS vulnerability
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmXvZgoACgkQaDWVMHDJ krC2Eg//aZKBp97/DSzRqXKDwJzVUr0sGJ9cii0gVT1sI+1U6ZZCh/roVH4xOT5/ HqtOOnQ+X0mwUx2VG3Yv2VPI7VW68sJ3/y9D8R4tnMEsyQ4CmDw96Pre3NyKr/Av jmW7SK94fOkpNFJOMk3zpk7GtRUlCsVkS1P61dOmMYduguhel/V20rWlx83BgnAY Rf/c3rBjqe8Ri3rzBP5icY/d6OgwoafuhME31DD/j6oKOh+EoQBvA4urj46yMTMX /mrK7hCm/wqwuOOvgGbo7sfZNBLCYy3SZ3EyF4beDERhPF1DaSvCwOULpGVJroqu SelFsKXAtEbYrDgsan+MYlx3bQv43q7PbHska1gjkH91plO4nAsssPr5VsusUKmT sq8jyBaauZb40oLOSgooL4RqAHrfs8q5695Ouwh/DB/XovMezUI1N/BkpGFmqpJI o2xH9P5q520pkB8pFhN9TbRuFSGe/dbWC24QTq1DUajo3M3RwcwX6ua9hoAKLtDF pCV5DNcVcXHD3Cxp0M5dQ5JEAiCnW+ZpUWgxPQamGDNW5PEvjDmFwql2uWw/qOuW lkheOIffq8ejUBQFbN8VXfIzzeeKQNFiIcViaqGITjIwhqdHAzVi28OuIGwtdh3g ywLzSC8yvyzgKrNBgtFMr3ucKN0FoPxpBro253xt2H7w8srXW64= =5V9t -----END PGP SIGNATURE----- Merge tag 'rfds-for-linus-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 RFDS mitigation from Dave Hansen: "RFDS is a CPU vulnerability that may allow a malicious userspace to infer stale register values from kernel space. Kernel registers can have all kinds of secrets in them so the mitigation is basically to wait until the kernel is about to return to userspace and has user values in the registers. At that point there is little chance of kernel secrets ending up in the registers and the microarchitectural state can be cleared. This leverages some recent robustness fixes for the existing MDS vulnerability. Both MDS and RFDS use the VERW instruction for mitigation" * tag 'rfds-for-linus-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: KVM/x86: Export RFDS_NO and RFDS_CLEAR to guests x86/rfds: Mitigate Register File Data Sampling (RFDS) Documentation/hw-vuln: Add documentation for RFDS x86/mmio: Disable KVM mitigation when X86_FEATURE_CLEAR_CPU_BUF is set |
||
Linus Torvalds
|
685d982112 |
Core x86 changes for v6.9:
- The biggest change is the rework of the percpu code, to support the 'Named Address Spaces' GCC feature, by Uros Bizjak: - This allows C code to access GS and FS segment relative memory via variables declared with such attributes, which allows the compiler to better optimize those accesses than the previous inline assembly code. - The series also includes a number of micro-optimizations for various percpu access methods, plus a number of cleanups of %gs accesses in assembly code. - These changes have been exposed to linux-next testing for the last ~5 months, with no known regressions in this area. - Fix/clean up __switch_to()'s broken but accidentally working handling of FPU switching - which also generates better code. - Propagate more RIP-relative addressing in assembly code, to generate slightly better code. - Rework the CPU mitigations Kconfig space to be less idiosyncratic, to make it easier for distros to follow & maintain these options. - Rework the x86 idle code to cure RCU violations and to clean up the logic. - Clean up the vDSO Makefile logic. - Misc cleanups and fixes. [ Please note that there's a higher number of merge commits in this branch (three) than is usual in x86 topic trees. This happened due to the long testing lifecycle of the percpu changes that involved 3 merge windows, which generated a longer history and various interactions with other core x86 changes that we felt better about to carry in a single branch. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmXvB0gRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1jUqRAAqnEQPiabF5acQlHrwviX+cjSobDlqtH5 9q2AQy9qaEHapzD0XMOxvFye6XIvehGOGxSPvk6CoviSxBND8rb56lvnsEZuLeBV Bo5QSIL2x42Zrvo11iPHwgXZfTIusU90sBuKDRFkYBAxY3HK2naMDZe8MAsYCUE9 nwgHF8DDc/NYiSOXV8kosWoWpNIkoK/STyH5bvTQZMqZcwyZ49AIeP1jGZb/prbC e/rbnlrq5Eu6brpM7xo9kELO0Vhd34urV14KrrIpdkmUKytW2KIsyvW8D6fqgDBj NSaQLLcz0pCXbhF+8Nqvdh/1coR4L7Ymt08P1rfEjCsQgb/2WnSAGUQuC5JoGzaj ngkbFcZllIbD9gNzMQ1n4Aw5TiO+l9zxCqPC/r58Uuvstr+K9QKlwnp2+B3Q73Ft rojIJ04NJL6lCHdDgwAjTTks+TD2PT/eBWsDfJ/1pnUWttmv9IjMpnXD5sbHxoiU 2RGGKnYbxXczYdq/ALYDWM6JXpfnJZcXL3jJi0IDcCSsb92xRvTANYFHnTfyzGfw EHkhbF4e4Vy9f6QOkSP3CvW5H26BmZS9DKG0J9Il5R3u2lKdfbb5vmtUmVTqHmAD Ulo5cWZjEznlWCAYSI/aIidmBsp9OAEvYd+X7Z5SBIgTfSqV7VWHGt0BfA1heiVv F/mednG0gGc= =3v4F -----END PGP SIGNATURE----- Merge tag 'x86-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core x86 updates from Ingo Molnar: - The biggest change is the rework of the percpu code, to support the 'Named Address Spaces' GCC feature, by Uros Bizjak: - This allows C code to access GS and FS segment relative memory via variables declared with such attributes, which allows the compiler to better optimize those accesses than the previous inline assembly code. - The series also includes a number of micro-optimizations for various percpu access methods, plus a number of cleanups of %gs accesses in assembly code. - These changes have been exposed to linux-next testing for the last ~5 months, with no known regressions in this area. - Fix/clean up __switch_to()'s broken but accidentally working handling of FPU switching - which also generates better code - Propagate more RIP-relative addressing in assembly code, to generate slightly better code - Rework the CPU mitigations Kconfig space to be less idiosyncratic, to make it easier for distros to follow & maintain these options - Rework the x86 idle code to cure RCU violations and to clean up the logic - Clean up the vDSO Makefile logic - Misc cleanups and fixes * tag 'x86-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits) x86/idle: Select idle routine only once x86/idle: Let prefer_mwait_c1_over_halt() return bool x86/idle: Cleanup idle_setup() x86/idle: Clean up idle selection x86/idle: Sanitize X86_BUG_AMD_E400 handling sched/idle: Conditionally handle tick broadcast in default_idle_call() x86: Increase brk randomness entropy for 64-bit systems x86/vdso: Move vDSO to mmap region x86/vdso/kbuild: Group non-standard build attributes and primary object file rules together x86/vdso: Fix rethunk patching for vdso-image-{32,64}.o x86/retpoline: Ensure default return thunk isn't used at runtime x86/vdso: Use CONFIG_COMPAT_32 to specify vdso32 x86/vdso: Use $(addprefix ) instead of $(foreach ) x86/vdso: Simplify obj-y addition x86/vdso: Consolidate targets and clean-files x86/bugs: Rename CONFIG_RETHUNK => CONFIG_MITIGATION_RETHUNK x86/bugs: Rename CONFIG_CPU_SRSO => CONFIG_MITIGATION_SRSO x86/bugs: Rename CONFIG_CPU_IBRS_ENTRY => CONFIG_MITIGATION_IBRS_ENTRY x86/bugs: Rename CONFIG_CPU_UNRET_ENTRY => CONFIG_MITIGATION_UNRET_ENTRY x86/bugs: Rename CONFIG_SLS => CONFIG_MITIGATION_SLS ... |
||
Linus Torvalds
|
fcc196579a |
Misc cleanups, including a large series from Thomas Gleixner to
cure Sparse warnings. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmXvAFQRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1hkDRAAwASVCQ88kiGqNQtHibXlK54mAFGsc0xv T8OPds15DUzoLg/y8lw0X0DHly6MdGXVmygybejNIw2BN4lhLjQ7f4Ria7rv7LDy FcI1jfvysEMyYRFHGRefb/GBFzuEfKoROwf+QylGmKz0ZK674gNMngsI9pwOBdbe wElq3IkHoNuTUfH9QA4BvqGam1n122nvVTop3g0PMHWzx9ky8hd/BEUjXFZhfINL zZk3fwUbER2QYbhHt+BN2GRbdf2BrKvqTkXpKxyXTdnpiqAo0CzBGKerZ62H82qG n737Nib1lrsfM5yDHySnau02aamRXaGvCJUd6gpac1ZmNpZMWhEOT/0Tr/Nj5ztF lUAvKqMZn/CwwQky1/XxD0LHegnve0G+syqQt/7x7o1ELdiwTzOWMCx016UeodzB yyHf3Xx9J8nt3snlrlZBaGEfegg9ePLu5Vir7iXjg3vrloUW8A+GZM62NVxF4HVV QWF80BfWf8zbLQ/OS1382t1shaioIe5pEXzIjcnyVIZCiiP2/5kP2O6P4XVbwVlo Ca5eEt8U1rtsLUZaCzI2ZRTQf/8SLMQWyaV+ZmkVwcVdFoARC31EgdE5wYYoZOf6 7Vl+rXd+rZCuTWk0ZgznCZEm75aaqukaQCBa2V8hIVociLFVzhg/Tjedv7s0CspA hNfxdN1LDZc= =0eJ7 -----END PGP SIGNATURE----- Merge tag 'x86-cleanups-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cleanups from Ingo Molnar: "Misc cleanups, including a large series from Thomas Gleixner to cure sparse warnings" * tag 'x86-cleanups-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/nmi: Drop unused declaration of proc_nmi_enabled() x86/callthunks: Use EXPORT_PER_CPU_SYMBOL_GPL() for per CPU variables x86/cpu: Provide a declaration for itlb_multihit_kvm_mitigation x86/cpu: Use EXPORT_PER_CPU_SYMBOL_GPL() for x86_spec_ctrl_current x86/uaccess: Add missing __force to casts in __access_ok() and valid_user_address() x86/percpu: Cure per CPU madness on UP smp: Consolidate smp_prepare_boot_cpu() x86/msr: Add missing __percpu annotations x86/msr: Prepare for including <linux/percpu.h> into <asm/msr.h> perf/x86/amd/uncore: Fix __percpu annotation x86/nmi: Remove an unnecessary IS_ENABLED(CONFIG_SMP) x86/apm_32: Remove dead function apm_get_battery_status() x86/insn-eval: Fix function param name in get_eff_addr_sib() |
||
Linus Torvalds
|
38b334fc76 |
- Add the x86 part of the SEV-SNP host support. This will allow the
kernel to be used as a KVM hypervisor capable of running SNP (Secure Nested Paging) guests. Roughly speaking, SEV-SNP is the ultimate goal of the AMD confidential computing side, providing the most comprehensive confidential computing environment up to date. This is the x86 part and there is a KVM part which did not get ready in time for the merge window so latter will be forthcoming in the next cycle. - Rework the early code's position-dependent SEV variable references in order to allow building the kernel with clang and -fPIE/-fPIC and -mcmodel=kernel - The usual set of fixes, cleanups and improvements all over the place -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmXvH0wACgkQEsHwGGHe VUrzmA//VS/n6dhHRnm/nAGngr4PeegkgV1OhyKYFfiZ272rT6P9QvblQrgcY0dc Ij1DOhEKlke51pTHvMOQ33B3P4Fuc0mx3dpCLY0up5V26kzQiKCjRKEkC4U1bcw8 W4GqMejaR89bE14bYibmwpSib9T/uVsV65eM3xf1iF5UvsnoUaTziymDoy+nb43a B1pdd5vcl4mBNqXeEvt0qjg+xkMLpWUI9tJDB8mbMl/cnIFGgMZzBaY8oktHSROK QpuUnKegOgp1RXpfLbNjmZ2Q4Rkk4MNazzDzWq3EIxaRjXL3Qp507ePK7yeA2qa0 J3jCBQc9E2j7lfrIkUgNIzOWhMAXM2YH5bvH6UrIcMi1qsWJYDmkp2MF1nUedjdf Wj16/pJbeEw1aKKIywJGwsmViSQju158vY3SzXG83U/A/Iz7zZRHFmC/ALoxZptY Bi7VhfcOSpz98PE3axnG8CvvxRDWMfzBr2FY1VmQbg6VBNo1Xl1aP/IH1I8iQNKg /laBYl/qP+1286TygF1lthYROb1lfEIJprgi2xfO6jVYUqPb7/zq2sm78qZRfm7l 25PN/oHnuidfVfI/H3hzcGubjOG9Zwra8WWYBB2EEmelf21rT0OLqq+eS4T6pxFb GNVfc0AzG77UmqbrpkAMuPqL7LrGaSee4NdU3hkEdSphlx1/YTo= =c1ps -----END PGP SIGNATURE----- Merge tag 'x86_sev_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 SEV updates from Borislav Petkov: - Add the x86 part of the SEV-SNP host support. This will allow the kernel to be used as a KVM hypervisor capable of running SNP (Secure Nested Paging) guests. Roughly speaking, SEV-SNP is the ultimate goal of the AMD confidential computing side, providing the most comprehensive confidential computing environment up to date. This is the x86 part and there is a KVM part which did not get ready in time for the merge window so latter will be forthcoming in the next cycle. - Rework the early code's position-dependent SEV variable references in order to allow building the kernel with clang and -fPIE/-fPIC and -mcmodel=kernel - The usual set of fixes, cleanups and improvements all over the place * tag 'x86_sev_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits) x86/sev: Disable KMSAN for memory encryption TUs x86/sev: Dump SEV_STATUS crypto: ccp - Have it depend on AMD_IOMMU iommu/amd: Fix failure return from snp_lookup_rmpentry() x86/sev: Fix position dependent variable references in startup code crypto: ccp: Make snp_range_list static x86/Kconfig: Remove CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT Documentation: virt: Fix up pre-formatted text block for SEV ioctls crypto: ccp: Add the SNP_SET_CONFIG command crypto: ccp: Add the SNP_COMMIT command crypto: ccp: Add the SNP_PLATFORM_STATUS command x86/cpufeatures: Enable/unmask SEV-SNP CPU feature KVM: SEV: Make AVIC backing, VMSA and VMCB memory allocation SNP safe crypto: ccp: Add panic notifier for SEV/SNP firmware shutdown on kdump iommu/amd: Clean up RMP entries for IOMMU pages during SNP shutdown crypto: ccp: Handle legacy SEV commands when SNP is enabled crypto: ccp: Handle non-volatile INIT_EX data when SNP is enabled crypto: ccp: Handle the legacy TMR allocation when SNP is enabled x86/sev: Introduce an SNP leaked pages list crypto: ccp: Provide an API to issue SEV and SNP commands ... |
||
Linus Torvalds
|
720c857907 |
Support for x86 Fast Return and Event Delivery (FRED):
FRED is a replacement for IDT event delivery on x86 and addresses most of the technical nightmares which IDT exposes: 1) Exception cause registers like CR2 need to be manually preserved in nested exception scenarios. 2) Hardware interrupt stack switching is suboptimal for nested exceptions as the interrupt stack mechanism rewinds the stack on each entry which requires a massive effort in the low level entry of #NMI code to handle this. 3) No hardware distinction between entry from kernel or from user which makes establishing kernel context more complex than it needs to be especially for unconditionally nestable exceptions like NMI. 4) NMI nesting caused by IRET unconditionally reenabling NMIs, which is a problem when the perf NMI takes a fault when collecting a stack trace. 5) Partial restore of ESP when returning to a 16-bit segment 6) Limitation of the vector space which can cause vector exhaustion on large systems. 7) Inability to differentiate NMI sources FRED addresses these shortcomings by: 1) An extended exception stack frame which the CPU uses to save exception cause registers. This ensures that the meta information for each exception is preserved on stack and avoids the extra complexity of preserving it in software. 2) Hardware interrupt stack switching is non-rewinding if a nested exception uses the currently interrupt stack. 3) The entry points for kernel and user context are separate and GS BASE handling which is required to establish kernel context for per CPU variable access is done in hardware. 4) NMIs are now nesting protected. They are only reenabled on the return from NMI. 5) FRED guarantees full restore of ESP 6) FRED does not put a limitation on the vector space by design because it uses a central entry points for kernel and user space and the CPUstores the entry type (exception, trap, interrupt, syscall) on the entry stack along with the vector number. The entry code has to demultiplex this information, but this removes the vector space restriction. The first hardware implementations will still have the current restricted vector space because lifting this limitation requires further changes to the local APIC. 7) FRED stores the vector number and meta information on stack which allows having more than one NMI vector in future hardware when the required local APIC changes are in place. The series implements the initial FRED support by: - Reworking the existing entry and IDT handling infrastructure to accomodate for the alternative entry mechanism. - Expanding the stack frame to accomodate for the extra 16 bytes FRED requires to store context and meta information - Providing FRED specific C entry points for events which have information pushed to the extended stack frame, e.g. #PF and #DB. - Providing FRED specific C entry points for #NMI and #MCE - Implementing the FRED specific ASM entry points and the C code to demultiplex the events - Providing detection and initialization mechanisms and the necessary tweaks in context switching, GS BASE handling etc. The FRED integration aims for maximum code reuse vs. the existing IDT implementation to the extent possible and the deviation in hot paths like context switching are handled with alternatives to minimalize the impact. The low level entry and exit paths are seperate due to the extended stack frame and the hardware based GS BASE swichting and therefore have no impact on IDT based systems. It has been extensively tested on existing systems and on the FRED simulation and as of now there are know outstanding problems. -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmXuKPgTHHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYoWyUEACevJMHU+Ot9zqBPizSWxByM1uunHbp bjQXhaFeskd3mt7k7HU6GsPRSmC3q4lliP1Y9ypfbU0DvYSI2h/PhMWizjhmot2y nIvFpl51r/NsI+JHx1oXcFetz0eGHEqBui/4YQ/swgOCMymYgfqgHhazXTdldV3g KpH9/8W3AeGvw79uzXFH9tjBzTkbvywpam3v0LYNDJWTCuDkilyo8PjhsgRZD4x3 V9f1nLD7nSHZW8XLoktdJJ38bKwI2Lhao91NQ0ErwopekA4/9WphZEKsDpidUSXJ sn1O148oQ8X92IO2OaQje8XC5pLGr5GqQBGPWzRH56P/Vd3+WOwBxaFoU6Drxc5s tIe23ZjkVcpA8EEG7BQBZV1Un/NX7XaCCnMniOt0RauXw+1NaslX7t/tnUAh5F1V TWCH4D0I0oJ0qJ7kNliGn2BP3agYXOVg81xVEUjT6KfHcYU4ImUrwi+BkeNXuXtL Ch5ADnbYAcUjWLFnAmEmaRtfmfNGY5T7PeGFHW2RRkaOJ88v5g14Voo6gPJaDUPn wMQ0nLq1xN4xZWF6ZgfRqAhArvh20k38ZujRku5vXEqnhOugQ76TF2UYiFEwOXbQ 8jcM+yEBLGgBz7tGMwmIAml6kfxaFF1KPpdrtcPxNkGlbE6KTSuIolLx2YGUvlSU 6/O8nwZy49ckmQ== =Ib7w -----END PGP SIGNATURE----- Merge tag 'x86-fred-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 FRED support from Thomas Gleixner: "Support for x86 Fast Return and Event Delivery (FRED). FRED is a replacement for IDT event delivery on x86 and addresses most of the technical nightmares which IDT exposes: 1) Exception cause registers like CR2 need to be manually preserved in nested exception scenarios. 2) Hardware interrupt stack switching is suboptimal for nested exceptions as the interrupt stack mechanism rewinds the stack on each entry which requires a massive effort in the low level entry of #NMI code to handle this. 3) No hardware distinction between entry from kernel or from user which makes establishing kernel context more complex than it needs to be especially for unconditionally nestable exceptions like NMI. 4) NMI nesting caused by IRET unconditionally reenabling NMIs, which is a problem when the perf NMI takes a fault when collecting a stack trace. 5) Partial restore of ESP when returning to a 16-bit segment 6) Limitation of the vector space which can cause vector exhaustion on large systems. 7) Inability to differentiate NMI sources FRED addresses these shortcomings by: 1) An extended exception stack frame which the CPU uses to save exception cause registers. This ensures that the meta information for each exception is preserved on stack and avoids the extra complexity of preserving it in software. 2) Hardware interrupt stack switching is non-rewinding if a nested exception uses the currently interrupt stack. 3) The entry points for kernel and user context are separate and GS BASE handling which is required to establish kernel context for per CPU variable access is done in hardware. 4) NMIs are now nesting protected. They are only reenabled on the return from NMI. 5) FRED guarantees full restore of ESP 6) FRED does not put a limitation on the vector space by design because it uses a central entry points for kernel and user space and the CPUstores the entry type (exception, trap, interrupt, syscall) on the entry stack along with the vector number. The entry code has to demultiplex this information, but this removes the vector space restriction. The first hardware implementations will still have the current restricted vector space because lifting this limitation requires further changes to the local APIC. 7) FRED stores the vector number and meta information on stack which allows having more than one NMI vector in future hardware when the required local APIC changes are in place. The series implements the initial FRED support by: - Reworking the existing entry and IDT handling infrastructure to accomodate for the alternative entry mechanism. - Expanding the stack frame to accomodate for the extra 16 bytes FRED requires to store context and meta information - Providing FRED specific C entry points for events which have information pushed to the extended stack frame, e.g. #PF and #DB. - Providing FRED specific C entry points for #NMI and #MCE - Implementing the FRED specific ASM entry points and the C code to demultiplex the events - Providing detection and initialization mechanisms and the necessary tweaks in context switching, GS BASE handling etc. The FRED integration aims for maximum code reuse vs the existing IDT implementation to the extent possible and the deviation in hot paths like context switching are handled with alternatives to minimalize the impact. The low level entry and exit paths are seperate due to the extended stack frame and the hardware based GS BASE swichting and therefore have no impact on IDT based systems. It has been extensively tested on existing systems and on the FRED simulation and as of now there are no outstanding problems" * tag 'x86-fred-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits) x86/fred: Fix init_task thread stack pointer initialization MAINTAINERS: Add a maintainer entry for FRED x86/fred: Fix a build warning with allmodconfig due to 'inline' failing to inline properly x86/fred: Invoke FRED initialization code to enable FRED x86/fred: Add FRED initialization functions x86/syscall: Split IDT syscall setup code into idt_syscall_init() KVM: VMX: Call fred_entry_from_kvm() for IRQ/NMI handling x86/entry: Add fred_entry_from_kvm() for VMX to handle IRQ/NMI x86/entry/calling: Allow PUSH_AND_CLEAR_REGS being used beyond actual entry code x86/fred: Fixup fault on ERETU by jumping to fred_entrypoint_user x86/fred: Let ret_from_fork_asm() jmp to asm_fred_exit_user when FRED is enabled x86/traps: Add sysvec_install() to install a system interrupt handler x86/fred: FRED entry/exit and dispatch code x86/fred: Add a machine check entry stub for FRED x86/fred: Add a NMI entry stub for FRED x86/fred: Add a debug fault entry stub for FRED x86/idtentry: Incorporate definitions/declarations of the FRED entries x86/fred: Make exc_page_fault() work for FRED x86/fred: Allow single-step trap and NMI when starting a new task x86/fred: No ESPFIX needed when FRED is enabled ... |
||
Pawan Gupta
|
2a0180129d |
KVM/x86: Export RFDS_NO and RFDS_CLEAR to guests
Mitigation for RFDS requires RFDS_CLEAR capability which is enumerated by MSR_IA32_ARCH_CAPABILITIES bit 27. If the host has it set, export it to guests so that they can deploy the mitigation. RFDS_NO indicates that the system is not vulnerable to RFDS, export it to guests so that they don't deploy the mitigation unnecessarily. When the host is not affected by X86_BUG_RFDS, but has RFDS_NO=0, synthesize RFDS_NO to the guest. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> |
||
Paolo Bonzini
|
e9a2bba476 |
KVM Xen and pfncache changes for 6.9:
- Rip out the half-baked support for using gfn_to_pfn caches to manage pages that are "mapped" into guests via physical addresses. - Add support for using gfn_to_pfn caches with only a host virtual address, i.e. to bypass the "gfn" stage of the cache. The primary use case is overlay pages, where the guest may change the gfn used to reference the overlay page, but the backing hva+pfn remains the same. - Add an ioctl() to allow mapping Xen's shared_info page using an hva instead of a gpa, so that userspace doesn't need to reconfigure and invalidate the cache/mapping if the guest changes the gpa (but userspace keeps the resolved hva the same). - When possible, use a single host TSC value when computing the deadline for Xen timers in order to improve the accuracy of the timer emulation. - Inject pending upcall events when the vCPU software-enables its APIC to fix a bug where an upcall can be lost (and to follow Xen's behavior). - Fall back to the slow path instead of warning if "fast" IRQ delivery of Xen events fails, e.g. if the guest has aliased xAPIC IDs. - Extend gfn_to_pfn_cache's mutex to cover (de)activation (in addition to refresh), and drop a now-redundant acquisition of xen_lock (that was protecting the shared_info cache) to fix a deadlock due to recursively acquiring xen_lock. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmXrblYACgkQOlYIJqCj N/3K4Q/+KZ8lrnNXvdHNCQdosA5DDXpqUcRzhlTUp82fncpdJ0LqrSMzMots2Eh9 KC0jSPo8EkivF+Epug0+bpQBEaLXzTWhRcS1grePCDz2lBnxoHFSWjvaK2p14KlC LvxCJZjxyfLKHwKHpSndvO9hVFElCY3mvvE9KRcKeQAmrz1cz+DDMKelo1MuV8D+ GfymhYc+UXpY41+6hQdznx+WoGoXKRameo3iGYuBoJjvKOyl4Wxkx9WSXIxxxuqG kHxjiWTR/jF1ITJl6PeMrFcGl3cuGKM/UfTOM6W2h6Wi3mhLpXveoVLnqR1kipIj btSzSVHL7C4WTPwOcyhwPzap+dJmm31c6N0uPScT7r9yhs+q5BDj26vcVcyPZUHo efIwmsnO2eQvuw+f8C6QqWCPaxvw46N0zxzwgc5uA3jvAC93y0l4v+xlAQsC0wzV 0+BwU00cutH/3t3c/WPD5QcmRLH726VoFuTlaDufpoMU7gBVJ8rzjcusxR+5BKT+ GJcAgZxZhEgvnzmTKd4Ec/mt+xZ2Erd+kV3MKCHvDPyj8jqy8FQ4DAWKGBR+h3WR rqAs2k8NPHyh3i1a3FL1opmxEGsRS+Cnc6Bi77cj9DxTr22JkgDJEuFR+Ues1z6/ SpE889kt3w5zTo34+lNxNPlIKmO0ICwwhDL6pxJTWU7iWQnKypU= =GliW -----END PGP SIGNATURE----- Merge tag 'kvm-x86-xen-6.9' of https://github.com/kvm-x86/linux into HEAD KVM Xen and pfncache changes for 6.9: - Rip out the half-baked support for using gfn_to_pfn caches to manage pages that are "mapped" into guests via physical addresses. - Add support for using gfn_to_pfn caches with only a host virtual address, i.e. to bypass the "gfn" stage of the cache. The primary use case is overlay pages, where the guest may change the gfn used to reference the overlay page, but the backing hva+pfn remains the same. - Add an ioctl() to allow mapping Xen's shared_info page using an hva instead of a gpa, so that userspace doesn't need to reconfigure and invalidate the cache/mapping if the guest changes the gpa (but userspace keeps the resolved hva the same). - When possible, use a single host TSC value when computing the deadline for Xen timers in order to improve the accuracy of the timer emulation. - Inject pending upcall events when the vCPU software-enables its APIC to fix a bug where an upcall can be lost (and to follow Xen's behavior). - Fall back to the slow path instead of warning if "fast" IRQ delivery of Xen events fails, e.g. if the guest has aliased xAPIC IDs. - Extend gfn_to_pfn_cache's mutex to cover (de)activation (in addition to refresh), and drop a now-redundant acquisition of xen_lock (that was protecting the shared_info cache) to fix a deadlock due to recursively acquiring xen_lock. |
||
Paolo Bonzini
|
e9025cdd8c |
KVM x86 PMU changes for 6.9:
- Fix several bugs where KVM speciously prevents the guest from utilizing fixed counters and architectural event encodings based on whether or not guest CPUID reports support for the _architectural_ encoding. - Fix a variety of bugs in KVM's emulation of RDPMC, e.g. for "fast" reads, priority of VMX interception vs #GP, PMC types in architectural PMUs, etc. - Add a selftest to verify KVM correctly emulates RDMPC, counter availability, and a variety of other PMC-related behaviors that depend on guest CPUID, i.e. are difficult to validate via KVM-Unit-Tests. - Zero out PMU metadata on AMD if the virtual PMU is disabled to avoid wasting cycles, e.g. when checking if a PMC event needs to be synthesized when skipping an instruction. - Optimize triggering of emulated events, e.g. for "count instructions" events when skipping an instruction, which yields a ~10% performance improvement in VM-Exit microbenchmarks when a vPMU is exposed to the guest. - Tighten the check for "PMI in guest" to reduce false positives if an NMI arrives in the host while KVM is handling an IRQ VM-Exit. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmXrUFQACgkQOlYIJqCj N/11dhAAnr9e6mPmXvaH4YKcvOGgTmwIQdi5W4IBzGm27ErEb0Vyskx3UATRhRm+ gZyp3wNgEA9LeifICDNu4ypn7HZcl2VtRql6FYcB8Bcu8OiHfU8PhWL0/qrpY20e zffUj2tDweq2ft9Iks1SQJD0sxFkcXIcSKOffP7pRZJHFTKLltGORXwxzd9HJHPY nc4nERKegK2yH4A4gY6nZ0oV5L3OMUNHx815db5Y+HxXOIjBCjTQiNNd6mUdyX1N C5sIiElXLdvRTSDvirHfA32LqNwnajDGox4QKZkB3wszCxJ3kRd4OCkTEKMYKHxd KoKCJQnAdJFFW9xqbT8nNKXZ+hg2+ZQuoSaBuwKryf7jWi0e6a7jcV0OH+cQSZw7 UNudKhs3r4ambfvnFp2IVZlZREMDB+LAjo2So48Jn/JGCAzqte3XqwVKskn9pS9S qeauXCdOLioZALYtTBl8RM1rEY5mbwQrpPv9CzbeU09qQ/hpXV14W9GmbyeOZcI1 T1cYgEqlLuifRluwT/hxrY321+4noF116gSK1yb07x/sJU8/lhRooEk9V562066E qo6nIvc7Bv9gTGLwo6VReKSPcTT/6t3HwgPsRjqe+evso3EFN9f9hG+uPxtO6TUj pdPm3mkj2KfxDdJLf+Ys16gyGdiwI0ZImIkA0uLdM0zftNsrb4Y= =vayI -----END PGP SIGNATURE----- Merge tag 'kvm-x86-pmu-6.9' of https://github.com/kvm-x86/linux into HEAD KVM x86 PMU changes for 6.9: - Fix several bugs where KVM speciously prevents the guest from utilizing fixed counters and architectural event encodings based on whether or not guest CPUID reports support for the _architectural_ encoding. - Fix a variety of bugs in KVM's emulation of RDPMC, e.g. for "fast" reads, priority of VMX interception vs #GP, PMC types in architectural PMUs, etc. - Add a selftest to verify KVM correctly emulates RDMPC, counter availability, and a variety of other PMC-related behaviors that depend on guest CPUID, i.e. are difficult to validate via KVM-Unit-Tests. - Zero out PMU metadata on AMD if the virtual PMU is disabled to avoid wasting cycles, e.g. when checking if a PMC event needs to be synthesized when skipping an instruction. - Optimize triggering of emulated events, e.g. for "count instructions" events when skipping an instruction, which yields a ~10% performance improvement in VM-Exit microbenchmarks when a vPMU is exposed to the guest. - Tighten the check for "PMI in guest" to reduce false positives if an NMI arrives in the host while KVM is handling an IRQ VM-Exit. |
||
Paolo Bonzini
|
b00471a552 |
KVM VMX changes for 6.9:
- Fix a bug where KVM would report stale/bogus exit qualification information when exiting to userspace due to an unexpected VM-Exit while the CPU was vectoring an exception. - Add a VMX flag in /proc/cpuinfo to report 5-level EPT support. - Clean up the logic for massaging the passthrough MSR bitmaps when userspace changes its MSR filter. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmXrUh4ACgkQOlYIJqCj N/2CRA//VKa4KE8zgF3xM6Btyt2NegPgmGYVyhmMHTvARyHDIl5nURy++uXseb4f UXQLGcoGS+CIiaMohQFhCOjoNvv/2LR9P72qVV2WjQjFxVGBchybz8bjrqIDSSvY TuiPJApIfZtLryFFcowo8jLEBQv3JKgfgn9r2hBwVDcYP13wSz0Z4AWntHIqBxNa DW75wo7wnBFzy2RfUdtAgucpbmEihqSoKA+YjUT+0GRLBI7rWbFxdEKqe3BIM/7n 4NoJXbOmw7mlhTNumZYsF5sKiyOihBOdtUL1TDgKWjJScgmwG+KCSvrp5Ko4PZpo uyWWcIbskQ+cTO6dHDoIJTVPsCDxo3PgVJKG1T60CV68NavwxXCUGri1n1ZNyYH/ bXxEW7dTGHX0TDSt3dcyVOYdZFHbaIbqpu1EXlrzBm1hnruQ1C1uBQGHZ/X+Yo86 0qdq9SgXJ48tykr8BDruIHMy0Q8jbXxl67oXR0CdRjJGM+H9f+7RefnRN9wPYFhy n6Hl3kbezwCZb+8RO34Hq2CpKzNlKRHlJDhUq1ypd2vXPw8FDq1aQYKih0jAzyJQ yCdUueBJgo8OJtSL4HGEHvgkLHR4/XERgbCOxpSYNbqIjahAwNtbfHryUnJRWRb5 V3QczG/TtGfEpVblbRzn3Atbft4rM5a9Z3+s0siB6C2w8wyPmZg= =oJso -----END PGP SIGNATURE----- Merge tag 'kvm-x86-vmx-6.9' of https://github.com/kvm-x86/linux into HEAD KVM VMX changes for 6.9: - Fix a bug where KVM would report stale/bogus exit qualification information when exiting to userspace due to an unexpected VM-Exit while the CPU was vectoring an exception. - Add a VMX flag in /proc/cpuinfo to report 5-level EPT support. - Clean up the logic for massaging the passthrough MSR bitmaps when userspace changes its MSR filter. |
||
Paolo Bonzini
|
41ebae2ecd |
KVM x86 MMU changes for 6.9:
- Clean up code related to unprotecting shadow pages when retrying a guest instruction after failed #PF-induced emulation. - Zap TDP MMU roots at 4KiB granularity to minimize the delay in yielding if a reschedule is needed, e.g. if a high priority task needs to run. Because KVM doesn't support yielding in the middle of processing a zapped non-leaf SPTE, zapping at 1GiB granularity can result in multi-millisecond lag when attempting to schedule in a high priority. - Rework TDP MMU root unload, free, and alloc to run with mmu_lock held for read, e.g. to avoid serializing vCPUs when userspace deletes a memslot. - Allocate write-tracking metadata on-demand to avoid the memory overhead when running kernels built with KVMGT support (external write-tracking enabled), but for workloads that don't use nested virtualization (shadow paging) or KVMGT. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmXrTH4ACgkQOlYIJqCj N/1q3xAAh3wpUDzRfkNkgGUbulhuJmQ72PiaW3NRoMo/3Rowegsdgt1N3/ec+fcJ Awx0KUM8Cju8O2Zqp6NzKwUkddCni8dHmOa55NJQuK2M1OpnE0RjBB94n+AFJZki mm8wKSKNgjlVeJDG87+RLPnbaeEvqYPp22oNKJyAPsimTbxvmhIqtg8qdyujGPXA Jke7LXgtVGav+nEzXiLh86VU/agoBJc/zt+hiuLvamU5Y8so+zReqFbrDtvsgtpV ryvMbDZxcPXKrsBP+B7syqUAbODcmh/wkzOCZ4Tby5yurEaw1rwpZIH0BRKRgGx2 F2JqWayYsCOsrJ4DwQre8RfLMtbEKB2BBWkZlYyblAy0++1LcTP9pSk5YC5lSL71 5Oszql9DKi10Vq5IfR/ehsr6mHXFr3AB7C7QefiXpytGbObQs8/f/OxinxaEajcs ERBgh+rcQ5p3kfdiHzuQjn7y45J7z21CKVhka4iKJtTxypBK4ZvkDOVqHuHppb5O aw6rC5HR1EKhSW4jz7QWrDExtDZ2X5HeYl8TgfHncSSJRc7urKYcSCHhXJsB6BPs iQf0xbHaIOyH9jmoqLZjz0QZmXB9fydQ/zAlFVXZsrNHvomayVjqrpl8UFTMdhuI zll9ynfRRHMUkIi1YubUlmFMgBeqOXGkfBFh8QUH3+YiI7Cwzh4= =SgFo -----END PGP SIGNATURE----- Merge tag 'kvm-x86-mmu-6.9' of https://github.com/kvm-x86/linux into HEAD KVM x86 MMU changes for 6.9: - Clean up code related to unprotecting shadow pages when retrying a guest instruction after failed #PF-induced emulation. - Zap TDP MMU roots at 4KiB granularity to minimize the delay in yielding if a reschedule is needed, e.g. if a high priority task needs to run. Because KVM doesn't support yielding in the middle of processing a zapped non-leaf SPTE, zapping at 1GiB granularity can result in multi-millisecond lag when attempting to schedule in a high priority. - Rework TDP MMU root unload, free, and alloc to run with mmu_lock held for read, e.g. to avoid serializing vCPUs when userspace deletes a memslot. - Allocate write-tracking metadata on-demand to avoid the memory overhead when running kernels built with KVMGT support (external write-tracking enabled), but for workloads that don't use nested virtualization (shadow paging) or KVMGT. |
||
Paolo Bonzini
|
c9cd0beae9 |
KVM x86 misc changes for 6.9:
- Explicitly initialize a variety of on-stack variables in the emulator that triggered KMSAN false positives (though in fairness in KMSAN, it's comically difficult to see that the uninitialized memory is never truly consumed). - Fix the deubgregs ABI for 32-bit KVM, and clean up code related to reading DR6 and DR7. - Rework the "force immediate exit" code so that vendor code ultimately decides how and when to force the exit. This allows VMX to further optimize handling preemption timer exits, and allows SVM to avoid sending a duplicate IPI (SVM also has a need to force an exit). - Fix a long-standing bug where kvm_has_noapic_vcpu could be left elevated if vCPU creation ultimately failed, and add WARN to guard against similar bugs. - Provide a dedicated arch hook for checking if a different vCPU was in-kernel (for directed yield), and simplify the logic for checking if the currently loaded vCPU is in-kernel. - Misc cleanups and fixes. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmXrRjQACgkQOlYIJqCj N/2Dzw//b+ptSBAl1kGBRmk/DqsX7J9ZkQYCQOTeh1vXiUM+XRTSQoArN0Oo1roy 3wcEnQ0beVw7jMuzZ8UUuTfU8WUMja/kwltnqXYNHwLnb6yH0I/BIengXWdUdAMc FmgPZ4qJR2IzKYzvDsc3eEQ515O8UHWakyVDnmLBtiakAeBcUTYceHpEEPpzE5y5 ODASTQKM9o/h8R8JwKFTJ8/mrOLNcsu5SycwFdnmubLJCrNWtJWTijA6y1lh6shn hbEJex+ESoC2v8p7IP53u1SGJubVlPajt+RkYJtlEI3WVsevp024eYcF4nb1OjXi qS2Y3W7DQGWvyCBoSzoMY+9nRMgyOOpHYetdiz+9oZOmnjiYWY0ku59U7Gv+Aotj AUbCn4Ry/OpqsuZ7Oo7i3IT8R7uzsTeNNdxhYBn1OQquBEZ0KBYXlZkGfTk9K0t0 Fhka/5Zu6fBlg5J+zCyaXUGmsGWBo/9HxsC5z1JuKo8fatro5qyqYE5KiM01dkqc 6FET6gL+fFprC5c67JGRPdEtk6F9Emb+6oiTTA8/8q8JQQAKiJKk95Nlq7KzPfVS A5RQPTuTJ7acE/5CY4zB1DdxCjqgnonBEA2ULnA/J10Rk8orHJRnGJcEwKEyDrZh HpsxIIqt++i8KffORpCym6zSAVYuQjn1mu7MGth+zuCqhcEpBfc= =GX0O -----END PGP SIGNATURE----- Merge tag 'kvm-x86-misc-6.9' of https://github.com/kvm-x86/linux into HEAD KVM x86 misc changes for 6.9: - Explicitly initialize a variety of on-stack variables in the emulator that triggered KMSAN false positives (though in fairness in KMSAN, it's comically difficult to see that the uninitialized memory is never truly consumed). - Fix the deubgregs ABI for 32-bit KVM, and clean up code related to reading DR6 and DR7. - Rework the "force immediate exit" code so that vendor code ultimately decides how and when to force the exit. This allows VMX to further optimize handling preemption timer exits, and allows SVM to avoid sending a duplicate IPI (SVM also has a need to force an exit). - Fix a long-standing bug where kvm_has_noapic_vcpu could be left elevated if vCPU creation ultimately failed, and add WARN to guard against similar bugs. - Provide a dedicated arch hook for checking if a different vCPU was in-kernel (for directed yield), and simplify the logic for checking if the currently loaded vCPU is in-kernel. - Misc cleanups and fixes. |
||
Paolo Bonzini
|
961e2bfcf3 |
KVM/arm64 updates for 6.9
- Infrastructure for building KVM's trap configuration based on the architectural features (or lack thereof) advertised in the VM's ID registers - Support for mapping vfio-pci BARs as Normal-NC (vaguely similar to x86's WC) at stage-2, improving the performance of interacting with assigned devices that can tolerate it - Conversion of KVM's representation of LPIs to an xarray, utilized to address serialization some of the serialization on the LPI injection path - Support for _architectural_ VHE-only systems, advertised through the absence of FEAT_E2H0 in the CPU's ID register - Miscellaneous cleanups, fixes, and spelling corrections to KVM and selftests -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQSNXHjWXuzMZutrKNKivnWIJHzdFgUCZepBjgAKCRCivnWIJHzd FnngAP93VxjCkJ+5qSmYpFNG6r0ECVIbLHFQ59nKn0+GgvbPEgEAwt8svdLdW06h njFTpdzvl4Po+aD/V9xHgqVz3kVvZwE= =1FbW -----END PGP SIGNATURE----- Merge tag 'kvmarm-6.9' of https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 updates for 6.9 - Infrastructure for building KVM's trap configuration based on the architectural features (or lack thereof) advertised in the VM's ID registers - Support for mapping vfio-pci BARs as Normal-NC (vaguely similar to x86's WC) at stage-2, improving the performance of interacting with assigned devices that can tolerate it - Conversion of KVM's representation of LPIs to an xarray, utilized to address serialization some of the serialization on the LPI injection path - Support for _architectural_ VHE-only systems, advertised through the absence of FEAT_E2H0 in the CPU's ID register - Miscellaneous cleanups, fixes, and spelling corrections to KVM and selftests |
||
Paolo Bonzini
|
233d0bc4d8 |
LoongArch KVM changes for v6.9
1. Set reserved bits as zero in CPUCFG. 2. Start SW timer only when vcpu is blocking. 3. Do not restart SW timer when it is expired. 4. Remove unnecessary CSR register saving during enter guest. -----BEGIN PGP SIGNATURE----- iQJKBAABCAA0FiEEzOlt8mkP+tbeiYy5AoYrw/LiJnoFAmXoeWIWHGNoZW5odWFj YWlAa2VybmVsLm9yZwAKCRAChivD8uImehb3D/9C5IrdyU/2f3fEUuuXO0a2ZS1p l2OT+yr7C6/jATokGcd+53CF8MzYawzuAT3tSXYyoqAxRu0HUkvuS1oA/eFM4EwV iIoUC3jnqcsQ5LCPt6yt+Tzgug64Xm5F4btYWIpmXgCJWx/VVG6+z3JarXAfA2it vgVMGgrrfHt68sEsenNFNgiJ5tCCubjR7XFwjM8rsL7AzUDdmXpF7gFyH2Ufgosi a5CxcPPauO1y5ZCGU4JU9QvxnVqW1kt/TRZIGqqGfULtlBSoZbD9zP3OcCQkL+ai SPNxvU5I+BeX6honpmO6aR/F1EphQhRji3ZKxI8UBo4aJD5+FtMG/YOEPI+ZAS0/ JPuWpDqJH46SN3jfKTQay8jXc+mcnOYXJ9Yrixd4UCf66WJit/+BOma/wP638u2j RUzm1kqhNGad6QiDDtSjISM6sg6FozAGc/KhCkWAhV+lHLnfkXtaf3S+GIu5OiWz ETCKlmIGiy0y774+iftlD7RDRGmtrC4cx5ibl7cKKi62Y5vgujCdDofAyYC+D5cW puaIuHOx1hWtPRT9p1WfUL310ED+Qj3N2pDDcJcqdCIiRRZ5l/hxGS7V687a30WV GcegEqh19CjI9KDat4E1ld4jUHJxaFrw3pr2z3SP7cW3IgdngPJL57M0M2jSazaQ 479xZPJ/i4xhJaKACg== =8HOW -----END PGP SIGNATURE----- Merge tag 'loongarch-kvm-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD LoongArch KVM changes for v6.9 * Set reserved bits as zero in CPUCFG. * Start SW timer only when vcpu is blocking. * Do not restart SW timer when it is expired. * Remove unnecessary CSR register saving during enter guest. |
||
Linus Torvalds
|
137e0ec05a |
KVM GUEST_MEMFD fixes for 6.8:
- Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY to avoid creating an inconsistent ABI (KVM_MEM_GUEST_MEMFD is not writable from userspace, so there would be no way to write to a read-only guest_memfd). - Update documentation for KVM_SW_PROTECTED_VM to make it abundantly clear that such VMs are purely for development and testing. - Limit KVM_SW_PROTECTED_VM guests to the TDP MMU, as the long term plan is to support confidential VMs with deterministic private memory (SNP and TDX) only in the TDP MMU. - Fix a bug in a GUEST_MEMFD dirty logging test that caused false passes. x86 fixes: - Fix missing marking of a guest page as dirty when emulating an atomic access. - Check for mmu_notifier invalidation events before faulting in the pfn, and before acquiring mmu_lock, to avoid unnecessary work and lock contention with preemptible kernels (including CONFIG_PREEMPT_DYNAMIC in non-preemptible mode). - Disable AMD DebugSwap by default, it breaks VMSA signing and will be re-enabled with a better VM creation API in 6.10. - Do the cache flush of converted pages in svm_register_enc_region() before dropping kvm->lock, to avoid a race with unregistering of the same region and the consequent use-after-free issue. -----BEGIN PGP SIGNATURE----- iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmXskdYUHHBib256aW5p QHJlZGhhdC5jb20ACgkQv/vSX3jHroN1TAf/SUGf4QuYG7nnfgWDR+goFO6Gx7NE pJr3kAwv6d2f+qTlURfGjnX929pgZDLgoTkXTNeZquN6LjgownxMjBIpymVobvAD AKvqJS/ECpryuehXbeqlxJxJn+TrxJ5r4QeNILMHc3AOZoiUqM6xl3zFfXWDNWVo IazwT8P3d8wxiHAxv1eG6OVWHxbcg31068FVKRX3f/bWPbVwROJrPkCopmz2BJvU 6KYdYcn2rkpDTEM3ouDC/6gxJ9vpSY3+nW7Q7dNtGtOH2+BddfSA6I0rphCQWCNs uXOxd5bDrC+KmkiULTPostuvwBgIm1k9wC2kW9A4P2VEf6Ay+ZHEdAOBJQ== =+MT/ -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm fixes from Paolo Bonzini: "KVM GUEST_MEMFD fixes for 6.8: - Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY to avoid creating an inconsistent ABI (KVM_MEM_GUEST_MEMFD is not writable from userspace, so there would be no way to write to a read-only guest_memfd). - Update documentation for KVM_SW_PROTECTED_VM to make it abundantly clear that such VMs are purely for development and testing. - Limit KVM_SW_PROTECTED_VM guests to the TDP MMU, as the long term plan is to support confidential VMs with deterministic private memory (SNP and TDX) only in the TDP MMU. - Fix a bug in a GUEST_MEMFD dirty logging test that caused false passes. x86 fixes: - Fix missing marking of a guest page as dirty when emulating an atomic access. - Check for mmu_notifier invalidation events before faulting in the pfn, and before acquiring mmu_lock, to avoid unnecessary work and lock contention with preemptible kernels (including CONFIG_PREEMPT_DYNAMIC in non-preemptible mode). - Disable AMD DebugSwap by default, it breaks VMSA signing and will be re-enabled with a better VM creation API in 6.10. - Do the cache flush of converted pages in svm_register_enc_region() before dropping kvm->lock, to avoid a race with unregistering of the same region and the consequent use-after-free issue" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: SEV: disable SEV-ES DebugSwap by default KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing KVM: SVM: Flush pages under kvm->lock to fix UAF in svm_register_enc_region() KVM: selftests: Add a testcase to verify GUEST_MEMFD and READONLY are exclusive KVM: selftests: Create GUEST_MEMFD for relevant invalid flags testcases KVM: x86/mmu: Restrict KVM_SW_PROTECTED_VM to the TDP MMU KVM: x86: Update KVM_SW_PROTECTED_VM docs to make it clear they're a WIP KVM: Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY KVM: x86: Mark target gfn of emulated atomic instruction as dirty |
||
Paolo Bonzini
|
7d8942d8e7 |
KVM GUEST_MEMFD fixes for 6.8:
- Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY to avoid creating ABI that KVM can't sanely support. - Update documentation for KVM_SW_PROTECTED_VM to make it abundantly clear that such VMs are purely a development and testing vehicle, and come with zero guarantees. - Limit KVM_SW_PROTECTED_VM guests to the TDP MMU, as the long term plan is to support confidential VMs with deterministic private memory (SNP and TDX) only in the TDP MMU. - Fix a bug in a GUEST_MEMFD negative test that resulted in false passes when verifying that KVM_MEM_GUEST_MEMFD memslots can't be dirty logged. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmXZB/8ACgkQOlYIJqCj N/3XlQ//RIsvqr38k7kELSKhCMyWgF4J57itABrHpMqAZu3gaAo5sETX8AGcHEe5 mxmquxyNQSf4cthhWy1kzxjGCy6+fk+Z0Z7wzfz0Yd5D+FI6vpo3HhkjovLb2gpt kSrHuhJyuj2vkftNvdaz0nHX1QalVyIEnXnR3oqTmxUUsg6lp1x/zr5SP0KBXjo8 ZzJtyFd0fkRXWpA792T7XPRBWrzPV31HYZBLX8sPlYmJATcbIx9rYSThgCN6XuVN bfE6wATsC+mwv5BpCoDFpCKmFcqSqamag9NGe5qE5mOby5DQGYTCRMCQB8YXXBR0 97ppaY9ZJV4nOVjrYJn6IMOSMVNfoG7nTRFfcd0eFP4tlPEgHwGr5BGDaBtQPkrd KcgWJw8nS02eCA2iOE+FtCXvGJwKhTTjQ45w7rU4EcfUk603L5J4GO1ddmjMhPcP upGGcWDK9vCGrSUFTm8pyWp/NKRJPvAQEiQd/BweSk9+isQHTX2RYCQgPAQnwlTS wTg7ZPNSLoUkRYmd6r+TUT32ELJGNc8GLftMnxIwweq6V7AgNMi0HE60eMovuBNO 7DAWWzfBEZmJv+0mNNZPGXczHVv4YvMWysRdKkhztBc3+sO7P3AL1zWIDlm5qwoG LpFeeI3qo3o5ZNaqGzkSop2pUUGNGpWCH46WmP0AG7RpzW/Natw= =M0td -----END PGP SIGNATURE----- Merge tag 'kvm-x86-guest_memfd_fixes-6.8' of https://github.com/kvm-x86/linux into HEAD KVM GUEST_MEMFD fixes for 6.8: - Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY to avoid creating ABI that KVM can't sanely support. - Update documentation for KVM_SW_PROTECTED_VM to make it abundantly clear that such VMs are purely a development and testing vehicle, and come with zero guarantees. - Limit KVM_SW_PROTECTED_VM guests to the TDP MMU, as the long term plan is to support confidential VMs with deterministic private memory (SNP and TDX) only in the TDP MMU. - Fix a bug in a GUEST_MEMFD negative test that resulted in false passes when verifying that KVM_MEM_GUEST_MEMFD memslots can't be dirty logged. |
||
Paolo Bonzini
|
5abf6dceb0 |
SEV: disable SEV-ES DebugSwap by default
The DebugSwap feature of SEV-ES provides a way for confidential guests to use
data breakpoints. However, because the status of the DebugSwap feature is
recorded in the VMSA, enabling it by default invalidates the attestation
signatures. In 6.10 we will introduce a new API to create SEV VMs that
will allow enabling DebugSwap based on what the user tells KVM to do.
Contextually, we will change the legacy KVM_SEV_ES_INIT API to never
enable DebugSwap.
For compatibility with kernels that pre-date the introduction of DebugSwap,
as well as with those where KVM_SEV_ES_INIT will never enable it, do not enable
the feature by default. If anybody wants to use it, for now they can enable
the sev_es_debug_swap_enabled module parameter, but this will result in a
warning.
Fixes:
|
||
Paolo Bonzini
|
39fee313fd |
Merge tag 'kvm-x86-guest_memfd_fixes-6.8' of https://github.com/kvm-x86/linux into HEAD
KVM GUEST_MEMFD fixes for 6.8: - Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY to avoid creating ABI that KVM can't sanely support. - Update documentation for KVM_SW_PROTECTED_VM to make it abundantly clear that such VMs are purely a development and testing vehicle, and come with zero guarantees. - Limit KVM_SW_PROTECTED_VM guests to the TDP MMU, as the long term plan is to support confidential VMs with deterministic private memory (SNP and TDX) only in the TDP MMU. - Fix a bug in a GUEST_MEMFD negative test that resulted in false passes when verifying that KVM_MEM_GUEST_MEMFD memslots can't be dirty logged. |
||
Paolo Bonzini
|
1b6c146df5 |
Merge tag 'kvm-x86-fixes-6.8-2' of https://github.com/kvm-x86/linux into HEAD
KVM x86 fixes for 6.8, round 2: - When emulating an atomic access, mark the gfn as dirty in the memslot to fix a bug where KVM could fail to mark the slot as dirty during live migration, ultimately resulting in guest data corruption due to a dirty page not being re-copied from the source to the target. - Check for mmu_notifier invalidation events before faulting in the pfn, and before acquiring mmu_lock, to avoid unnecessary work and lock contention. Contending mmu_lock is especially problematic on preemptible kernels, as KVM may yield mmu_lock in response to the contention, which severely degrades overall performance due to vCPUs making it difficult for the task that triggered invalidation to make forward progress. Note, due to another kernel bug, this fix isn't limited to preemtible kernels, as any kernel built with CONFIG_PREEMPT_DYNAMIC=y will yield contended rwlocks and spinlocks. https://lore.kernel.org/all/20240110214723.695930-1-seanjc@google.com |
||
Peter Xu
|
e72c7c2b88 |
mm/treewide: drop pXd_large()
They're not used anymore, drop all of them. Link: https://lkml.kernel.org/r/20240305043750.93762-10-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Peter Xu
|
0a845e0f63 |
mm/treewide: replace pud_large() with pud_leaf()
pud_large() is always defined as pud_leaf(). Merge their usages. Chose pud_leaf() because pud_leaf() is a global API, while pud_large() is not. Link: https://lkml.kernel.org/r/20240305043750.93762-9-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Peter Xu
|
2f709f7bfd |
mm/treewide: replace pmd_large() with pmd_leaf()
pmd_large() is always defined as pmd_leaf(). Merge their usages. Chose pmd_leaf() because pmd_leaf() is a global API, while pmd_large() is not. Link: https://lkml.kernel.org/r/20240305043750.93762-8-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Vitaly Kuznetsov
|
4736d85f0d |
KVM: x86: Use actual kvm_cpuid.base for clearing KVM_FEATURE_PV_UNHALT
Commit |
||
Vitaly Kuznetsov
|
92e82cf632 |
KVM: x86: Introduce __kvm_get_hypervisor_cpuid() helper
Similar to kvm_find_kvm_cpuid_features()/__kvm_find_kvm_cpuid_features(), introduce a helper to search for the specific hypervisor signature in any struct kvm_cpuid_entry2 array, not only in vcpu->arch.cpuid_entries. No functional change intended. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20240228101837.93642-2-vkuznets@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
David Woodhouse
|
7a36d68065 |
KVM: x86/xen: fix recursive deadlock in timer injection
The fast-path timer delivery introduced a recursive locking deadlock
when userspace configures a timer which has already expired and is
delivered immediately. The call to kvm_xen_inject_timer_irqs() can
call to kvm_xen_set_evtchn() which may take kvm->arch.xen.xen_lock,
which is already held in kvm_xen_vcpu_get_attr().
============================================
WARNING: possible recursive locking detected
6.8.0-smp--5e10b4d51d77-drs #232 Tainted: G O
--------------------------------------------
xen_shinfo_test/250013 is trying to acquire lock:
ffff938c9930cc30 (&kvm->arch.xen.xen_lock){+.+.}-{3:3}, at: kvm_xen_set_evtchn+0x74/0x170 [kvm]
but task is already holding lock:
ffff938c9930cc30 (&kvm->arch.xen.xen_lock){+.+.}-{3:3}, at: kvm_xen_vcpu_get_attr+0x38/0x250 [kvm]
Now that the gfn_to_pfn_cache has its own self-sufficient locking, its
callers no longer need to ensure serialization, so just stop taking
kvm->arch.xen.xen_lock from kvm_xen_set_evtchn().
Fixes:
|
||
David Woodhouse
|
66e3cf729b |
KVM: x86/xen: remove WARN_ON_ONCE() with false positives in evtchn delivery
The kvm_xen_inject_vcpu_vector() function has a comment saying "the fast
version will always work for physical unicast", justifying its use of
kvm_irq_delivery_to_apic_fast() and the WARN_ON_ONCE() when that fails.
In fact that assumption isn't true if X2APIC isn't in use by the guest
and there is (8-bit x)APIC ID aliasing. A single "unicast" destination
APIC ID *may* then be delivered to multiple vCPUs. Remove the warning,
and in fact it might as well just call kvm_irq_delivery_to_apic().
Reported-by: Michal Luczaj <mhal@rbox.co>
Fixes:
|
||
David Woodhouse
|
8e62bf2bfa |
KVM: x86/xen: inject vCPU upcall vector when local APIC is enabled
Linux guests since commit |
||
David Woodhouse
|
451a707813 |
KVM: x86/xen: improve accuracy of Xen timers
A test program such as http://david.woodhou.se/timerlat.c confirms user
reports that timers are increasingly inaccurate as the lifetime of a
guest increases. Reporting the actual delay observed when asking for
100µs of sleep, it starts off OK on a newly-launched guest but gets
worse over time, giving incorrect sleep times:
root@ip-10-0-193-21:~# ./timerlat -c -n 5
00000000 latency 103243/100000 (3.2430%)
00000001 latency 103243/100000 (3.2430%)
00000002 latency 103242/100000 (3.2420%)
00000003 latency 103245/100000 (3.2450%)
00000004 latency 103245/100000 (3.2450%)
The biggest problem is that get_kvmclock_ns() returns inaccurate values
when the guest TSC is scaled. The guest sees a TSC value scaled from the
host TSC by a mul/shift conversion (hopefully done in hardware). The
guest then converts that guest TSC value into nanoseconds using the
mul/shift conversion given to it by the KVM pvclock information.
But get_kvmclock_ns() performs only a single conversion directly from
host TSC to nanoseconds, giving a different result. A test program at
http://david.woodhou.se/tsdrift.c demonstrates the cumulative error
over a day.
It's non-trivial to fix get_kvmclock_ns(), although I'll come back to
that. The actual guest hv_clock is per-CPU, and *theoretically* each
vCPU could be running at a *different* frequency. But this patch is
needed anyway because...
The other issue with Xen timers was that the code would snapshot the
host CLOCK_MONOTONIC at some point in time, and then... after a few
interrupts may have occurred, some preemption perhaps... would also read
the guest's kvmclock. Then it would proceed under the false assumption
that those two happened at the *same* time. Any time which *actually*
elapsed between reading the two clocks was introduced as inaccuracies
in the time at which the timer fired.
Fix it to use a variant of kvm_get_time_and_clockread(), which reads the
host TSC just *once*, then use the returned TSC value to calculate the
kvmclock (making sure to do that the way the guest would instead of
making the same mistake get_kvmclock_ns() does).
Sadly, hrtimers based on CLOCK_MONOTONIC_RAW are not supported, so Xen
timers still have to use CLOCK_MONOTONIC. In practice the difference
between the two won't matter over the timescales involved, as the
*absolute* values don't matter; just the delta.
This does mean a new variant of kvm_get_time_and_clockread() is needed;
called kvm_get_monotonic_and_clockread() because that's what it does.
Fixes:
|
||
Thomas Gleixner
|
65efc4dc12 |
x86/cpu: Provide a declaration for itlb_multihit_kvm_mitigation
Sparse complains rightfully about the missing declaration which has been placed sloppily into the usage site: bugs.c:2223:6: sparse: warning: symbol 'itlb_multihit_kvm_mitigation' was not declared. Should it be static? Add it to <asm/spec-ctrl.h> where it belongs and remove the one in the KVM code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240304005104.787173239@linutronix.de |
||
Sean Christopherson
|
259720c37d |
KVM: VMX: Combine "check" and "get" APIs for passthrough MSR lookups
Combine possible_passthrough_msr_slot() and is_valid_passthrough_msr() into a single function, vmx_get_passthrough_msr_slot(), and have the combined helper return the slot on success, using a negative value to indicate "failure". Combining the operations avoids iterating over the array of passthrough MSRs twice for relevant MSRs. Suggested-by: Dongli Zhang <dongli.zhang@oracle.com> Reviewed-by: Dongli Zhang <dongli.zhang@oracle.com> Link: https://lore.kernel.org/r/20240223202104.3330974-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Andrei Vagin
|
a364c014a2 |
kvm/x86: allocate the write-tracking metadata on-demand
The write-track is used externally only by the gpu/drm/i915 driver. Currently, it is always enabled, if a kernel has been compiled with this driver. Enabling the write-track mechanism adds a two-byte overhead per page across all memory slots. It isn't significant for regular VMs. However in gVisor, where the entire process virtual address space is mapped into the VM, even with a 39-bit address space, the overhead amounts to 256MB. Rework the write-tracking mechanism to enable it on-demand in kvm_page_track_register_notifier. Here is Sean's comment about the locking scheme: The only potential hiccup would be if taking slots_arch_lock would deadlock, but it should be impossible for slots_arch_lock to be taken in any other path that involves VFIO and/or KVMGT *and* can be coincident. Except for kvm_arch_destroy_vm() (which deletes KVM's internal memslots), slots_arch_lock is taken only through KVM ioctls(), and the caller of kvm_page_track_register_notifier() *must* hold a reference to the VM. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Zhenyu Wang <zhenyuw@linux.intel.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Andrei Vagin <avagin@google.com> Link: https://lore.kernel.org/r/20240213192340.2023366-1-avagin@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Dongli Zhang
|
bab22040d7 |
KVM: VMX: return early if msr_bitmap is not supported
The vmx_msr_filter_changed() may directly/indirectly calls only vmx_enable_intercept_for_msr() or vmx_disable_intercept_for_msr(). Those two functions may exit immediately if !cpu_has_vmx_msr_bitmap(). vmx_msr_filter_changed() -> vmx_disable_intercept_for_msr() -> pt_update_intercept_for_msr() -> vmx_set_intercept_for_msr() -> vmx_enable_intercept_for_msr() -> vmx_disable_intercept_for_msr() Therefore, we exit early if !cpu_has_vmx_msr_bitmap(). Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Link: https://lore.kernel.org/r/20240223202104.3330974-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Dongli Zhang
|
8e24eeedfd |
KVM: VMX: fix comment to add LBR to passthrough MSRs
According to the is_valid_passthrough_msr(), the LBR MSRs are also
passthrough MSRs, since the commit
|
||
Like Xu
|
812d432373 |
KVM: x86/pmu: Explicitly check NMI from guest to reducee false positives
Explicitly check that the source of external interrupt is indeed an NMI
in kvm_arch_pmi_in_guest(), which reduces perf-kvm false positive samples
(host samples labelled as guest samples) generated by perf/core NMI mode
if an NMI arrives after VM-Exit, but before kvm_after_interrupt():
# test: perf-record + cpu-cycles:HP (which collects host-only precise samples)
# Symbol Overhead sys usr guest sys guest usr
# ....................................... ........ ........ ........ ......... .........
#
# Before:
[g] entry_SYSCALL_64 24.63% 0.00% 0.00% 24.63% 0.00%
[g] syscall_return_via_sysret 23.23% 0.00% 0.00% 23.23% 0.00%
[g] files_lookup_fd_raw 6.35% 0.00% 0.00% 6.35% 0.00%
# After:
[k] perf_adjust_freq_unthr_context 57.23% 57.23% 0.00% 0.00% 0.00%
[k] __vmx_vcpu_run 4.09% 4.09% 0.00% 0.00% 0.00%
[k] vmx_update_host_rsp 3.17% 3.17% 0.00% 0.00% 0.00%
In the above case, perf records the samples labelled '[g]', the RIPs behind
the weird samples are actually being queried by perf_instruction_pointer()
after determining whether it's in GUEST state or not, and here's the issue:
If VM-Exit is caused by a non-NMI interrupt (such as hrtimer_interrupt) and
at least one PMU counter is enabled on host, the kvm_arch_pmi_in_guest()
will remain true (KVM_HANDLING_IRQ is set) until kvm_before_interrupt().
During this window, if a PMI occurs on host (since the KVM instructions on
host are being executed), the control flow, with the help of the host NMI
context, will be transferred to perf/core to generate performance samples,
thus perf_instruction_pointer() and perf_guest_get_ip() is called.
Since kvm_arch_pmi_in_guest() only checks if there is an interrupt, it may
cause perf/core to mistakenly assume that the source RIP of the host NMI
belongs to the guest world and use perf_guest_get_ip() to get the RIP of
a vCPU that has already exited by a non-NMI interrupt.
Error samples are recorded and presented to the end-user via perf-report.
Such false positive samples could be eliminated by explicitly determining
if the exit reason is KVM_HANDLING_NMI.
Note that when VM-exit is indeed triggered by PMI and before HANDLING_NMI
is cleared, it's also still possible that another PMI is generated on host.
Also for perf/core timer mode, the false positives are still possible since
those non-NMI sources of interrupts are not always being used by perf/core.
For events that are host-only, perf/core can and should eliminate false
positives by checking event->attr.exclude_guest, i.e. events that are
configured to exclude KVM guests should never fire in the guest.
Events that are configured to count host and guest are trickier, perhaps
impossible to handle with 100% accuracy? And regardless of what accuracy
is provided by perf/core, improving KVM's accuracy is cheap and easy, with
no real downsides.
Fixes:
|
||
Oliver Upton
|
284851ee5c |
KVM: Get rid of return value from kvm_arch_create_vm_debugfs()
The general expectation with debugfs is that any initialization failure is nonfatal. Nevertheless, kvm_arch_create_vm_debugfs() allows implementations to return an error and kvm_create_vm_debugfs() allows that to fail VM creation. Change to a void return to discourage architectures from making debugfs failures fatal for the VM. Seems like everyone already had the right idea, as all implementations already return 0 unconditionally. Acked-by: Marc Zyngier <maz@kernel.org> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20240216155941.2029458-1-oliver.upton@linux.dev Signed-off-by: Oliver Upton <oliver.upton@linux.dev> |
||
Sean Christopherson
|
d02c357e5b |
KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing
Retry page faults without acquiring mmu_lock, and without even faulting the page into the primary MMU, if the resolved gfn is covered by an active invalidation. Contending for mmu_lock is especially problematic on preemptible kernels as the mmu_notifier invalidation task will yield mmu_lock (see rwlock_needbreak()), delay the in-progress invalidation, and ultimately increase the latency of resolving the page fault. And in the worst case scenario, yielding will be accompanied by a remote TLB flush, e.g. if the invalidation covers a large range of memory and vCPUs are accessing addresses that were already zapped. Faulting the page into the primary MMU is similarly problematic, as doing so may acquire locks that need to be taken for the invalidation to complete (the primary MMU has finer grained locks than KVM's MMU), and/or may cause unnecessary churn (getting/putting pages, marking them accessed, etc). Alternatively, the yielding issue could be mitigated by teaching KVM's MMU iterators to perform more work before yielding, but that wouldn't solve the lock contention and would negatively affect scenarios where a vCPU is trying to fault in an address that is NOT covered by the in-progress invalidation. Add a dedicated lockess version of the range-based retry check to avoid false positives on the sanity check on start+end WARN, and so that it's super obvious that checking for a racing invalidation without holding mmu_lock is unsafe (though obviously useful). Wrap mmu_invalidate_in_progress in READ_ONCE() to ensure that pre-checking invalidation in a loop won't put KVM into an infinite loop, e.g. due to caching the in-progress flag and never seeing it go to '0'. Force a load of mmu_invalidate_seq as well, even though it isn't strictly necessary to avoid an infinite loop, as doing so improves the probability that KVM will detect an invalidation that already completed before acquiring mmu_lock and bailing anyways. Do the pre-check even for non-preemptible kernels, as waiting to detect the invalidation until mmu_lock is held guarantees the vCPU will observe the worst case latency in terms of handling the fault, and can generate even more mmu_lock contention. E.g. the vCPU will acquire mmu_lock, detect retry, drop mmu_lock, re-enter the guest, retake the fault, and eventually re-acquire mmu_lock. This behavior is also why there are no new starvation issues due to losing the fairness guarantees provided by rwlocks: if the vCPU needs to retry, it _must_ drop mmu_lock, i.e. waiting on mmu_lock doesn't guarantee forward progress in the face of _another_ mmu_notifier invalidation event. Note, adding READ_ONCE() isn't entirely free, e.g. on x86, the READ_ONCE() may generate a load into a register instead of doing a direct comparison (MOV+TEST+Jcc instead of CMP+Jcc), but practically speaking the added cost is a few bytes of code and maaaaybe a cycle or three. Reported-by: Yan Zhao <yan.y.zhao@intel.com> Closes: https://lore.kernel.org/all/ZNnPF4W26ZbAyGto@yzhao56-desk.sh.intel.com Reported-by: Friedrich Weber <f.weber@proxmox.com> Cc: Kai Huang <kai.huang@intel.com> Cc: Yan Zhao <yan.y.zhao@intel.com> Cc: Yuan Yao <yuan.yao@linux.intel.com> Cc: Xu Yilun <yilun.xu@linux.intel.com> Acked-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20240222012640.2820927-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Masahiro Yamada
|
bf48d9b756 |
kbuild: change tool coverage variables to take the path relative to $(obj)
Commit
|
||
Sean Christopherson
|
5ef1d8c1dd |
KVM: SVM: Flush pages under kvm->lock to fix UAF in svm_register_enc_region()
Do the cache flush of converted pages in svm_register_enc_region() before
dropping kvm->lock to fix use-after-free issues where region and/or its
array of pages could be freed by a different task, e.g. if userspace has
__unregister_enc_region_locked() already queued up for the region.
Note, the "obvious" alternative of using local variables doesn't fully
resolve the bug, as region->pages is also dynamically allocated. I.e. the
region structure itself would be fine, but region->pages could be freed.
Flushing multiple pages under kvm->lock is unfortunate, but the entire
flow is a rare slow path, and the manual flush is only needed on CPUs that
lack coherency for encrypted memory.
Fixes:
|
||
Sean Christopherson
|
a1176ef5c9 |
KVM: x86/mmu: Restrict KVM_SW_PROTECTED_VM to the TDP MMU
Advertise and support software-protected VMs if and only if the TDP MMU is
enabled, i.e. disallow KVM_SW_PROTECTED_VM if TDP is enabled for KVM's
legacy/shadow MMU. TDP support for the shadow MMU is maintenance-only,
e.g. support for TDX and SNP will also be restricted to the TDP MMU.
Fixes:
|
||
Sean Christopherson
|
422692098c |
KVM: x86: Update KVM_SW_PROTECTED_VM docs to make it clear they're a WIP
Rewrite the help message for KVM_SW_PROTECTED_VM to make it clear that
software-protected VMs are a development and testing vehicle for
guest_memfd(), and that attempting to use KVM_SW_PROTECTED_VM for anything
remotely resembling a "real" VM will fail. E.g. any memory accesses from
KVM will incorrectly access shared memory, nested TDP is wildly broken,
and so on and so forth.
Update KVM's API documentation with similar warnings to discourage anyone
from attempting to run anything but selftests with KVM_X86_SW_PROTECTED_VM.
Fixes:
|
||
Sean Christopherson
|
576a15de8d |
KVM: x86/mmu: Free TDP MMU roots while holding mmy_lock for read
Free TDP MMU roots from vCPU context while holding mmu_lock for read, it is completely legal to invoke kvm_tdp_mmu_put_root() as a reader. This eliminates the last mmu_lock writer in the TDP MMU's "fast zap" path after requesting vCPUs to reload roots, i.e. allows KVM to zap invalidated roots, free obsolete roots, and allocate new roots in parallel. On large VMs, e.g. 100+ vCPUs, allowing the bulk of the "fast zap" operation to run in parallel with freeing and allocating roots reduces the worst case latency for a vCPU to reload a root from 2-3ms to <100us. Link: https://lore.kernel.org/r/20240111020048.844847-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Sean Christopherson
|
dab285e4ec |
KVM: x86/mmu: Alloc TDP MMU roots while holding mmu_lock for read
Allocate TDP MMU roots while holding mmu_lock for read, and instead use tdp_mmu_pages_lock to guard against duplicate roots. This allows KVM to create new roots without forcing kvm_tdp_mmu_zap_invalidated_roots() to yield, e.g. allows vCPUs to load new roots after memslot deletion without forcing the zap thread to detect contention and yield (or complete if the kernel isn't preemptible). Note, creating a new TDP MMU root as an mmu_lock reader is safe for two reasons: (1) paths that must guarantee all roots/SPTEs are *visited* take mmu_lock for write and so are still mutually exclusive, e.g. mmu_notifier invalidations, and (2) paths that require all roots/SPTEs to *observe* some given state without holding mmu_lock for write must ensure freshness through some other means, e.g. toggling dirty logging must first wait for SRCU readers to recognize the memslot flags change before processing existing roots/SPTEs. Link: https://lore.kernel.org/r/20240111020048.844847-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Sean Christopherson
|
f5238c2a60 |
KVM: x86/mmu: Check for usable TDP MMU root while holding mmu_lock for read
When allocating a new TDP MMU root, check for a usable root while holding mmu_lock for read and only acquire mmu_lock for write if a new root needs to be created. There is no need to serialize other MMU operations if a vCPU is simply grabbing a reference to an existing root, holding mmu_lock for write is "necessary" (spoiler alert, it's not strictly necessary) only to ensure KVM doesn't end up with duplicate roots. Allowing vCPUs to get "new" roots in parallel is beneficial to VM boot and to setups that frequently delete memslots, i.e. which force all vCPUs to reload all roots. Link: https://lore.kernel.org/r/20240111020048.844847-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Sean Christopherson
|
d746182337 |
KVM: x86/mmu: Skip invalid TDP MMU roots when write-protecting SPTEs
When write-protecting SPTEs, don't process invalid roots as invalid roots are unreachable, i.e. can't be used to access guest memory and thus don't need to be write-protected. Note, this is *almost* a nop for kvm_tdp_mmu_clear_dirty_pt_masked(), which is called under slots_lock, i.e. is mutually exclusive with kvm_mmu_zap_all_fast(). But it's possible for something other than the "fast zap" thread to grab a reference to an invalid root and thus keep a root alive (but completely empty) after kvm_mmu_zap_all_fast() completes. The kvm_tdp_mmu_write_protect_gfn() case is more interesting as KVM write- protects SPTEs for reasons other than dirty logging, e.g. if a KVM creates a SPTE for a nested VM while a fast zap is in-progress. Add another TDP MMU iterator to visit only valid roots, and opportunistically convert kvm_tdp_mmu_get_vcpu_root_hpa() to said iterator. Link: https://lore.kernel.org/r/20240111020048.844847-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Sean Christopherson
|
99b85fda91 |
KVM: x86/mmu: Skip invalid roots when zapping leaf SPTEs for GFN range
When zapping a GFN in response to an APICv or MTRR change, don't zap SPTEs for invalid roots as KVM only needs to ensure the guest can't use stale mappings for the GFN. Unlike kvm_tdp_mmu_unmap_gfn_range(), which must zap "unreachable" SPTEs to ensure KVM doesn't mark a page accessed/dirty, kvm_tdp_mmu_zap_leafs() isn't used (and isn't intended to be used) to handle freeing of host memory. Link: https://lore.kernel.org/r/20240111020048.844847-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Sean Christopherson
|
6577f1efdf |
KVM: x86/mmu: Allow passing '-1' for "all" as_id for TDP MMU iterators
Modify for_each_tdp_mmu_root() and __for_each_tdp_mmu_root_yield_safe() to accept -1 for _as_id to mean "process all memslot address spaces". That way code that wants to process both SMM and !SMM doesn't need to iterate over roots twice (and likely copy+paste code in the process). Deliberately don't cast _as_id to an "int", just in case not casting helps the compiler elide the "_as_id >=0" check when being passed an unsigned value, e.g. from a memslot. No functional change intended. Link: https://lore.kernel.org/r/20240111020048.844847-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Sean Christopherson
|
fcdffe97f8 |
KVM: x86/mmu: Don't do TLB flush when zappings SPTEs in invalid roots
Don't force a TLB flush when zapping SPTEs in invalid roots as vCPUs can't be actively using invalid roots (zapping SPTEs in invalid roots is necessary only to ensure KVM doesn't mark a page accessed/dirty after it is freed by the primary MMU). Link: https://lore.kernel.org/r/20240111020048.844847-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Sean Christopherson
|
8ca983631f |
KVM: x86/mmu: Zap invalidated TDP MMU roots at 4KiB granularity
Zap invalidated TDP MMU roots at maximum granularity, i.e. with more frequent conditional resched checkpoints, in order to avoid running for an extended duration (milliseconds, or worse) without honoring a reschedule request. And for kernels running with full or real-time preempt models, zapping at 4KiB granularity also provides significantly reduced latency for other tasks that are contending for mmu_lock (which isn't necessarily an overall win for KVM, but KVM should do its best to honor the kernel's preemption model). To keep KVM's assertion that zapping at 1GiB granularity is functionally ok, which is the main reason 1GiB was selected in the past, skip straight to zapping at 1GiB if KVM is configured to prove the MMU. Zapping roots is far more common than a vCPU replacing a 1GiB page table with a hugepage, e.g. generally happens multiple times during boot, and so keeping the test coverage provided by root zaps is desirable, just not for production. Cc: David Matlack <dmatlack@google.com> Cc: Pattara Teerapong <pteerapong@google.com> Link: https://lore.kernel.org/r/20240111020048.844847-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Sean Christopherson
|
322d79f1db |
KVM: x86: Clean up directed yield API for "has pending interrupt"
Directly return the boolean result of whether or not a vCPU has a pending interrupt instead of effectively doing: if (true) return true; return false; Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240110003938.490206-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Sean Christopherson
|
9b8615c5d3 |
KVM: x86: Rely solely on preempted_in_kernel flag for directed yield
Snapshot preempted_in_kernel using kvm_arch_vcpu_in_kernel() so that the flag is "accurate" (or rather, consistent and deterministic within KVM) for guests with protected state, and explicitly use preempted_in_kernel when checking if a vCPU was preempted in kernel mode instead of bouncing through kvm_arch_vcpu_in_kernel(). Drop the gnarly logic in kvm_arch_vcpu_in_kernel() that redirects to preempted_in_kernel if the target vCPU is not the "running", i.e. loaded, vCPU, as the only reason that code existed was for the directed yield case where KVM wants to check the CPL of a vCPU that may or may not be loaded on the current pCPU. Cc: Like Xu <like.xu.linux@gmail.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240110003938.490206-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Sean Christopherson
|
77bcd9e623 |
KVM: Add dedicated arch hook for querying if vCPU was preempted in-kernel
Plumb in a dedicated hook for querying whether or not a vCPU was preempted in-kernel. Unlike literally every other architecture, x86's VMX can check if a vCPU is in kernel context if and only if the vCPU is loaded on the current pCPU. x86's kvm_arch_vcpu_in_kernel() works around the limitation by querying kvm_get_running_vcpu() and redirecting to vcpu->arch.preempted_in_kernel as needed. But that's unnecessary, confusing, and fragile, e.g. x86 has had at least one bug where KVM incorrectly used a stale preempted_in_kernel. No functional change intended. Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240110003938.490206-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Sean Christopherson
|
fc3c94142b |
KVM: x86: Sanity check that kvm_has_noapic_vcpu is zero at module_exit()
WARN if kvm.ko is unloaded with an elevated kvm_has_noapic_vcpu to guard against incorrect management of the key, e.g. to detect if KVM fails to decrement the key in error paths. Because kvm_has_noapic_vcpu is purely an optimization, in all likelihood KVM could completely botch handling of kvm_has_noapic_vcpu and no one would notice (which is a good argument for deleting the key entirely, but that's a problem for another day). Note, ideally the sanity check would be performance when kvm_usage_count goes to zero, but adding an arch callback just for this sanity check isn't at all worth doing. Link: https://lore.kernel.org/r/20240209222047.394389-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Sean Christopherson
|
a78d904669 |
KVM: x86: Move "KVM no-APIC vCPU" key management into local APIC code
Move incrementing and decrementing of kvm_has_noapic_vcpu into kvm_create_lapic() and kvm_free_lapic() respectively to fix a benign bug where KVM fails to decrement the count if vCPU creation ultimately fails, e.g. due to a memory allocation failing. Note, the bug is benign as kvm_has_noapic_vcpu is used purely to optimize lapic_in_kernel() checks, and that optimization is quite dubious. That, and practically speaking no setup that cares at all about performance runs with a userspace local APIC. Reported-by: Li RongQing <lirongqing@baidu.com> Cc: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Xu Yilun <yilun.xu@linux.intel.com> Link: https://lore.kernel.org/r/20240209222047.394389-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Sean Christopherson
|
0ec3d6d1f1 |
KVM: x86: Fully defer to vendor code to decide how to force immediate exit
Now that vmx->req_immediate_exit is used only in the scope of vmx_vcpu_run(), use force_immediate_exit to detect that KVM should usurp the VMX preemption to force a VM-Exit and let vendor code fully handle forcing a VM-Exit. Opportunsitically drop __kvm_request_immediate_exit() and just have vendor code call smp_send_reschedule() directly. SVM already does this when injecting an event while also trying to single-step an IRET, i.e. it's not exactly secret knowledge that KVM uses a reschedule IPI to force an exit. Link: https://lore.kernel.org/r/20240110012705.506918-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Sean Christopherson
|
7b3d1bbf8d |
KVM: VMX: Handle KVM-induced preemption timer exits in fastpath for L2
Eat VMX treemption timer exits in the fastpath regardless of whether L1 or L2 is active. The VM-Exit is 100% KVM-induced, i.e. there is nothing directly related to the exit that KVM needs to do on behalf of the guest, thus there is no reason to wait until the slow path to do nothing. Opportunistically add comments explaining why preemption timer exits for emulating the guest's APIC timer need to go down the slow path. Link: https://lore.kernel.org/r/20240110012705.506918-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Sean Christopherson
|
bf1a49436e |
KVM: x86: Move handling of is_guest_mode() into fastpath exit handlers
Let the fastpath code decide which exits can/can't be handled in the fastpath when L2 is active, e.g. when KVM generates a VMX preemption timer exit to forcefully regain control, there is no "work" to be done and so such exits can be handled in the fastpath regardless of whether L1 or L2 is active. Moving the is_guest_mode() check into the fastpath code also makes it easier to see that L2 isn't allowed to use the fastpath in most cases, e.g. it's not immediately obvious why handle_fastpath_preemption_timer() is called from the fastpath and the normal path. Link: https://lore.kernel.org/r/20240110012705.506918-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Sean Christopherson
|
11776aa0cf |
KVM: VMX: Handle forced exit due to preemption timer in fastpath
Handle VMX preemption timer VM-Exits due to KVM forcing an exit in the exit fastpath, i.e. avoid calling back into handle_preemption_timer() for the same exit. There is no work to be done for forced exits, as the name suggests the goal is purely to get control back in KVM. In addition to shaving a few cycles, this will allow cleanly separating handle_fastpath_preemption_timer() from handle_preemption_timer(), e.g. it's not immediately obvious why _apparently_ calling handle_fastpath_preemption_timer() twice on a "slow" exit is necessary: the "slow" call is necessary to handle exits from L2, which are excluded from the fastpath by vmx_vcpu_run(). Link: https://lore.kernel.org/r/20240110012705.506918-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Sean Christopherson
|
e6b5d16bbd |
KVM: VMX: Re-enter guest in fastpath for "spurious" preemption timer exits
Re-enter the guest in the fast path if VMX preeemption timer VM-Exit was "spurious", i.e. if KVM "soft disabled" the timer by writing -1u and by some miracle the timer expired before any other VM-Exit occurred. This is just an intermediate step to cleaning up the preemption timer handling, optimizing these types of spurious VM-Exits is not interesting as they are extremely rare/infrequent. Link: https://lore.kernel.org/r/20240110012705.506918-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Sean Christopherson
|
9c9025ea00 |
KVM: x86: Plumb "force_immediate_exit" into kvm_entry() tracepoint
Annotate the kvm_entry() tracepoint with "immediate exit" when KVM is forcing a VM-Exit immediately after VM-Enter, e.g. when KVM wants to inject an event but needs to first complete some other operation. Knowing that KVM is (or isn't) forcing an exit is useful information when debugging issues related to event injection. Suggested-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20240110012705.506918-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Sean Christopherson
|
dfeef3d3f3 |
KVM: x86: Drop superfluous check on direct MMU vs. WRITE_PF_TO_SP flag
Remove reexecute_instruction()'s final check on the MMU being direct, as
EMULTYPE_WRITE_PF_TO_SP is only ever set if the MMU is indirect, i.e. is a
shadow MMU. Prior to commit
|
||
Sean Christopherson
|
515c18a64e |
KVM: x86: Drop dedicated logic for direct MMUs in reexecute_instruction()
Now that KVM doesn't pointlessly acquire mmu_lock for direct MMUs, drop the dedicated path entirely and always query indirect_shadow_pages when deciding whether or not to try unprotecting the gfn. For indirect, a.k.a. shadow MMUs, checking indirect_shadow_pages is harmless; unless *every* shadow page was somehow zapped while KVM was attempting to emulate the instruction, indirect_shadow_pages is guaranteed to be non-zero. Well, unless the instruction used a direct hugepage with 2-level paging for its code page, but in that case, there's obviously nothing to unprotect. And in the extremely unlikely case all shadow pages were zapped, there's again obviously nothing to unprotect. Link: https://lore.kernel.org/r/20240203002343.383056-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Mingwei Zhang
|
474b99ed70 |
KVM: x86/mmu: Don't acquire mmu_lock when using indirect_shadow_pages as a heuristic
Drop KVM's completely pointless acquisition of mmu_lock when deciding whether or not to unprotect any shadow pages residing at the gfn before resuming the guest to let it retry an instruction that KVM failed to emulated. In this case, indirect_shadow_pages is used as a coarse-grained heuristic to check if there is any chance of there being a relevant shadow page to unprotected. But acquiring mmu_lock largely defeats any benefit to the heuristic, as taking mmu_lock for write is likely far more costly to the VM as a whole than unnecessarily walking mmu_page_hash. Furthermore, the current code is already prone to false negatives and false positives, as it drops mmu_lock before checking the flag and unprotecting shadow pages. And as evidenced by the lack of bug reports, neither false positives nor false negatives are problematic. A false positive simply means that KVM will try to unprotect shadow pages that have already been zapped. And a false negative means that KVM will resume the guest without unprotecting the gfn, i.e. if a shadow page was _just_ created, the vCPU will hit the same page fault and do the whole dance all over again, and detect and unprotect the shadow page the second time around (or not, if something else zaps it first). Reported-by: Jim Mattson <jmattson@google.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> [sean: drop READ_ONCE() and comment change, rewrite changelog] Link: https://lore.kernel.org/r/20240203002343.383056-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Sean Christopherson
|
2a5f091ce1 |
KVM: x86: Open code all direct reads to guest DR6 and DR7
Bite the bullet, and open code all direct reads of DR6 and DR7. KVM currently has a mix of open coded accesses and calls to kvm_get_dr(), which is confusing and ugly because there's no rhyme or reason as to why any particular chunk of code uses kvm_get_dr(). The obvious alternative is to force all accesses through kvm_get_dr(), but it's not at all clear that doing so would be a net positive, e.g. even if KVM ends up wanting/needing to force all reads through a common helper, e.g. to play caching games, the cost of reverting this change is likely lower than the ongoing cost of maintaining weird, arbitrary code. No functional change intended. Cc: Mathias Krause <minipli@grsecurity.net> Reviewed-by: Mathias Krause <minipli@grsecurity.net> Link: https://lore.kernel.org/r/20240209220752.388160-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Sean Christopherson
|
fc5375dd8c |
KVM: x86: Make kvm_get_dr() return a value, not use an out parameter
Convert kvm_get_dr()'s output parameter to a return value, and clean up most of the mess that was created by forcing callers to provide a pointer. No functional change intended. Acked-by: Mathias Krause <minipli@grsecurity.net> Reviewed-by: Mathias Krause <minipli@grsecurity.net> Link: https://lore.kernel.org/r/20240209220752.388160-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Paul Durrant
|
003d914220 |
KVM: x86/xen: allow vcpu_info content to be 'safely' copied
If the guest sets an explicit vcpu_info GPA then, for any of the first 32 vCPUs, the content of the default vcpu_info in the shared_info page must be copied into the new location. Because this copy may race with event delivery (which updates the 'evtchn_pending_sel' field in vcpu_info), event delivery needs to be deferred until the copy is complete. Happily there is already a shadow of 'evtchn_pending_sel' in kvm_vcpu_xen that is used in atomic context if the vcpu_info PFN cache has been invalidated so that the update of vcpu_info can be deferred until the cache can be refreshed (on vCPU thread's the way back into guest context). Use this shadow if the vcpu_info cache has been *deactivated*, so that the VMM can safely copy the vcpu_info content and then re-activate the cache with the new GPA. To do this, stop considering an inactive vcpu_info cache as a hard error in kvm_xen_set_evtchn_fast(), and let the existing kvm_gpc_check() fail and kick the vCPU (if necessary). Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240215152916.1158-21-paul@xen.org [sean: add a bit of verbosity to the changelog] Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Paul Durrant
|
615451d8cb |
KVM: x86/xen: advertize the KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA capability
Now that all relevant kernel changes and selftests are in place, enable the new capability. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240215152916.1158-17-paul@xen.org Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Paul Durrant
|
3991f35805 |
KVM: x86/xen: allow vcpu_info to be mapped by fixed HVA
If the guest does not explicitly set the GPA of vcpu_info structure in memory then, for guests with 32 vCPUs or fewer, the vcpu_info embedded in the shared_info page may be used. As described in a previous commit, the shared_info page is an overlay at a fixed HVA within the VMM, so in this case it also more optimal to activate the vcpu_info cache with a fixed HVA to avoid unnecessary invalidation if the guest memory layout is modified. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240215152916.1158-14-paul@xen.org [sean: use kvm_gpc_is_{gpa,hva}_active()] Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Paul Durrant
|
b9220d3279 |
KVM: x86/xen: allow shared_info to be mapped by fixed HVA
The shared_info page is not guest memory as such. It is a dedicated page allocated by the VMM and overlaid onto guest memory in a GFN chosen by the guest and specified in the XENMEM_add_to_physmap hypercall. The guest may even request that shared_info be moved from one GFN to another by re-issuing that hypercall, but the HVA is never going to change. Because the shared_info page is an overlay the memory slots need to be updated in response to the hypercall. However, memory slot adjustment is not atomic and, whilst all vCPUs are paused, there is still the possibility that events may be delivered (which requires the shared_info page to be updated) whilst the shared_info GPA is absent. The HVA is never absent though, so it makes much more sense to use that as the basis for the kernel's mapping. Hence add a new KVM_XEN_ATTR_TYPE_SHARED_INFO_HVA attribute type for this purpose and a KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA flag to advertize its availability. Don't actually advertize it yet though. That will be done in a subsequent patch, which will also add tests for the new attribute type. Also update the KVM API documentation with the new attribute and also fix it up to consistently refer to 'shared_info' (with the underscore). Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240215152916.1158-13-paul@xen.org [sean: store "hva" as a user pointer, use kvm_gpc_is_{gpa,hva}_active()] Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Paul Durrant
|
18b99e4d6d |
KVM: x86/xen: re-initialize shared_info if guest (32/64-bit) mode is set
If the shared_info PFN cache has already been initialized then the content of the shared_info page needs to be re-initialized whenever the guest mode is (re)set. Setting the guest mode is either done explicitly by the VMM via the KVM_XEN_ATTR_TYPE_LONG_MODE attribute, or implicitly when the guest writes the MSR to set up the hypercall page. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240215152916.1158-12-paul@xen.org Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Paul Durrant
|
c01c55a34f |
KVM: x86/xen: separate initialization of shared_info cache and content
A subsequent patch will allow shared_info to be initialized using either a GPA or a user-space (i.e. VMM) HVA. To make that patch cleaner, separate the initialization of the shared_info content from the activation of the pfncache. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240215152916.1158-11-paul@xen.org Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Paul Durrant
|
a4bff3df51 |
KVM: pfncache: remove KVM_GUEST_USES_PFN usage
As noted in [1] the KVM_GUEST_USES_PFN usage flag is never set by any callers of kvm_gpc_init(), and for good reason: the implementation is incomplete/broken. And it's not clear that there will ever be a user of KVM_GUEST_USES_PFN, as coordinating vCPUs with mmu_notifier events is non-trivial. Remove KVM_GUEST_USES_PFN and all related code, e.g. dropping KVM_GUEST_USES_PFN also makes the 'vcpu' argument redundant, to avoid having to reason about broken code as __kvm_gpc_refresh() evolves. Moreover, all existing callers specify KVM_HOST_USES_PFN so the usage check in hva_to_pfn_retry() and hence the 'usage' argument to kvm_gpc_init() are also redundant. [1] https://lore.kernel.org/all/ZQiR8IpqOZrOpzHC@google.com Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240215152916.1158-6-paul@xen.org [sean: explicitly call out that guest usage is incomplete] Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Paul Durrant
|
78b74638eb |
KVM: pfncache: add a mark-dirty helper
At the moment pages are marked dirty by open-coded calls to mark_page_dirty_in_slot(), directly deferefencing the gpa and memslot from the cache. After a subsequent patch these may not always be set so add a helper now so that caller will protected from the need to know about this detail. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240215152916.1158-5-paul@xen.org [sean: decrease indentation, use gpa_to_gfn()] Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Paul Durrant
|
4438355ec6 |
KVM: x86/xen: mark guest pages dirty with the pfncache lock held
Sampling gpa and memslot from an unlocked pfncache may yield inconsistent values so, since there is no problem with calling mark_page_dirty_in_slot() with the pfncache lock held, relocate the calls in kvm_xen_update_runstate_guest() and kvm_xen_inject_pending_events() accordingly. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240215152916.1158-4-paul@xen.org Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Masahiro Yamada
|
cd14b01846 |
treewide: replace or remove redundant def_bool in Kconfig files
'def_bool X' is a shorthand for 'bool' plus 'default X'. 'def_bool' is redundant where 'bool' is already present, so 'def_bool X' can be replaced with 'default X', or removed if X is 'n'. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> |
||
Pawan Gupta
|
43fb862de8 |
KVM/VMX: Move VERW closer to VMentry for MDS mitigation
During VMentry VERW is executed to mitigate MDS. After VERW, any memory access like register push onto stack may put host data in MDS affected CPU buffers. A guest can then use MDS to sample host data. Although likelihood of secrets surviving in registers at current VERW callsite is less, but it can't be ruled out. Harden the MDS mitigation by moving the VERW mitigation late in VMentry path. Note that VERW for MMIO Stale Data mitigation is unchanged because of the complexity of per-guest conditional VERW which is not easy to handle that late in asm with no GPRs available. If the CPU is also affected by MDS, VERW is unconditionally executed late in asm regardless of guest having MMIO access. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/all/20240213-delay-verw-v8-6-a6216d83edb7%40linux.intel.com |
||
Sean Christopherson
|
706a189dcf |
KVM/VMX: Use BT+JNC, i.e. EFLAGS.CF to select VMRESUME vs. VMLAUNCH
Use EFLAGS.CF instead of EFLAGS.ZF to track whether to use VMRESUME versus VMLAUNCH. Freeing up EFLAGS.ZF will allow doing VERW, which clobbers ZF, for MDS mitigations as late as possible without needing to duplicate VERW for both paths. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com> Link: https://lore.kernel.org/all/20240213-delay-verw-v8-5-a6216d83edb7%40linux.intel.com |
||
Pawan Gupta
|
6613d82e61 |
x86/bugs: Use ALTERNATIVE() instead of mds_user_clear static key
The VERW mitigation at exit-to-user is enabled via a static branch mds_user_clear. This static branch is never toggled after boot, and can be safely replaced with an ALTERNATIVE() which is convenient to use in asm. Switch to ALTERNATIVE() to use the VERW mitigation late in exit-to-user path. Also remove the now redundant VERW in exc_nmi() and arch_exit_to_user_mode(). Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20240213-delay-verw-v8-4-a6216d83edb7%40linux.intel.com |
||
Sean Christopherson
|
910c57dfa4 |
KVM: x86: Mark target gfn of emulated atomic instruction as dirty
When emulating an atomic access on behalf of the guest, mark the target
gfn dirty if the CMPXCHG by KVM is attempted and doesn't fault. This
fixes a bug where KVM effectively corrupts guest memory during live
migration by writing to guest memory without informing userspace that the
page is dirty.
Marking the page dirty got unintentionally dropped when KVM's emulated
CMPXCHG was converted to do a user access. Before that, KVM explicitly
mapped the guest page into kernel memory, and marked the page dirty during
the unmap phase.
Mark the page dirty even if the CMPXCHG fails, as the old data is written
back on failure, i.e. the page is still written. The value written is
guaranteed to be the same because the operation is atomic, but KVM's ABI
is that all writes are dirty logged regardless of the value written. And
more importantly, that's what KVM did before the buggy commit.
Huge kudos to the folks on the Cc list (and many others), who did all the
actual work of triaging and debugging.
Fixes:
|
||
Paolo Bonzini
|
2f8ebe43a0 |
KVM selftests fixes/cleanups (and one KVM x86 cleanup) for 6.8:
- Remove redundant newlines from error messages. - Delete an unused variable in the AMX test (which causes build failures when compiling with -Werror). - Fail instead of skipping tests if open(), e.g. of /dev/kvm, fails with an error code other than ENOENT (a Hyper-V selftest bug resulted in an EMFILE, and the test eventually got skipped). - Fix TSC related bugs in several Hyper-V selftests. - Fix a bug in the dirty ring logging test where a sem_post() could be left pending across multiple runs, resulting in incorrect synchronization between the main thread and the vCPU worker thread. - Relax the dirty log split test's assertions on 4KiB mappings to fix false positives due to the number of mappings for memslot 0 (used for code and data that is NOT being dirty logged) changing, e.g. due to NUMA balancing. - Have KVM's gtod_is_based_on_tsc() return "bool" instead of an "int" (the function generates boolean values, and all callers treat the return value as a bool). -----BEGIN PGP SIGNATURE----- iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmXKupQSHHNlYW5qY0Bn b29nbGUuY29tAAoJEGCRIgFNDBL5DiQP/RNSgLrE9+/3oyqo9zpbhio2dKqz4dIk 8Ga1ZE4R89dyMB9jGKtWn3rEkyma3TsB+neVpG9ohHV6j25JJ0vNAkxQu3Gt+gkl uM1lh/IfXPnAKyuy6dW9tpgZYE1v2/KfdWjeEzzxfPjzY/LX3yFiiCKEnUmfjjzZ sSz91nV4KYS4b4xLWTIcBgNJuyLJuL05htTLmCu7t8DKOBHwHxXjSn8qqG8OvAjs FOhf0zgGJKBFdKOw2Y8XeDdKO0RTEyEPHaFILcLEsuhoVIbY5OUmLe32pAFzzMbG hPawUZ5CzC++e339gUgGkRNY80iSnGcYVcZa+ohxOsNBdOWko9z/eGWZUV7qkYDK dkPHMoDnSzUCE2eSYbEB1eR/KOfziJCWMS9SAIJbJxIGb1HYajikwAEZ6FNp3R+u MyCuNlV9TfsGgt4Dx8RctMeH2ROpORRu7h3WPFUBgG2/jOzPk/OR6U8hSzvmhTvL MykZ8IaLmUIYoK/nCY2iwy50lQRxtZ/htqWn3sidCBGY0DXdNlMhvd3Vk9jtUvY5 Fgof0b564eYfk/qO3cMIDd2WFaDejP28JVSn0CNm6z9i54ubCKkSBEb4kTYXXnVK YBHvbZ21Vjg52trudvK5UPt599sxxNBNiSV32ckLFKHS4ZVGSFSBSbsAWiQF157i CbYntmtJhM+D =infW -----END PGP SIGNATURE----- Merge tag 'kvm-x86-selftests-6.8-rcN' of https://github.com/kvm-x86/linux into HEAD KVM selftests fixes/cleanups (and one KVM x86 cleanup) for 6.8: - Remove redundant newlines from error messages. - Delete an unused variable in the AMX test (which causes build failures when compiling with -Werror). - Fail instead of skipping tests if open(), e.g. of /dev/kvm, fails with an error code other than ENOENT (a Hyper-V selftest bug resulted in an EMFILE, and the test eventually got skipped). - Fix TSC related bugs in several Hyper-V selftests. - Fix a bug in the dirty ring logging test where a sem_post() could be left pending across multiple runs, resulting in incorrect synchronization between the main thread and the vCPU worker thread. - Relax the dirty log split test's assertions on 4KiB mappings to fix false positives due to the number of mappings for memslot 0 (used for code and data that is NOT being dirty logged) changing, e.g. due to NUMA balancing. - Have KVM's gtod_is_based_on_tsc() return "bool" instead of an "int" (the function generates boolean values, and all callers treat the return value as a bool). |
||
Paolo Bonzini
|
22d0bc0721 |
KVM x86 fixes for 6.8:
- Make a KVM_REQ_NMI request while handling KVM_SET_VCPU_EVENTS if and only if the incoming events->nmi.pending is non-zero. If the target vCPU is in the UNITIALIZED state, the spurious request will result in KVM exiting to userspace, which in turn causes QEMU to constantly acquire and release QEMU's global mutex, to the point where the BSP is unable to make forward progress. - Fix a type (u8 versus u64) goof that results in pmu->fixed_ctr_ctrl being incorrectly truncated, and ultimately causes KVM to think a fixed counter has already been disabled (KVM thinks the old value is '0'). - Fix a stack leak in KVM_GET_MSRS where a failed MSR read from userspace that is ultimately ignored due to ignore_msrs=true doesn't zero the output as intended. -----BEGIN PGP SIGNATURE----- iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmXKt90SHHNlYW5qY0Bn b29nbGUuY29tAAoJEGCRIgFNDBL5e5wP/jU3Zuul2e7fb4E6RN/GPhAFSTzG7Cwe 4lVSSSPmOQsEXTKwCOMj7fgwF9qVSLzLRi62MKziTJY/1FDsTcI3xlM7nM2wwQC2 26evIzI3qB54rHQdviuh1jwh6scZH7xLw7kANE+8x4skkm6AZB1IUnj3utR3fEPj mIUA5kGQxEAEDrn0TFzrRgIw4JngKjrCwmpT+vbmR37flC+Rwv8jr4JY1E3cBAT3 KEilv3Fg07gbvagWGZNSSUNqQos5MsnLifdryKbA/vuIJf+j/01CMo5KtLKshiaX t4gXPldVZDXdxjH6im0wRAX4s/FpZg3vVje2OxPbzwMVb5+XvLewzjzagQ1lFA3I gsNXF8uGdYn0fb8T/wQG4ulWBw6A844PSmGONCwLDA+GZuL9xjMIK5d1litvb/im bEP1Ahv6UcnDNKHqRzuFXQENiS2uQdJNLs7p291oDNkTm/CGjDUgFXPuaCehWrUf ZZf1dxmIPM/Xt2j19mS/HnTHD114A8t1GTx799kBXbG4x0ScVQclkhRk6yFG3ObA 14uXxxAdEBoZGBJ2yr5FbddvRLswbWugFoxKbtCZ/CHMopOUQcRRmRb7Lm1NHLtg Ae/sHO6gQ1xcrbwpMCq+6RjFK57yW+n1TB8ZTmAE2RQynGqzReSTlUNtfn3yMg4v hz+2zGzezoeN =92ae -----END PGP SIGNATURE----- Merge tag 'kvm-x86-fixes-6.8-rcN' of https://github.com/kvm-x86/linux into HEAD KVM x86 fixes for 6.8: - Make a KVM_REQ_NMI request while handling KVM_SET_VCPU_EVENTS if and only if the incoming events->nmi.pending is non-zero. If the target vCPU is in the UNITIALIZED state, the spurious request will result in KVM exiting to userspace, which in turn causes QEMU to constantly acquire and release QEMU's global mutex, to the point where the BSP is unable to make forward progress. - Fix a type (u8 versus u64) goof that results in pmu->fixed_ctr_ctrl being incorrectly truncated, and ultimately causes KVM to think a fixed counter has already been disabled (KVM thinks the old value is '0'). - Fix a stack leak in KVM_GET_MSRS where a failed MSR read from userspace that is ultimately ignored due to ignore_msrs=true doesn't zero the output as intended. |
||
Ingo Molnar
|
4589f199eb |
Merge branch 'x86/bugs' into x86/core, to pick up pending changes before dependent patches
Merge in pending alternatives patching infrastructure changes, before applying more patches. Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Linus Torvalds
|
4356e9f841 |
work around gcc bugs with 'asm goto' with outputs
We've had issues with gcc and 'asm goto' before, and we created a 'asm_volatile_goto()' macro for that in the past: see commits |
||
Paolo Bonzini
|
687d8f4c3d |
Merge branch 'kvm-kconfig'
Cleanups to Kconfig definitions for KVM * replace HAVE_KVM with an architecture-dependent symbol, when CONFIG_KVM may or may not be available depending on CPU capabilities (MIPS) * replace HAVE_KVM with IS_ENABLED(CONFIG_KVM) for host-side code that is not part of the KVM module, so that it is completely compiled out * factor common "select" statements in common code instead of requiring each architecture to specify it |
||
Paolo Bonzini
|
f48212ee8e |
treewide: remove CONFIG_HAVE_KVM
It has no users anymore. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
||
Paolo Bonzini
|
61df71ee99 |
kvm: move "select IRQ_BYPASS_MANAGER" to common code
CONFIG_IRQ_BYPASS_MANAGER is a dependency of the common code included by CONFIG_HAVE_KVM_IRQ_BYPASS. There is no advantage in adding the corresponding "select" directive to each architecture. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
||
Paolo Bonzini
|
8886640dad |
kvm: replace __KVM_HAVE_READONLY_MEM with Kconfig symbol
KVM uses __KVM_HAVE_* symbols in the architecture-dependent uapi/asm/kvm.h to mask unused definitions in include/uapi/linux/kvm.h. __KVM_HAVE_READONLY_MEM however was nothing but a misguided attempt to define KVM_CAP_READONLY_MEM only on architectures where KVM_CHECK_EXTENSION(KVM_CAP_READONLY_MEM) could possibly return nonzero. This however does not make sense, and it prevented userspace from supporting this architecture-independent feature without recompilation. Therefore, these days __KVM_HAVE_READONLY_MEM does not mask anything and is only used in virt/kvm/kvm_main.c. Userspace does not need to test it and there should be no need for it to exist. Remove it and replace it with a Kconfig symbol within Linux source code. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
||
Julian Stecklina
|
64435aaa4a |
KVM: x86: rename push to emulate_push for consistency
push and emulate_pop are counterparts. Rename push to emulate_push and harmonize its function signature with emulate_pop. This should remove a bit of cognitive load when reading this code. Signed-off-by: Julian Stecklina <julian.stecklina@cyberus-technology.de> Link: https://lore.kernel.org/r/20231009092054.556935-2-julian.stecklina@cyberus-technology.de Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Julian Stecklina
|
6fd1e3963f |
KVM: x86: Clean up partially uninitialized integer in emulate_pop()
Explicitly zero out variables passed to emulate_pop() as output params to harden against consuming uninitialized data, and to make sanitizers happy. Many flows that use emulate_pop() pass an "unsigned long" so as to be able to hold the largest possible operand, but the actual number of bytes written is usually the word with of the vCPU. E.g. if the vCPU is in 16-bit or 32-bit mode (on a 64-bit host), the upper portion of the output param will be uninitialized. Passing around the uninitialized data is benign, as actual KVM usage of the output is also tied to the word width, but passing around uninitialized data makes some sanitizers rightly complain. Note, initializing the data in emulate_pop() is not a safe alternative, e.g. it would result in em_leave() clobbering RBP[31:16] if LEAVE were emulated with a 16-bit stack. Signed-off-by: Julian Stecklina <julian.stecklina@cyberus-technology.de> Link: https://lore.kernel.org/r/20231009092054.556935-1-julian.stecklina@cyberus-technology.de [sean: massage changelog, drop em_popa() variable size change]] Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Thomas Prescher
|
03f6298c7c |
KVM: x86/emulator: emulate movbe with operand-size prefix
The MOVBE instruction can come with an operand-size prefix (66h). In this, case the x86 emulation code returns EMULATION_FAILED. It turns out that em_movbe can already handle this case and all that is missing is an entry in respective opcode tables to populate gprefix->pfx_66. Signed-off-by: Thomas Prescher <thomas.prescher@cyberus-technology.de> Signed-off-by: Julian Stecklina <julian.stecklina@cyberus-technology.de> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20231212095938.26731-1-julian.stecklina@cyberus-technology.de Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Chao Gao
|
d7f0a00e43 |
KVM: VMX: Report up-to-date exit qualification to userspace
Use vmx_get_exit_qual() to read the exit qualification.
vcpu->arch.exit_qualification is cached for EPT violation only and even
for EPT violation, it is stale at this point because the up-to-date
value is cached later in handle_ept_violation().
Fixes:
|
||
Sean Christopherson
|
fdd58834d1 |
KVM: SVM: Return -EINVAL instead of -EBUSY on attempt to re-init SEV/SEV-ES
Return -EINVAL instead of -EBUSY if userspace attempts KVM_SEV{,ES}_INIT on a VM that already has SEV active. Returning -EBUSY is nonsencial as it's impossible to deactivate SEV without destroying the VM, i.e. the VM isn't "busy" in any sane sense of the word, and the odds of any userspace wanting exactly -EBUSY on a userspace bug are minuscule. Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20240131235609.4161407-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Ashish Kalra
|
0aa6b90ef9 |
KVM: SVM: Add support for allowing zero SEV ASIDs
Some BIOSes allow the end user to set the minimum SEV ASID value
(CPUID 0x8000001F_EDX) to be greater than the maximum number of
encrypted guests, or maximum SEV ASID value (CPUID 0x8000001F_ECX)
in order to dedicate all the SEV ASIDs to SEV-ES or SEV-SNP.
The SEV support, as coded, does not handle the case where the minimum
SEV ASID value can be greater than the maximum SEV ASID value.
As a result, the following confusing message is issued:
[ 30.715724] kvm_amd: SEV enabled (ASIDs 1007 - 1006)
Fix the support to properly handle this case.
Fixes:
|
||
Sean Christopherson
|
466eec4a22 |
KVM: SVM: Use unsigned integers when dealing with ASIDs
Convert all local ASID variables and parameters throughout the SEV code from signed integers to unsigned integers. As ASIDs are fundamentally unsigned values, and the global min/max variables are appropriately unsigned integers, too. Functionally, this is a glorified nop as KVM guarantees min_sev_asid is non-zero, and no CPU supports -1u as the _only_ asid, i.e. the signed vs. unsigned goof won't cause problems in practice. Opportunistically use sev_get_asid() in sev_flush_encrypted_page() instead of open coding an equivalent. Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20240131235609.4161407-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Sean Christopherson
|
cc4ce37bed |
KVM: SVM: Set sev->asid in sev_asid_new() instead of overloading the return
Explicitly set sev->asid in sev_asid_new() when a new ASID is successfully allocated, and return '0' to indicate success instead of overloading the return value to multiplex the ASID with error codes. There is exactly one caller of sev_asid_new(), and sev_asid_free() already consumes sev->asid, i.e. returning the ASID isn't necessary for flexibility, nor does it provide symmetry between related APIs. Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20240131235609.4161407-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Mathias Krause
|
e1dda3afe2 |
KVM: x86: Fix broken debugregs ABI for 32 bit kernels
The ioctl()s to get and set KVM's debug registers are broken for 32 bit
kernels as they'd only copy half of the user register state because of a
UAPI and in-kernel type mismatch (__u64 vs. unsigned long; 8 vs. 4
bytes).
This makes it impossible for userland to set anything but DR0 without
resorting to bit folding tricks.
Switch to a loop for copying debug registers that'll implicitly do the
type conversion for us, if needed.
There are likely no users (left) for 32bit KVM, fix the bug nonetheless.
Fixes:
|
||
Mathias Krause
|
3376ca3f1a |
KVM: x86: Fix KVM_GET_MSRS stack info leak
Commit |
||
Mingwei Zhang
|
05519c86d6 |
KVM: x86/pmu: Fix type length error when reading pmu->fixed_ctr_ctrl
Use a u64 instead of a u8 when taking a snapshot of pmu->fixed_ctr_ctrl
when reprogramming fixed counters, as truncating the value results in KVM
thinking fixed counter 2 is already disabled (the bug also affects fixed
counters 3+, but KVM doesn't yet support those). As a result, if the
guest disables fixed counter 2, KVM will get a false negative and fail to
reprogram/disable emulation of the counter, which can leads to incorrect
counts and spurious PMIs in the guest.
Fixes:
|
||
Tanzir Hasan
|
66a5c40f60 |
kernel.h: removed REPEAT_BYTE from kernel.h
This patch creates wordpart.h and includes it in asm/word-at-a-time.h for all architectures. WORD_AT_A_TIME_CONSTANTS depends on kernel.h because of REPEAT_BYTE. Moving this to another header and including it where necessary allows us to not include the bloated kernel.h. Making this implicit dependency on REPEAT_BYTE explicit allows for later improvements in the lib/string.c inclusion list. Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Suggested-by: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: Tanzir Hasan <tanzirh@google.com> Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> Link: https://lore.kernel.org/r/20231226-libstringheader-v6-1-80aa08c7652c@google.com Signed-off-by: Kees Cook <keescook@chromium.org> |
||
Sean Christopherson
|
83bdfe04c9 |
KVM: x86/pmu: Avoid CPL lookup if PMC enabline for USER and KERNEL is the same
Don't bother querying the CPL if a PMC is (not) counting for both USER and KERNEL, i.e. if the end result is guaranteed to be the same regardless of the CPL. Querying the CPL on Intel requires a VMREAD, i.e. isn't free, and a single CMP+Jcc is cheap. Link: https://lore.kernel.org/r/20231110022857.1273836-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Sean Christopherson
|
e35529fb4a |
KVM: x86/pmu: Check eventsel first when emulating (branch) insns retired
When triggering events, i.e. emulating PMC events in software, check for a matching event selector before checking the event is allowed. The "is allowed" check *might* be cheap, but it could also be very costly, e.g. if userspace has defined a large PMU event filter. The event selector check on the other hand is all but guaranteed to be <10 uops, e.g. looks something like: 0xffffffff8105e615 <+5>: movabs $0xf0000ffff,%rax 0xffffffff8105e61f <+15>: xor %rdi,%rsi 0xffffffff8105e622 <+18>: test %rax,%rsi 0xffffffff8105e625 <+21>: sete %al Link: https://lore.kernel.org/r/20231110022857.1273836-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Sean Christopherson
|
afda2d7666 |
KVM: x86/pmu: Expand the comment about what bits are check emulating events
Expand the comment about what bits are and aren't checked when emulating PMC events in software. As pointed out by Jim, AMD's mask includes bits 35:32, which on Intel overlap with the IN_TX and IN_TXCP bits (32 and 33) as well as reserved bits (34 and 45). Checking The IN_TX* bits is actually correct, as it's safe to assert that the vCPU can't be in an HLE/RTM transaction if KVM is emulating an instruction, i.e. KVM *shouldn't count if either of those bits is set. For the reserved bits, KVM is has equal odds of being right if Intel adds new behavior, i.e. ignoring them is just as likely to be correct as checking them. Opportunistically explain *why* the other flags aren't checked. Suggested-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20231110022857.1273836-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Sean Christopherson
|
f19063b1ca |
KVM: x86/pmu: Snapshot event selectors that KVM emulates in software
Snapshot the event selectors for the events that KVM emulates in software, which is currently instructions retired and branch instructions retired. The event selectors a tied to the underlying CPU, i.e. are constant for a given platform even though perf doesn't manage the mappings as such. Getting the event selectors from perf isn't exactly cheap, especially if mitigations are enabled, as at least one indirect call is involved. Snapshot the values in KVM instead of optimizing perf as working with the raw event selectors will be required if KVM ever wants to emulate events that aren't part of perf's uABI, i.e. that don't have an "enum perf_hw_id" entry. Link: https://lore.kernel.org/r/20231110022857.1273836-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Sean Christopherson
|
d2b321ea93 |
KVM: x86/pmu: Process only enabled PMCs when emulating events in software
Mask off disabled counters based on PERF_GLOBAL_CTRL *before* iterating over PMCs to emulate (branch) instruction required events in software. In the common case where the guest isn't utilizing the PMU, pre-checking for enabled counters turns a relatively expensive search into a few AND uops and a Jcc. Sadly, PMUs without PERF_GLOBAL_CTRL, e.g. most existing AMD CPUs, are out of luck as there is no way to check that a PMC isn't being used without checking the PMC's event selector. Cc: Konstantin Khorenko <khorenko@virtuozzo.com> Link: https://lore.kernel.org/r/20231110022857.1273836-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Sean Christopherson
|
e5a65d4f72 |
KVM: x86/pmu: Add macros to iterate over all PMCs given a bitmap
Add and use kvm_for_each_pmc() to dedup a variety of open coded for-loops that iterate over valid PMCs given a bitmap (and because seeing checkpatch whine about bad macro style is always amusing). No functional change intended. Link: https://lore.kernel.org/r/20231110022857.1273836-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Sean Christopherson
|
004a0aa56e |
KVM: x86/pmu: Snapshot and clear reprogramming bitmap before reprogramming
Refactor the handling of the reprogramming bitmap to snapshot and clear to-be-processed bits before doing the reprogramming, and then explicitly set bits for PMCs that need to be reprogrammed (again). This will allow adding a macro to iterate over all valid PMCs without having to add special handling for the reprogramming bit, which (a) can have bits set for non-existent PMCs and (b) needs to clear such bits to avoid wasting cycles in perpetuity. Note, the existing behavior of clearing bits after reprogramming does NOT have a race with kvm_vm_ioctl_set_pmu_event_filter(). Setting a new PMU filter synchronizes SRCU _before_ setting the bitmap, i.e. guarantees that the vCPU isn't in the middle of reprogramming with a stale filter prior to setting the bitmap. Link: https://lore.kernel.org/r/20231110022857.1273836-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Sean Christopherson
|
b31880ca2f |
KVM: x86/pmu: Move pmc_idx => pmc translation helper to common code
Add a common helper for *internal* PMC lookups, and delete the ops hook and Intel's implementation. Keep AMD's implementation, but rename it to amd_pmu_get_pmc() to make it somewhat more obvious that it's suited for both KVM-internal and guest-initiated lookups. Because KVM tracks all counters in a single bitmap, getting a counter when iterating over a bitmap, e.g. of all valid PMCs, requires a small amount of math, that while simple, isn't super obvious and doesn't use the same semantics as PMC lookups from RDPMC! Although AMD doesn't support fixed counters, the common PMU code still behaves as if there a split, the high half of which just happens to always be empty. Opportunstically add a comment to explain both what is going on, and why KVM uses a single bitmap, e.g. the boilerplate for iterating over separate bitmaps could be done via macros, so it's not (just) about deduplicating code. Link: https://lore.kernel.org/r/20231110022857.1273836-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Sean Christopherson
|
be6b067dae |
KVM: x86/pmu: Add common define to capture fixed counters offset
Add a common define to "officially" solidify KVM's split of counters, i.e. to commit to using bits 31:0 to track general purpose counters and bits 63:32 to track fixed counters (which only Intel supports). KVM already bleeds this behavior all over common PMU code, and adding a KVM- defined macro allows clarifying that the value is a _base_, as oppposed to the _flag_ that is used to access fixed PMCs via RDPMC (which perf confusingly calls INTEL_PMC_FIXED_RDPMC_BASE). No functional change intended. Link: https://lore.kernel.org/r/20231110022857.1273836-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Sean Christopherson
|
f933b88e20 |
KVM: x86/pmu: Zero out PMU metadata on AMD if PMU is disabled
Move the purging of common PMU metadata from intel_pmu_refresh() to
kvm_pmu_refresh(), and invoke the vendor refresh() hook if and only if
the VM is supposed to have a vPMU.
KVM already denies access to the PMU based on kvm->arch.enable_pmu, as
get_gp_pmc_amd() returns NULL for all PMCs in that case, i.e. KVM already
violates AMD's architecture by not virtualizing a PMU (kernels have long
since learned to not panic when the PMU is unavailable). But configuring
the PMU as if it were enabled causes unwanted side effects, e.g. calls to
kvm_pmu_trigger_event() waste an absurd number of cycles due to the
all_valid_pmc_idx bitmap being non-zero.
Fixes:
|
||
Vitaly Kuznetsov
|
9e62797fd7 |
KVM: x86: Make gtod_is_based_on_tsc() return 'bool'
gtod_is_based_on_tsc() is boolean in nature, i.e. it returns '1' for good clocksources and '0' otherwise. Moreover, its result is used raw by kvm_get_time_and_clockread()/kvm_get_walltime_and_clockread() which are 'bool'. No functional change intended. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20240109141121.1619463-6-vkuznets@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Kunwu Chan
|
0dbd054699 |
KVM: x86/mmu: Use KMEM_CACHE instead of kmem_cache_create()
Use the new KMEM_CACHE() macro instead of direct kmem_cache_create to simplify the creation of SLAB caches. Note, KMEM_CACHE() uses the required alignment of the struct, '8' as the alignment, whereas KVM's existing code passes '0'. In the end, the two values yield the same result as x86's minimum slab alignment is also '8' (which is not at all coincidental). Signed-off-by: Kunwu Chan <chentao@kylinos.cn> Link: https://lore.kernel.org/r/20240116100025.95702-1-chentao@kylinos.cn [sean: call out alignment behavior] Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Maciej S. Szmigiero
|
d52734d00b |
KVM: x86: Give a hint when Win2016 might fail to boot due to XSAVES erratum
Since commit |
||
Tengfei Yu
|
9e05d9b067 |
KVM: x86: Check irqchip mode before create PIT
As the kvm api(https://docs.kernel.org/virt/kvm/api.html) reads, KVM_CREATE_PIT2 call is only valid after enabling in-kernel irqchip support via KVM_CREATE_IRQCHIP. Without this check, I can create PIT first and enable irqchip-split then, which may cause the PIT invalid because of lacking of in-kernel PIC to inject the interrupt. Signed-off-by: Tengfei Yu <moehanabichan@gmail.com> Message-Id: <20240125050823.4893-1-moehanabichan@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
||
Xin Li
|
70d0fe5d09 |
KVM: VMX: Call fred_entry_from_kvm() for IRQ/NMI handling
When FRED is enabled, call fred_entry_from_kvm() to handle IRQ/NMI in IRQ/NMI induced VM exits. Signed-off-by: Xin Li <xin3.li@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Shan Kang <shan.kang@intel.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20231205105030.8698-33-xin3.li@intel.com |
||
Prasad Pandit
|
6231c9e1a9 |
KVM: x86: make KVM_REQ_NMI request iff NMI pending for vcpu
kvm_vcpu_ioctl_x86_set_vcpu_events() routine makes 'KVM_REQ_NMI'
request for a vcpu even when its 'events->nmi.pending' is zero.
Ex:
qemu_thread_start
kvm_vcpu_thread_fn
qemu_wait_io_event
qemu_wait_io_event_common
process_queued_cpu_work
do_kvm_cpu_synchronize_post_init/_reset
kvm_arch_put_registers
kvm_put_vcpu_events (cpu, level=[2|3])
This leads vCPU threads in QEMU to constantly acquire & release the
global mutex lock, delaying the guest boot due to lock contention.
Add check to make KVM_REQ_NMI request only if vcpu has NMI pending.
Fixes:
|
||
Sean Christopherson
|
a634c76b2c |
KVM: x86/pmu: Explicitly check for RDPMC of unsupported Intel PMC types
Explicitly check for attempts to read unsupported PMC types instead of letting the bounds check fail. Functionally, letting the check fail is ok, but it's unnecessarily subtle and does a poor job of documenting the architectural behavior that KVM is emulating. Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240109230250.424295-12-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Sean Christopherson
|
7a0fc734c2 |
KVM: x86/pmu: Treat "fixed" PMU type in RDPMC as index as a value, not flag
Refactor KVM's handling of ECX for RDPMC to treat the FIXED modifier as an explicit value, not a flag (minus one wart). While non-architectural PMUs do use bit 31 as a flag (for "fast" reads), architectural PMUs use the upper half of ECX to encode the type. From the SDM: ECX[31:16] specifies type of PMC while ECX[15:0] specifies the index of the PMC to be read within that type Note, that the known supported types are 4000H and 2000H, i.e. look a lot like flags, doesn't contradict the above statement that ECX[31:16] holds the type, at least not by any sane reading of the SDM. Keep the explicitly clearing of the FIXED "flag", as KVM subtly relies on that behavior to disallow unsupported types while allowing the correct indices for fixed counters. This wart will be cleaned up in short order. Opportunistically grab the per-type bitmask in the if-else blocks to eliminate the one-off usage of the local "fixed" bool. Reported-by: Jim Mattson <jmattson@google.com> Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240109230250.424295-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Sean Christopherson
|
5728a4a0ea |
KVM: x86/pmu: Disallow "fast" RDPMC for architectural Intel PMUs
Inject #GP on RDPMC if the "fast" flag is set for architectural Intel PMUs, i.e. if the PMU version is non-zero. Per Intel's SDM, and confirmed on bare metal, the "fast" flag is supported only for non-architectural PMUs, and is reserved for architectural PMUs. If the processor does not support architectural performance monitoring (CPUID.0AH:EAX[7:0]=0), ECX[30:0] specifies the index of the PMC to be read. Setting ECX[31] selects “fast” read mode if supported. In this mode, RDPMC returns bits 31:0 of the PMC in EAX while clearing EDX to zero. If the processor does support architectural performance monitoring (CPUID.0AH:EAX[7:0] ≠ 0), ECX[31:16] specifies type of PMC while ECX[15:0] specifies the index of the PMC to be read within that type. The following PMC types are currently defined: — General-purpose counters use type 0. The index x (to read IA32_PMCx) must be less than the value enumerated by CPUID.0AH.EAX[15:8] (thus ECX[15:8] must be zero). — Fixed-function counters use type 4000H. The index x (to read IA32_FIXED_CTRx) can be used if either CPUID.0AH.EDX[4:0] > x or CPUID.0AH.ECX[x] = 1 (thus ECX[15:5] must be 0). — Performance metrics use type 2000H. This type can be used only if IA32_PERF_CAPABILITIES.PERF_METRICS_AVAILABLE[bit 15]=1. For this type, the index in ECX[15:0] is implementation specific. Opportunistically WARN if KVM ever actually tries to complete RDPMC for a non-architectural PMU, and drop the non-existent "support" for fast RDPMC, as KVM doesn't support such PMUs, i.e. kvm_pmu_rdpmc() should reject the RDPMC before getting to the Intel code. Fixes: |
||
Sean Christopherson
|
d652981db0 |
KVM: x86/pmu: Apply "fast" RDPMC only to Intel PMUs
Move the handling of "fast" RDPMC instructions, which drop bits 63:32 of
the count, to Intel. The "fast" flag, and all modifiers for that matter,
are Intel-only and aren't supported by AMD.
Opportunistically replace open coded bit crud with proper #defines, and
add comments to try and disentangle the flags vs. values mess for
non-architectural vs. architectural PMUs.
Fixes:
|
||
Sean Christopherson
|
7bb7fce136 |
KVM: x86/pmu: Prioritize VMX interception over #GP on RDPMC due to bad index
Apply the pre-intercepts RDPMC validity check only to AMD, and rename all
relevant functions to make it as clear as possible that the check is not a
standard PMC index check. On Intel, the basic rule is that only invalid
opcodes and privilege/permission/mode checks have priority over VM-Exit,
i.e. RDPMC with an invalid index should VM-Exit, not #GP. While the SDM
doesn't explicitly call out RDPMC, it _does_ explicitly use RDMSR of a
non-existent MSR as an example where VM-Exit has priority over #GP, and
RDPMC is effectively just a variation of RDMSR.
Manually testing on various Intel CPUs confirms this behavior, and the
inverted priority was introduced for SVM compatibility, i.e. was not an
intentional change for Intel PMUs. On AMD, *all* exceptions on RDPMC have
priority over VM-Exit.
Check for a NULL kvm_pmu_ops.check_rdpmc_early instead of using a RET0
static call so as to provide a convenient location to document the
difference between Intel and AMD, and to again try to make it as obvious
as possible that the early check is a one-off thing, not a generic "is
this PMC valid?" helper.
Fixes:
|
||
Sean Christopherson
|
ecb490770a |
KVM: x86/pmu: Don't ignore bits 31:30 for RDPMC index on AMD
Stop stripping bits 31:30 prior to validating/consuming the RDPMC index on
AMD. Per the APM's documentation of RDPMC, *values* greater than 27 are
reserved. The behavior of upper bits being flags is firmly Intel-only.
Fixes:
|
||
Sean Christopherson
|
7a277c2241 |
KVM: x86/pmu: Get eventsel for fixed counters from perf
Get the event selectors used to effectively request fixed counters for
perf events from perf itself instead of hardcoding them in KVM and hoping
that they match the underlying hardware. While fixed counters 0 and 1 use
architectural events, as of
|
||
Sean Christopherson
|
61bb2ad795 |
KVM: x86/pmu: Setup fixed counters' eventsel during PMU initialization
Set the eventsel for all fixed counters during PMU initialization, the eventsel is hardcoded and consumed if and only if the counter is supported, i.e. there is no reason to redo the setup every time the PMU is refreshed. Configuring all KVM-supported fixed counter also eliminates a potential pitfall if/when KVM supports discontiguous fixed counters, in which case configuring only nr_arch_fixed_counters will be insufficient (ignoring the fact that KVM will need many other changes to support discontiguous fixed counters). Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240109230250.424295-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Sean Christopherson
|
db9e008a0f |
KVM: x86/pmu: Remove KVM's enumeration of Intel's architectural encodings
Drop KVM's enumeration of Intel's architectural event encodings, and instead open code the three encodings (of which only two are real) that KVM uses to emulate fixed counters. Now that KVM doesn't incorrectly enforce the availability of architectural encodings, there is no reason for KVM to ever care about the encodings themselves, at least not in the current format of an array indexed by the encoding's position in CPUID. Opportunistically add a comment to explain why KVM cares about eventsel values for fixed counters. Suggested-by: Jim Mattson <jmattson@google.com> Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240109230250.424295-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Sean Christopherson
|
cbbd1aa891 |
KVM: x86/pmu: Allow programming events that match unsupported arch events
Remove KVM's bogus restriction that the guest can't program an event whose
encoding matches an unsupported architectural event. The enumeration of
an architectural event only says that if a CPU supports an architectural
event, then the event can be programmed using the architectural encoding.
The enumeration does NOT say anything about the encoding when the CPU
doesn't report support the architectural event.
Preventing the guest from counting events whose encoding happens to match
an architectural event breaks existing functionality whenever Intel adds
an architectural encoding that was *ever* used for a CPU that doesn't
enumerate support for the architectural event, even if the encoding is for
the exact same event!
E.g. the architectural encoding for Top-Down Slots is 0x01a4. Broadwell
CPUs, which do not support the Top-Down Slots architectural event, 0x01a4
is a valid, model-specific event. Denying guest usage of 0x01a4 if/when
KVM adds support for Top-Down slots would break any Broadwell-based guest.
Reported-by: Kan Liang <kan.liang@linux.intel.com>
Closes: https://lore.kernel.org/all/2004baa6-b494-462c-a11f-8104ea152c6a@linux.intel.com
Fixes:
|
||
Sean Christopherson
|
5eb7fcbdea |
KVM: x86/pmu: Always treat Fixed counters as available when supported
Treat fixed counters as available when they are supported, i.e. don't
silently ignore an enabled fixed counter just because guest CPUID says the
associated general purpose architectural event is unavailable.
KVM originally treated fixed counters as always available, but that got
changed as part of a fix to avoid confusing REF_CPU_CYCLES, which does NOT
map to an architectural event, with the actual architectural event used
associated with bit 7, TOPDOWN_SLOTS.
The commit justified the change with:
If the event is marked as unavailable in the Intel guest CPUID
0AH.EBX leaf, we need to avoid any perf_event creation, whether
it's a gp or fixed counter.
but that justification doesn't mesh with reality. The Intel SDM uses
"architectural events" to refer to both general purpose events (the ones
with the reverse polarity mask in CPUID.0xA.EBX) and the events for fixed
counters, e.g. the SDM makes statements like:
Each of the fixed-function PMC can count only one architectural
performance event.
but the fact that fixed counter 2 (TSC reference cycles) doesn't have an
associated general purpose architectural makes trying to apply the mask
from CPUID.0xA.EBX impossible.
Furthermore, the lack of enumeration for an architectural event in CPUID
only means the CPU doesn't officially support the architectural encoding,
i.e. it doesn't mean using the architectural encoding _won't_ work, it
sipmly means there are no guarantees that it will work as expected. E.g.
if KVM is running in a VM that advertises a fixed counters but not the
corresponding architectural event encoding, and perf decides to use a
general purpose counter instead of a fixed counter, odds are very good
that the underlying hardware actually does support the architectrual
encoding, and that programming the encoding will count the right thing.
In other words, asking perf to count the event will probably work, whereas
intentionally doing nothing is obviously guaranteed to fail.
Note, at the time of the change, KVM didn't enforce hardware support, i.e.
didn't prevent userspace from enumerating support in guest CPUID.0xA.EBX
for architectural events that aren't supported in hardware. I.e. silently
dropping the fixed counter didn't somehow protection against counting the
wrong event, it just enforced guest CPUID. And practically speaking, this
issue is almost certainly limited to running KVM on a funky virtual CPU
model. No known real hardware has an asymmetric PMU where a fixed counter
is supported but the associated architectural event is not.
Fixes:
|
||
Brijesh Singh
|
75253db41a |
KVM: SEV: Make AVIC backing, VMSA and VMCB memory allocation SNP safe
Implement a workaround for an SNP erratum where the CPU will incorrectly signal an RMP violation #PF if a hugepage (2MB or 1GB) collides with the RMP entry of a VMCB, VMSA or AVIC backing page. When SEV-SNP is globally enabled, the CPU marks the VMCB, VMSA, and AVIC backing pages as "in-use" via a reserved bit in the corresponding RMP entry after a successful VMRUN. This is done for _all_ VMs, not just SNP-Active VMs. If the hypervisor accesses an in-use page through a writable translation, the CPU will throw an RMP violation #PF. On early SNP hardware, if an in-use page is 2MB-aligned and software accesses any part of the associated 2MB region with a hugepage, the CPU will incorrectly treat the entire 2MB region as in-use and signal a an RMP violation #PF. To avoid this, the recommendation is to not use a 2MB-aligned page for the VMCB, VMSA or AVIC pages. Add a generic allocator that will ensure that the page returned is not 2MB-aligned and is safe to be used when SEV-SNP is enabled. Also implement similar handling for the VMCB/VMSA pages of nested guests. [ mdr: Squash in nested guest handling from Ashish, commit msg fixups. ] Reported-by: Alper Gun <alpergun@google.com> # for nested VMSA case Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> Co-developed-by: Marc Orr <marcorr@google.com> Signed-off-by: Marc Orr <marcorr@google.com> Co-developed-by: Ashish Kalra <ashish.kalra@amd.com> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20240126041126.1927228-22-michael.roth@amd.com |
||
Brijesh Singh
|
1ca5614b84 |
crypto: ccp: Add support to initialize the AMD-SP for SEV-SNP
Before SNP VMs can be launched, the platform must be appropriately configured and initialized via the SNP_INIT command. During the execution of SNP_INIT command, the firmware configures and enables SNP security policy enforcement in many system components. Some system components write to regions of memory reserved by early x86 firmware (e.g. UEFI). Other system components write to regions provided by the operation system, hypervisor, or x86 firmware. Such system components can only write to HV-fixed pages or Default pages. They will error when attempting to write to pages in other page states after SNP_INIT enables their SNP enforcement. Starting in SNP firmware v1.52, the SNP_INIT_EX command takes a list of system physical address ranges to convert into the HV-fixed page states during the RMP initialization. If INIT_RMP is 1, hypervisors should provide all system physical address ranges that the hypervisor will never assign to a guest until the next RMP re-initialization. For instance, the memory that UEFI reserves should be included in the range list. This allows system components that occasionally write to memory (e.g. logging to UEFI reserved regions) to not fail due to RMP initialization and SNP enablement. Note that SNP_INIT(_EX) must not be executed while non-SEV guests are executing, otherwise it is possible that the system could reset or hang. The psp_init_on_probe module parameter was added for SEV/SEV-ES support and the init_ex_path module parameter to allow for time for the necessary file system to be mounted/available. SNP_INIT(_EX) does not use the file associated with init_ex_path. So, to avoid running into issues where SNP_INIT(_EX) is called while there are other running guests, issue it during module probe regardless of the psp_init_on_probe setting, but maintain the previous deferrable handling for SEV/SEV-ES initialization. [ mdr: Squash in psp_init_on_probe changes from Tom, reduce proliferation of 'probe' function parameter where possible. bp: Fix 32-bit allmodconfig build. ] Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> Co-developed-by: Ashish Kalra <ashish.kalra@amd.com> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> Co-developed-by: Jarkko Sakkinen <jarkko@profian.com> Signed-off-by: Jarkko Sakkinen <jarkko@profian.com> Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240126041126.1927228-14-michael.roth@amd.com |
||
Nikolay Borisov
|
955997e880 |
KVM: x86: Use mutex guards to eliminate __kvm_x86_vendor_init()
Use the recently introduced guard(mutex) infrastructure acquire and automatically release vendor_module_lock when the guard goes out of scope. Drop the inner __kvm_x86_vendor_init(), its sole purpose was to simplify releasing vendor_module_lock in error paths. No functional change intended. Signed-off-by: Nikolay Borisov <nik.borisov@suse.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20231030141728.1406118-1-nik.borisov@suse.com [sean: rewrite changelog] Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
Linus Torvalds
|
09d1c6a80f |
Generic:
- Use memdup_array_user() to harden against overflow. - Unconditionally advertise KVM_CAP_DEVICE_CTRL for all architectures. - Clean up Kconfigs that all KVM architectures were selecting - New functionality around "guest_memfd", a new userspace API that creates an anonymous file and returns a file descriptor that refers to it. guest_memfd files are bound to their owning virtual machine, cannot be mapped, read, or written by userspace, and cannot be resized. guest_memfd files do however support PUNCH_HOLE, which can be used to switch a memory area between guest_memfd and regular anonymous memory. - New ioctl KVM_SET_MEMORY_ATTRIBUTES allowing userspace to specify per-page attributes for a given page of guest memory; right now the only attribute is whether the guest expects to access memory via guest_memfd or not, which in Confidential SVMs backed by SEV-SNP, TDX or ARM64 pKVM is checked by firmware or hypervisor that guarantees confidentiality (AMD PSP, Intel TDX module, or EL2 in the case of pKVM). x86: - Support for "software-protected VMs" that can use the new guest_memfd and page attributes infrastructure. This is mostly useful for testing, since there is no pKVM-like infrastructure to provide a meaningfully reduced TCB. - Fix a relatively benign off-by-one error when splitting huge pages during CLEAR_DIRTY_LOG. - Fix a bug where KVM could incorrectly test-and-clear dirty bits in non-leaf TDP MMU SPTEs if a racing thread replaces a huge SPTE with a non-huge SPTE. - Use more generic lockdep assertions in paths that don't actually care about whether the caller is a reader or a writer. - let Xen guests opt out of having PV clock reported as "based on a stable TSC", because some of them don't expect the "TSC stable" bit (added to the pvclock ABI by KVM, but never set by Xen) to be set. - Revert a bogus, made-up nested SVM consistency check for TLB_CONTROL. - Advertise flush-by-ASID support for nSVM unconditionally, as KVM always flushes on nested transitions, i.e. always satisfies flush requests. This allows running bleeding edge versions of VMware Workstation on top of KVM. - Sanity check that the CPU supports flush-by-ASID when enabling SEV support. - On AMD machines with vNMI, always rely on hardware instead of intercepting IRET in some cases to detect unmasking of NMIs - Support for virtualizing Linear Address Masking (LAM) - Fix a variety of vPMU bugs where KVM fail to stop/reset counters and other state prior to refreshing the vPMU model. - Fix a double-overflow PMU bug by tracking emulated counter events using a dedicated field instead of snapshotting the "previous" counter. If the hardware PMC count triggers overflow that is recognized in the same VM-Exit that KVM manually bumps an event count, KVM would pend PMIs for both the hardware-triggered overflow and for KVM-triggered overflow. - Turn off KVM_WERROR by default for all configs so that it's not inadvertantly enabled by non-KVM developers, which can be problematic for subsystems that require no regressions for W=1 builds. - Advertise all of the host-supported CPUID bits that enumerate IA32_SPEC_CTRL "features". - Don't force a masterclock update when a vCPU synchronizes to the current TSC generation, as updating the masterclock can cause kvmclock's time to "jump" unexpectedly, e.g. when userspace hotplugs a pre-created vCPU. - Use RIP-relative address to read kvm_rebooting in the VM-Enter fault paths, partly as a super minor optimization, but mostly to make KVM play nice with position independent executable builds. - Guard KVM-on-HyperV's range-based TLB flush hooks with an #ifdef on CONFIG_HYPERV as a minor optimization, and to self-document the code. - Add CONFIG_KVM_HYPERV to allow disabling KVM support for HyperV "emulation" at build time. ARM64: - LPA2 support, adding 52bit IPA/PA capability for 4kB and 16kB base granule sizes. Branch shared with the arm64 tree. - Large Fine-Grained Trap rework, bringing some sanity to the feature, although there is more to come. This comes with a prefix branch shared with the arm64 tree. - Some additional Nested Virtualization groundwork, mostly introducing the NV2 VNCR support and retargetting the NV support to that version of the architecture. - A small set of vgic fixes and associated cleanups. Loongarch: - Optimization for memslot hugepage checking - Cleanup and fix some HW/SW timer issues - Add LSX/LASX (128bit/256bit SIMD) support RISC-V: - KVM_GET_REG_LIST improvement for vector registers - Generate ISA extension reg_list using macros in get-reg-list selftest - Support for reporting steal time along with selftest s390: - Bugfixes Selftests: - Fix an annoying goof where the NX hugepage test prints out garbage instead of the magic token needed to run the test. - Fix build errors when a header is delete/moved due to a missing flag in the Makefile. - Detect if KVM bugged/killed a selftest's VM and print out a helpful message instead of complaining that a random ioctl() failed. - Annotate the guest printf/assert helpers with __printf(), and fix the various bugs that were lurking due to lack of said annotation. There are two non-KVM patches buried in the middle of guest_memfd support: fs: Rename anon_inode_getfile_secure() and anon_inode_getfd_secure() mm: Add AS_UNMOVABLE to mark mapping as completely unmovable The first is small and mostly suggested-by Christian Brauner; the second a bit less so but it was written by an mm person (Vlastimil Babka). -----BEGIN PGP SIGNATURE----- iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmWcMWkUHHBib256aW5p QHJlZGhhdC5jb20ACgkQv/vSX3jHroO15gf/WLmmg3SET6Uzw9iEq2xo28831ZA+ 6kpILfIDGKozV5safDmMvcInlc/PTnqOFrsKyyN4kDZ+rIJiafJdg/loE0kPXBML wdR+2ix5kYI1FucCDaGTahskBDz8Lb/xTpwGg9BFLYFNmuUeHc74o6GoNvr1uliE 4kLZL2K6w0cSMPybUD+HqGaET80ZqPwecv+s1JL+Ia0kYZJONJifoHnvOUJ7DpEi rgudVdgzt3EPjG0y1z6MjvDBXTCOLDjXajErlYuZD3Ej8N8s59Dh2TxOiDNTLdP4 a4zjRvDmgyr6H6sz+upvwc7f4M4p+DBvf+TkWF54mbeObHUYliStqURIoA== =66Ws -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm updates from Paolo Bonzini: "Generic: - Use memdup_array_user() to harden against overflow. - Unconditionally advertise KVM_CAP_DEVICE_CTRL for all architectures. - Clean up Kconfigs that all KVM architectures were selecting - New functionality around "guest_memfd", a new userspace API that creates an anonymous file and returns a file descriptor that refers to it. guest_memfd files are bound to their owning virtual machine, cannot be mapped, read, or written by userspace, and cannot be resized. guest_memfd files do however support PUNCH_HOLE, which can be used to switch a memory area between guest_memfd and regular anonymous memory. - New ioctl KVM_SET_MEMORY_ATTRIBUTES allowing userspace to specify per-page attributes for a given page of guest memory; right now the only attribute is whether the guest expects to access memory via guest_memfd or not, which in Confidential SVMs backed by SEV-SNP, TDX or ARM64 pKVM is checked by firmware or hypervisor that guarantees confidentiality (AMD PSP, Intel TDX module, or EL2 in the case of pKVM). x86: - Support for "software-protected VMs" that can use the new guest_memfd and page attributes infrastructure. This is mostly useful for testing, since there is no pKVM-like infrastructure to provide a meaningfully reduced TCB. - Fix a relatively benign off-by-one error when splitting huge pages during CLEAR_DIRTY_LOG. - Fix a bug where KVM could incorrectly test-and-clear dirty bits in non-leaf TDP MMU SPTEs if a racing thread replaces a huge SPTE with a non-huge SPTE. - Use more generic lockdep assertions in paths that don't actually care about whether the caller is a reader or a writer. - let Xen guests opt out of having PV clock reported as "based on a stable TSC", because some of them don't expect the "TSC stable" bit (added to the pvclock ABI by KVM, but never set by Xen) to be set. - Revert a bogus, made-up nested SVM consistency check for TLB_CONTROL. - Advertise flush-by-ASID support for nSVM unconditionally, as KVM always flushes on nested transitions, i.e. always satisfies flush requests. This allows running bleeding edge versions of VMware Workstation on top of KVM. - Sanity check that the CPU supports flush-by-ASID when enabling SEV support. - On AMD machines with vNMI, always rely on hardware instead of intercepting IRET in some cases to detect unmasking of NMIs - Support for virtualizing Linear Address Masking (LAM) - Fix a variety of vPMU bugs where KVM fail to stop/reset counters and other state prior to refreshing the vPMU model. - Fix a double-overflow PMU bug by tracking emulated counter events using a dedicated field instead of snapshotting the "previous" counter. If the hardware PMC count triggers overflow that is recognized in the same VM-Exit that KVM manually bumps an event count, KVM would pend PMIs for both the hardware-triggered overflow and for KVM-triggered overflow. - Turn off KVM_WERROR by default for all configs so that it's not inadvertantly enabled by non-KVM developers, which can be problematic for subsystems that require no regressions for W=1 builds. - Advertise all of the host-supported CPUID bits that enumerate IA32_SPEC_CTRL "features". - Don't force a masterclock update when a vCPU synchronizes to the current TSC generation, as updating the masterclock can cause kvmclock's time to "jump" unexpectedly, e.g. when userspace hotplugs a pre-created vCPU. - Use RIP-relative address to read kvm_rebooting in the VM-Enter fault paths, partly as a super minor optimization, but mostly to make KVM play nice with position independent executable builds. - Guard KVM-on-HyperV's range-based TLB flush hooks with an #ifdef on CONFIG_HYPERV as a minor optimization, and to self-document the code. - Add CONFIG_KVM_HYPERV to allow disabling KVM support for HyperV "emulation" at build time. ARM64: - LPA2 support, adding 52bit IPA/PA capability for 4kB and 16kB base granule sizes. Branch shared with the arm64 tree. - Large Fine-Grained Trap rework, bringing some sanity to the feature, although there is more to come. This comes with a prefix branch shared with the arm64 tree. - Some additional Nested Virtualization groundwork, mostly introducing the NV2 VNCR support and retargetting the NV support to that version of the architecture. - A small set of vgic fixes and associated cleanups. Loongarch: - Optimization for memslot hugepage checking - Cleanup and fix some HW/SW timer issues - Add LSX/LASX (128bit/256bit SIMD) support RISC-V: - KVM_GET_REG_LIST improvement for vector registers - Generate ISA extension reg_list using macros in get-reg-list selftest - Support for reporting steal time along with selftest s390: - Bugfixes Selftests: - Fix an annoying goof where the NX hugepage test prints out garbage instead of the magic token needed to run the test. - Fix build errors when a header is delete/moved due to a missing flag in the Makefile. - Detect if KVM bugged/killed a selftest's VM and print out a helpful message instead of complaining that a random ioctl() failed. - Annotate the guest printf/assert helpers with __printf(), and fix the various bugs that were lurking due to lack of said annotation" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (185 commits) x86/kvm: Do not try to disable kvmclock if it was not enabled KVM: x86: add missing "depends on KVM" KVM: fix direction of dependency on MMU notifiers KVM: introduce CONFIG_KVM_COMMON KVM: arm64: Add missing memory barriers when switching to pKVM's hyp pgd KVM: arm64: vgic-its: Avoid potential UAF in LPI translation cache RISC-V: KVM: selftests: Add get-reg-list test for STA registers RISC-V: KVM: selftests: Add steal_time test support RISC-V: KVM: selftests: Add guest_sbi_probe_extension RISC-V: KVM: selftests: Move sbi_ecall to processor.c RISC-V: KVM: Implement SBI STA extension RISC-V: KVM: Add support for SBI STA registers RISC-V: KVM: Add support for SBI extension registers RISC-V: KVM: Add SBI STA info to vcpu_arch RISC-V: KVM: Add steal-update vcpu request RISC-V: KVM: Add SBI STA extension skeleton RISC-V: paravirt: Implement steal-time support RISC-V: Add SBI STA extension definitions RISC-V: paravirt: Add skeleton for pv-time support RISC-V: KVM: Fix indentation in kvm_riscv_vcpu_set_reg_csr() ... |
||
Breno Leitao
|
aefb2f2e61 |
x86/bugs: Rename CONFIG_RETPOLINE => CONFIG_MITIGATION_RETPOLINE
Step 5/10 of the namespace unification of CPU mitigations related Kconfig options. [ mingo: Converted a few more uses in comments/messages as well. ] Suggested-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ariel Miculas <amiculas@cisco.com> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20231121160740.1249350-6-leitao@debian.org |
||
Linus Torvalds
|
b51cc5d028 |
x86/cleanups changes for v6.8:
- A micro-optimization got misplaced as a cleanup: - Micro-optimize the asm code in secondary_startup_64_no_verify() - Change global variables to local - Add missing kernel-doc function parameter descriptions - Remove unused parameter from a macro - Remove obsolete Kconfig entry - Fix comments - Fix typos, mostly scripted, manually reviewed Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmWb2i8RHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1iFIQ//RjqKWmEBfv0UVCNgtRgkUKOvYVkfhC1R FykHWbSE+/oDODS7B+gbWqzl9Fq2Oxx9re4KZuMfnojE96KZ6H1flQn7z3UVRUrf pfMx13E+uyf7qbVZktqH38lUS4s/AHdX2PKCiXlU/0hIkiBdjbAl3ylyqMv7ytIL Fi2N9iYJN+eLlMkc3A5IK83xNiU8rb0gO6Uywn3nUbqadY/YX2gDpND5kfzRIneR lTKy4rX3+E65qYB2Ly1wDr7e0Q0rgaTzPctx6twFrxQXK+MsHiartJhM5juND/tU DEjSW9ISOHlitKEJI/zbdrvJlr5AKDNy2zHYmQQuqY6+YHRamCKqwIjLIPkKj52g lAbosNwvp/o8W3zUHgUfVZR5hVxN863zV2qa/ehoQ3b/9kNjQC8actILjYEgIVu9 av1sd+nETbjCUABIF9H9uAoRbgc+wQs2nupJZrjvginFz8+WVhgaBdJDMYCNAmjc fNMjGtRS7YXiIMj09ZAXFThVW302FdbTgggDh/qlQlDOXFu5HRbyuWR+USr4/jkP qs2G6m/BHDs9HxDRo/no+ccSrUBV5phfhZbO7qwjTf2NJJvPHW+cxGpT00zU2v8A lgfVI7SDkxwbyi1gacJ054GqEhsWuEdi40ikqxjhL8Oq4xwwsey/PiaIxjkDQx92 Gj3XUSDnGEs= =kUav -----END PGP SIGNATURE----- Merge tag 'x86-cleanups-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cleanups from Ingo Molnar: - Change global variables to local - Add missing kernel-doc function parameter descriptions - Remove unused parameter from a macro - Remove obsolete Kconfig entry - Fix comments - Fix typos, mostly scripted, manually reviewed and a micro-optimization got misplaced as a cleanup: - Micro-optimize the asm code in secondary_startup_64_no_verify() * tag 'x86-cleanups-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: arch/x86: Fix typos x86/head_64: Use TESTB instead of TESTL in secondary_startup_64_no_verify() x86/docs: Remove reference to syscall trampoline in PTI x86/Kconfig: Remove obsolete config X86_32_SMP x86/io: Remove the unused 'bw' parameter from the BUILDIO() macro x86/mtrr: Document missing function parameters in kernel-doc x86/setup: Make relocated_ramdisk a local variable of relocate_initrd() |
||
Linus Torvalds
|
c604110e66 |
vfs-6.8.misc
-----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZZUxRQAKCRCRxhvAZXjc ov/QAQDzvge3oQ9MEymmOiyzzcF+HhAXBr+9oEsYJjFc1p0TsgEA61gXjZo7F1jY KBqd6znOZCR+Waj0kIVJRAo/ISRBqQc= =0bRl -----END PGP SIGNATURE----- Merge tag 'vfs-6.8.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual miscellaneous features, cleanups, and fixes for vfs and individual fses. Features: - Add Jan Kara as VFS reviewer - Show correct device and inode numbers in proc/<pid>/maps for vma files on stacked filesystems. This is now easily doable thanks to the backing file work from the last cycles. This comes with selftests Cleanups: - Remove a redundant might_sleep() from wait_on_inode() - Initialize pointer with NULL, not 0 - Clarify comment on access_override_creds() - Rework and simplify eventfd_signal() and eventfd_signal_mask() helpers - Process aio completions in batches to avoid needless wakeups - Completely decouple struct mnt_idmap from namespaces. We now only keep the actual idmapping around and don't stash references to namespaces - Reformat maintainer entries to indicate that a given subsystem belongs to fs/ - Simplify fput() for files that were never opened - Get rid of various pointless file helpers - Rename various file helpers - Rename struct file members after SLAB_TYPESAFE_BY_RCU switch from last cycle - Make relatime_need_update() return bool - Use GFP_KERNEL instead of GFP_USER when allocating superblocks - Replace deprecated ida_simple_*() calls with their current ida_*() counterparts Fixes: - Fix comments on user namespace id mapping helpers. They aren't kernel doc comments so they shouldn't be using /** - s/Retuns/Returns/g in various places - Add missing parameter documentation on can_move_mount_beneath() - Rename i_mapping->private_data to i_mapping->i_private_data - Fix a false-positive lockdep warning in pipe_write() for watch queues - Improve __fget_files_rcu() code generation to improve performance - Only notify writer that pipe resizing has finished after setting pipe->max_usage otherwise writers are never notified that the pipe has been resized and hang - Fix some kernel docs in hfsplus - s/passs/pass/g in various places - Fix kernel docs in ntfs - Fix kcalloc() arguments order reported by gcc 14 - Fix uninitialized value in reiserfs" * tag 'vfs-6.8.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (36 commits) reiserfs: fix uninit-value in comp_keys watch_queue: fix kcalloc() arguments order ntfs: dir.c: fix kernel-doc function parameter warnings fs: fix doc comment typo fs tree wide selftests/overlayfs: verify device and inode numbers in /proc/pid/maps fs/proc: show correct device and inode numbers in /proc/pid/maps eventfd: Remove usage of the deprecated ida_simple_xx() API fs: super: use GFP_KERNEL instead of GFP_USER for super block allocation fs/hfsplus: wrapper.c: fix kernel-doc warnings fs: add Jan Kara as reviewer fs/inode: Make relatime_need_update return bool pipe: wakeup wr_wait after setting max_usage file: remove __receive_fd() file: stop exposing receive_fd_user() fs: replace f_rcuhead with f_task_work file: remove pointless wrapper file: s/close_fd_get_file()/file_close_fd()/g Improve __fget_files_rcu() code generation (and thus __fget_light()) file: massage cleanup of files that failed to open fs/pipe: Fix lockdep false-positive in watchqueue pipe_write() ... |
||
Paolo Bonzini
|
7f26fea9bc |
KVM x86 MMU changes for 6.8:
- Fix a relatively benign off-by-one error when splitting huge pages during CLEAR_DIRTY_LOG. - Fix a bug where KVM could incorrectly test-and-clear dirty bits in non-leaf TDP MMU SPTEs if a racing thread replaces a huge SPTE with a non-huge SPTE. - Relax the TDP MMU's lockdep assertions related to holding mmu_lock for read versus write so that KVM doesn't pass "bool shared" all over the place just to have precise assertions in paths that don't actually care about whether the caller is a reader or a writer. -----BEGIN PGP SIGNATURE----- iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmWW/W4SHHNlYW5qY0Bn b29nbGUuY29tAAoJEGCRIgFNDBL5pAQP/27OuHVhtcao9m8GAsm55tMEIDh2n6Ih Yb74SOQhprbaGWyx/J6QdKABzcQX86L3CvCMSy6Ssoz2FeDM8Qs2a80/0jkK0qtI TPXnOsvxh/MgEsC55LOSJZeWM+xwEqaTA2wuPAbvAcJLdAZsSLQQV1XwH6nl4l0l LhcdkUeAKqQ8DbeBvTDyz80zuWaKhQMm5DJOBM6HbqaAdmMpw/hPDtu9b654jC2R Z6Rs5590OFrR13cBAvgCBayvb54pjW9dbm8lfZ4Grcq1I+C0I7Y5jF+DYFjKuTUi N7t+3HrpAM54DWwD95qkTO73g8tMSvwKGKq+OhylZrbPjeqgyYApcUvbxgKmhgGu hBRoAS9D2goJQtFv4TehpAiJIoR6YGX5cCyIIFI50EfsWkifED/b7zUBf2jg0UnV Z7IzvGLQtMAHj6daDcgXxaLVAXQ3v4aQK7DiPuCIiFGcXvedgcGIFjZH4NaVvK2D Y9ygO8PSbN0MbGUSkaS8Urrxmx7l1tv05cardxjSVxnKEx01nYugB0vTM0/o7gSq Bs2gnf+UJrazeabqFuVBYzzuHH7QyjsiQQ0HWoQiq04v8+C5VldwQKk+g8vxLXG1 iYNXW7qgueLxYxRNleLiyfw1HIyspR09d6e67n2729Xxtay0cvL8dC68AhEenb+N fZ2lKFhnZ2Vs =3f2t -----END PGP SIGNATURE----- Merge tag 'kvm-x86-mmu-6.8' of https://github.com/kvm-x86/linux into HEAD KVM x86 MMU changes for 6.8: - Fix a relatively benign off-by-one error when splitting huge pages during CLEAR_DIRTY_LOG. - Fix a bug where KVM could incorrectly test-and-clear dirty bits in non-leaf TDP MMU SPTEs if a racing thread replaces a huge SPTE with a non-huge SPTE. - Relax the TDP MMU's lockdep assertions related to holding mmu_lock for read versus write so that KVM doesn't pass "bool shared" all over the place just to have precise assertions in paths that don't actually care about whether the caller is a reader or a writer. |
||
Paolo Bonzini
|
3115d2de39 |
KVM Xen change for 6.8:
To workaround Xen guests that don't expect Xen PV clocks to be marked as being based on a stable TSC, add a Xen config knob to allow userspace to opt out of KVM setting the "TSC stable" bit in Xen PV clocks. Note, the "TSC stable" bit was added to the PVCLOCK ABI by KVM without an ack from Xen, i.e. KVM isn't entirely blameless for the buggy guest behavior. -----BEGIN PGP SIGNATURE----- iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmWXASsSHHNlYW5qY0Bn b29nbGUuY29tAAoJEGCRIgFNDBL5R54P/iQPQBs4dJmNkPiA6uSq1O5/8hN4P59z aapJNgDiny/D9/zPbOxGWR31W7lvCgiES/lp3KcHZmwbeAwJpdT6a0cJWGRlGuov gccK8AoYcnwSU98sPisnFv7dJ66ogJfXVkPKKaWo+zVW53XUq2XpIie4eWaOweBt QsXpTGYpGajv1Bf/MgRtNtlkVAo1w8XL1L0NWRugzCk2CAYezz8IT1874GNZoJbd GJfVP+76FdNw+4/CxiaBwxP0gHfBIiAsJzGqbmMPhGG2xJn+KGs5FTEf37Pta8cl aMHAq6/JAoabJfP39MexVkopMaFlPbDwIWfkLWf6wSP86KHei+t9kLC0E4/R2NJ+ GKlrBB6Gj+gzFR4fZ75hIwS/4REMt6zVCbS7uSRrCduqrlEFcY5ED2NesoL9wZrB WMDIxIGIVDdRxc9WLypKmBj7KTgL0qXBxnsAcPiDRf1sk6SGajkesWxA1C1Nzo/H yNfqq0gjdPZVB2RIGN6DpWQFu3d+ZQnG2ToKIBW7OkvJ5USYiDSo4VozhESgYHRZ UJDhJ73QYESynClP6ST+9cxNof3FXCEPDeKr5NcmjVZxlJcdeUDNRqv0LUxQ56BI FvHMHtSs4WLYHZZVzsdh+Yhnc9rEGfoL0NwDPBCcOXjuNMvNQmuzSldc/VDGm/qt sCtxYMms5n7u =3v8F -----END PGP SIGNATURE----- Merge tag 'kvm-x86-xen-6.8' of https://github.com/kvm-x86/linux into HEAD KVM Xen change for 6.8: To workaround Xen guests that don't expect Xen PV clocks to be marked as being based on a stable TSC, add a Xen config knob to allow userspace to opt out of KVM setting the "TSC stable" bit in Xen PV clocks. Note, the "TSC stable" bit was added to the PVCLOCK ABI by KVM without an ack from Xen, i.e. KVM isn't entirely blameless for the buggy guest behavior. |
||
Paolo Bonzini
|
8c9244af4d |
KVM SVM changes for 6.8:
- Revert a bogus, made-up nested SVM consistency check for TLB_CONTROL. - Advertise flush-by-ASID support for nSVM unconditionally, as KVM always flushes on nested transitions, i.e. always satisfies flush requests. This allows running bleeding edge versions of VMware Workstation on top of KVM. - Sanity check that the CPU supports flush-by-ASID when enabling SEV support. - Fix a benign NMI virtualization bug where KVM would unnecessarily intercept IRET when manually injecting an NMI, e.g. when KVM pends an NMI and injects a second, "simultaneous" NMI. -----BEGIN PGP SIGNATURE----- iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmWW/9ESHHNlYW5qY0Bn b29nbGUuY29tAAoJEGCRIgFNDBL50PcP/Rbdf/68/g1m4JQYl8rf2h7BD4PGE5yw ZpeXSkeZmzyRYPiJjJaZLcvvezyusIPoGRfmsKgj2nI7LCSVyHDmaHVp2h854Xz8 kSWmK5znBYDx+vUqhIKEN2nwFNYSUaSqcRZWvoXi0BzalWlwCgK2yu8xeRDUhn4B +gDKlqZuJMYY1J3V8e64ZkvdxRHsw0WyvD0Ns4EgCe/2v5V9gc08a7vuSq80EtaE yf0cZmubDwuV96LfZnDkZnZpm4C1GNeLxAN1wlj7J6fAvrCAggetDtkJtWCd8yd0 0ZtfjBOMVsCDWQsYXbwGGKdeynzATxc354k6yHBIO863z+M5MtEMKlFNCclrakMO RHfofZHhL+hn3ACESJPcse3ei0VbV28cL2NFdstUEukvZQoacIH9fz7+1GuWqBpv Vb9UJDde029HHsGf+n8LtfQsqV7/8aLV+/4bpiPOHQU+tzAJVxni/H9nJ+7V0lxd NfhWME1lEsQWxpBpcXcVB7D7+ri1Wd9eB4IR9xc/VqgLE1Nj4kIZqtOJF9lbY3wk +H/Ze/MNNg6E9yIErSIv7sWdrvoOPYWZdGCT8Fhm4OILAsDEO96z7WoVF0eWCdJ1 xDIFGXNFuyIpVOqk/JZE/Lv5U1C4xhyFQCmk6gXDgepnTn4d8gx3S79iUfXD32gE GqAjV9Wwmz+o =mXEf -----END PGP SIGNATURE----- Merge tag 'kvm-x86-svm-6.8' of https://github.com/kvm-x86/linux into HEAD KVM SVM changes for 6.8: - Revert a bogus, made-up nested SVM consistency check for TLB_CONTROL. - Advertise flush-by-ASID support for nSVM unconditionally, as KVM always flushes on nested transitions, i.e. always satisfies flush requests. This allows running bleeding edge versions of VMware Workstation on top of KVM. - Sanity check that the CPU supports flush-by-ASID when enabling SEV support. - Fix a benign NMI virtualization bug where KVM would unnecessarily intercept IRET when manually injecting an NMI, e.g. when KVM pends an NMI and injects a second, "simultaneous" NMI. |
||
Paolo Bonzini
|
8ecb10bcbf |
KVM x86 support for virtualizing Linear Address Masking (LAM)
Add KVM support for Linear Address Masking (LAM). LAM tweaks the canonicality checks for most virtual address usage in 64-bit mode, such that only the most significant bit of the untranslated address bits must match the polarity of the last translated address bit. This allows software to use ignored, untranslated address bits for metadata, e.g. to efficiently tag pointers for address sanitization. LAM can be enabled separately for user pointers and supervisor pointers, and for userspace LAM can be select between 48-bit and 57-bit masking - 48-bit LAM: metadata bits 62:48, i.e. LAM width of 15. - 57-bit LAM: metadata bits 62:57, i.e. LAM width of 6. For user pointers, LAM enabling utilizes two previously-reserved high bits from CR3 (similar to how PCID_NOFLUSH uses bit 63): LAM_U48 and LAM_U57, bits 62 and 61 respectively. Note, if LAM_57 is set, LAM_U48 is ignored, i.e.: - CR3.LAM_U48=0 && CR3.LAM_U57=0 == LAM disabled for user pointers - CR3.LAM_U48=1 && CR3.LAM_U57=0 == LAM-48 enabled for user pointers - CR3.LAM_U48=x && CR3.LAM_U57=1 == LAM-57 enabled for user pointers For supervisor pointers, LAM is controlled by a single bit, CR4.LAM_SUP, with the 48-bit versus 57-bit LAM behavior following the current paging mode, i.e.: - CR4.LAM_SUP=0 && CR4.LA57=x == LAM disabled for supervisor pointers - CR4.LAM_SUP=1 && CR4.LA57=0 == LAM-48 enabled for supervisor pointers - CR4.LAM_SUP=1 && CR4.LA57=1 == LAM-57 enabled for supervisor pointers The modified LAM canonicality checks: - LAM_S48 : [ 1 ][ metadata ][ 1 ] 63 47 - LAM_U48 : [ 0 ][ metadata ][ 0 ] 63 47 - LAM_S57 : [ 1 ][ metadata ][ 1 ] 63 56 - LAM_U57 + 5-lvl paging : [ 0 ][ metadata ][ 0 ] 63 56 - LAM_U57 + 4-lvl paging : [ 0 ][ metadata ][ 0...0 ] 63 56..47 The bulk of KVM support for LAM is to emulate LAM's modified canonicality checks. The approach taken by KVM is to "fill" the metadata bits using the highest bit of the translated address, e.g. for LAM-48, bit 47 is sign-extended to bits 62:48. The most significant bit, 63, is *not* modified, i.e. its value from the raw, untagged virtual address is kept for the canonicality check. This untagging allows Aside from emulating LAM's canonical checks behavior, LAM has the usual KVM touchpoints for selectable features: enumeration (CPUID.7.1:EAX.LAM[bit 26], enabling via CR3 and CR4 bits, etc. -----BEGIN PGP SIGNATURE----- iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmWW+k4SHHNlYW5qY0Bn b29nbGUuY29tAAoJEGCRIgFNDBL5KygQAKTSEmfdox6MSYzGVzAVHBD/8oSTZAGf 4l96Np3sZiX0ujWP7aW1GaIdGL27Yf1bQrKIrODR4xepaosVPpoZZbnLFQ4Jm16D OuwEQL06LV91Lv5XuPkNdq3nMVi1X3wjiKLvP451oCGv8JdxsjXSlFr8ZmDoCfmS NCjkPyitdK+/xOMY5WcrkHD/6VMMiM+5A+CrG7DkaTaqBJQSUXG1NvTKhhxey6Rq OZv0GPv7QVMhHv1NX0Y3LyoiGyWXAoFRnbk/N3yVBOnXcpJ+HBwWiNLRpxmZOQj/ CTo0VvUH/ZkN6zGvAb75/9puFHNliA/QCW1hp+ShXnNdn1eNdS7nhhPrzVqtCTy2 QeNWM/z5v9Wa1norPqDxzqWlh2bWW8JU0soX7Q+quN0d7YjVvmmUluL3Lw/V2zmb gFM2ZY43QHlmLVic4sSraK1LEcYFzjexzpTLhee2gNp+l2y0D0c1/hXukCk6YNUM gad9DH8P9d7By7Eyr0ZaPHSJbuBW1PqZhot5gCg9nCn4pnT2/y7wXsLj6VAw8gdr dWNu2MZWDuH0/d4aKfw2veAECbHUK2daok4ufPDj5nYLVVWCs4HU0U7HlYL2CX7/ TdWOCwtpFtKoN1NHz8mpET7xldxLPnFkByL+SxypTZurAZXoSnEG71IbO5pJ2iIf wHQkXgM+XimA =qUZ2 -----END PGP SIGNATURE----- Merge tag 'kvm-x86-lam-6.8' of https://github.com/kvm-x86/linux into HEAD KVM x86 support for virtualizing Linear Address Masking (LAM) Add KVM support for Linear Address Masking (LAM). LAM tweaks the canonicality checks for most virtual address usage in 64-bit mode, such that only the most significant bit of the untranslated address bits must match the polarity of the last translated address bit. This allows software to use ignored, untranslated address bits for metadata, e.g. to efficiently tag pointers for address sanitization. LAM can be enabled separately for user pointers and supervisor pointers, and for userspace LAM can be select between 48-bit and 57-bit masking - 48-bit LAM: metadata bits 62:48, i.e. LAM width of 15. - 57-bit LAM: metadata bits 62:57, i.e. LAM width of 6. For user pointers, LAM enabling utilizes two previously-reserved high bits from CR3 (similar to how PCID_NOFLUSH uses bit 63): LAM_U48 and LAM_U57, bits 62 and 61 respectively. Note, if LAM_57 is set, LAM_U48 is ignored, i.e.: - CR3.LAM_U48=0 && CR3.LAM_U57=0 == LAM disabled for user pointers - CR3.LAM_U48=1 && CR3.LAM_U57=0 == LAM-48 enabled for user pointers - CR3.LAM_U48=x && CR3.LAM_U57=1 == LAM-57 enabled for user pointers For supervisor pointers, LAM is controlled by a single bit, CR4.LAM_SUP, with the 48-bit versus 57-bit LAM behavior following the current paging mode, i.e.: - CR4.LAM_SUP=0 && CR4.LA57=x == LAM disabled for supervisor pointers - CR4.LAM_SUP=1 && CR4.LA57=0 == LAM-48 enabled for supervisor pointers - CR4.LAM_SUP=1 && CR4.LA57=1 == LAM-57 enabled for supervisor pointers The modified LAM canonicality checks: - LAM_S48 : [ 1 ][ metadata ][ 1 ] 63 47 - LAM_U48 : [ 0 ][ metadata ][ 0 ] 63 47 - LAM_S57 : [ 1 ][ metadata ][ 1 ] 63 56 - LAM_U57 + 5-lvl paging : [ 0 ][ metadata ][ 0 ] 63 56 - LAM_U57 + 4-lvl paging : [ 0 ][ metadata ][ 0...0 ] 63 56..47 The bulk of KVM support for LAM is to emulate LAM's modified canonicality checks. The approach taken by KVM is to "fill" the metadata bits using the highest bit of the translated address, e.g. for LAM-48, bit 47 is sign-extended to bits 62:48. The most significant bit, 63, is *not* modified, i.e. its value from the raw, untagged virtual address is kept for the canonicality check. This untagging allows Aside from emulating LAM's canonical checks behavior, LAM has the usual KVM touchpoints for selectable features: enumeration (CPUID.7.1:EAX.LAM[bit 26], enabling via CR3 and CR4 bits, etc. |
||
Paolo Bonzini
|
01edb1cfbd |
KVM x86 PMU changes for 6.8:
- Fix a variety of bugs where KVM fail to stop/reset counters and other state prior to refreshing the vPMU model. - Fix a double-overflow PMU bug by tracking emulated counter events using a dedicated field instead of snapshotting the "previous" counter. If the hardware PMC count triggers overflow that is recognized in the same VM-Exit that KVM manually bumps an event count, KVM would pend PMIs for both the hardware-triggered overflow and for KVM-triggered overflow. -----BEGIN PGP SIGNATURE----- iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmWW/rsSHHNlYW5qY0Bn b29nbGUuY29tAAoJEGCRIgFNDBL5Q8gQAJc4y9NOd09kYXpI+DhkTVe6v07dmYds NzBI2uViqxXFwA5pTs5VTVVYAl1FEmK6NvIVnJdc3epSYRSqyaeN/Z2NoulNxekj /jLA/aA4+dTeJf2lfMFeH65IIuSJhuhyGeZV31RfW3NzEmlglcsb74QkHnJB8rLQ RFJXZcOxSSap72AWxKmxk0alRaI6ONZ9NyqOWFWjZdQuAE7id9Ae5OixKUrlJkmR 6CbY8ra51MFIXQEsomVlcl5b1DNiv0drPPf5YaC9T4CERtt5yZxpvZeTPhq70evm OutoZpzfi69cF1fFCxqN5cWZSt1C/Bu3xp8+ILI1+bZkMCV/ty85DU6hfMZQZzcV JeJkRg/AAgOrG4dtHskwg9LDMs867kgbaqZ8l8K7Dt8rGmcLc5/rZ1ZdjTStFj6V ukmVKMAVgkmh88u62wQ5HjrN1IE1oE6nmDp3zivfPuohEr49A8mAT02A2x9AVxAr HvmwfDMA92xOGSRAN9Gt0mbOA+G0WZe4A36XgPEXloYeskYZgHzgW2hT6VWTd86O ydU9s4L8g+Fy4jcObAiKsT8YwFgAMfVXZKTXvuTME4m/WUNBCrYCwqEOp/NM5qrk qYWVXxOMMjZo71tQfvSPu1TWCtW/4ckvmqMrdQosgwLFy5pSqgXEwTruDvbJ1KWU KhIWVbUfmgFA =+Emh -----END PGP SIGNATURE----- Merge tag 'kvm-x86-pmu-6.8' of https://github.com/kvm-x86/linux into HEAD KVM x86 PMU changes for 6.8: - Fix a variety of bugs where KVM fail to stop/reset counters and other state prior to refreshing the vPMU model. - Fix a double-overflow PMU bug by tracking emulated counter events using a dedicated field instead of snapshotting the "previous" counter. If the hardware PMC count triggers overflow that is recognized in the same VM-Exit that KVM manually bumps an event count, KVM would pend PMIs for both the hardware-triggered overflow and for KVM-triggered overflow. |
||
Paolo Bonzini
|
33d0403fda |
KVM x86 misc changes for 6.8:
- Turn off KVM_WERROR by default for all configs so that it's not inadvertantly enabled by non-KVM developers, which can be problematic for subsystems that require no regressions for W=1 builds. - Advertise all of the host-supported CPUID bits that enumerate IA32_SPEC_CTRL "features". - Don't force a masterclock update when a vCPU synchronizes to the current TSC generation, as updating the masterclock can cause kvmclock's time to "jump" unexpectedly, e.g. when userspace hotplugs a pre-created vCPU. - Use RIP-relative address to read kvm_rebooting in the VM-Enter fault paths, partly as a super minor optimization, but mostly to make KVM play nice with position independent executable builds. -----BEGIN PGP SIGNATURE----- iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmWW+7sSHHNlYW5qY0Bn b29nbGUuY29tAAoJEGCRIgFNDBL5/pwQAL8jIapIWP54VWxWlcTZFtCptGSobGlv cBS4L091/bYuMB/jO0pPtD+apzsYt3WmJ+tRsNA7Yctzh9BDE3XxbV7pKVIUpz9P TLCtYU2hPzp3vC6WCryjtU0OHxEnYMGHE1RCB7/bRblz+q6td7+MLZHcEUdwv83l 3pVM5+tNyQBog40frEVf+z7wrXzz2FgnauJn70X1UUs40VuiTzi6FqfLn6QK95xQ 8QPpjGFep7wQ6RgC4cPKiWSaP5PypCCpr4lMSKrKAf4iaKJdO1CYxEPeu0LcyFhR DUM3zb+AZ/FVrisRWUnjke4Epb87ikoMQBlflrI9+o4cNJQaxEHAzTMGO+u4oucy KwnXtNYM3lKGvDEvoUSBDphNayzcchn+0qk8YKB+XvClYSOtGi+NsWUB4x+M6crM 960cidF/CzYZL/IDj9GW2Tb+IiPJarmazdbqDmMpQiAKz0KE3tezGiysB6d6VJs1 V+KWOaSzAT9GsBKvGnPDHQaZ20vK+YsGB/TMWvpg3rFLTyV5QFM17UNdXyJlX0g8 G0v+gf7j3MKm156H2yYW0XhIAfhstc1Xb8fTDQjJ3pZn6us2NAtFgnrIpbL31Z7E yaSgZuxetswbNwVSECUGlH4/zAtQudBfAt837Nu4eSCjMrJE4SPrrwpbTqp0SPXd 1VZbGc70QFf7 =O4hV -----END PGP SIGNATURE----- Merge tag 'kvm-x86-misc-6.8' of https://github.com/kvm-x86/linux into HEAD KVM x86 misc changes for 6.8: - Turn off KVM_WERROR by default for all configs so that it's not inadvertantly enabled by non-KVM developers, which can be problematic for subsystems that require no regressions for W=1 builds. - Advertise all of the host-supported CPUID bits that enumerate IA32_SPEC_CTRL "features". - Don't force a masterclock update when a vCPU synchronizes to the current TSC generation, as updating the masterclock can cause kvmclock's time to "jump" unexpectedly, e.g. when userspace hotplugs a pre-created vCPU. - Use RIP-relative address to read kvm_rebooting in the VM-Enter fault paths, partly as a super minor optimization, but mostly to make KVM play nice with position independent executable builds. |
||
Paolo Bonzini
|
0afdfd85e3 |
KVM x86 Hyper-V changes for 6.8:
- Guard KVM-on-HyperV's range-based TLB flush hooks with an #ifdef on CONFIG_HYPERV as a minor optimization, and to self-document the code. - Add CONFIG_KVM_HYPERV to allow disabling KVM support for HyperV "emulation" at build time. -----BEGIN PGP SIGNATURE----- iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmWW8gYSHHNlYW5qY0Bn b29nbGUuY29tAAoJEGCRIgFNDBL5sGUP/iadHMz7Up1X29IDGtq58LRORNVXp2Ln 2dqoj8IKZeSr+mPMw2GvZyuiLqVPMs4Et21WJfCO7HgKd/NPMDORwRndhJYweFRY yk+5NJLvXYuo8UR3b2QYy8XUghEqP+j5eYyon6UdCiPACcBGTpgoj4pU7SLM7l4T EOge42ya5YxD/1oWr5vyifNrOJCPNTBYcC0as5//+RdnmQYqYZ26Z73b0B8Pdct4 XMWwgoKlmLTmei0YntXtGaDGimCvTYP8EPM4tOWgiBSWMhQXWbAh/0biDfd3eZVO Hoe4HvstdjUNbpO3h3Zo78Ob7ehk4kx/6r0nlQnz5JxzGnuDjYCDIVUlYn0mw5Yi nu4ztr8M3VRksDbpmAjSO9XFEKIYxlYQfzZ1UuTy8ehdBYTDl/3lPAbh2ApUYE72 Tt2PXmFGz2j1sjG38Gh94s48Za5OxHoVlfq8iGhU4v7UjuxnMNHfExOWd66SwZgx 5tZkr4rj/pWt21wr7jaVqFGzuftIC5G4ZEBhh7JcW89oamFrykgQUu5z4dhBMO75 G7DAVh9eSH2SKkmJH1ClXriveazTK7fqMx8sZzzRnusMz09qH7SIdjSzmp7H5utw pWBfatft0n0FTI1r+hxGueiJt7dFlrIz0Q4hHyBN4saoVH121bZioc0pq1ob6MIk Y2Ou4xJBt14F =bjfs -----END PGP SIGNATURE----- Merge tag 'kvm-x86-hyperv-6.8' of https://github.com/kvm-x86/linux into HEAD KVM x86 Hyper-V changes for 6.8: - Guard KVM-on-HyperV's range-based TLB flush hooks with an #ifdef on CONFIG_HYPERV as a minor optimization, and to self-document the code. - Add CONFIG_KVM_HYPERV to allow disabling KVM support for HyperV "emulation" at build time. |
||
Paolo Bonzini
|
fb872da8e7 |
Common KVM changes for 6.8:
- Use memdup_array_user() to harden against overflow. - Unconditionally advertise KVM_CAP_DEVICE_CTRL for all architectures. -----BEGIN PGP SIGNATURE----- iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmWW8F4SHHNlYW5qY0Bn b29nbGUuY29tAAoJEGCRIgFNDBL5urcP/Rex6Too26aHJXelUVHlFOGw3hfOnvbq Wr/P3kPqB/1Mncx3aiYTpEvUxFjVTvIkMB5dWba39Eq/G1BbOT2CAHCunlvKJrXy L83YgOl17QtZZJS1KmLTRCj1umfl4Z0c+GEIH+P1FOuOmllNXlLJ1+GWmolP6LLf u4DF2/tyVZf8JXXeJWYITHsU0YQQ0MhHgYL8/aMYJK8epNFpR3wKIqT3428ASxV3 Ru4WH7jpYkFF7PaKbvjKdepr+1wyVt4PXJDDpciCScz45/8eebgfylLJbMglpsR1 JSUTzd6KdCbekgzp51NnRdoIxP+MXgKA3dIuzXKyIDzm2Xq6tna87ve/aWDGw8JC nUMkP/vAuaKT+/QTOwskGAvK2GYDQD1UwVcFNLi12Iis50H0qPwcxsUionQuZgUC ykCmY4N31rSX4DhPg1WLiqsvC/EeDhfXprYrfSd4HQq08NgD45orRJw0Kov+shcS xijIlE1e3aVJMRrbfoSWyc4m79AcooxjYwojQC1Ayqsq0ZTTzzIpd6rqjmY+LbLL aP/wNz8hCfMhFekUV7dDk9rMdZY+bBnTiolyKAN66E6EnPYfl2EdrDEGnZOCPXF4 L/O/kMCXHE90cszzrmiR40yNHLkPelij8sK+ligE4JpqteQ7ia/knh8YAiPBxDw6 XcIfftXMm5XG =wpT4 -----END PGP SIGNATURE----- Merge tag 'kvm-x86-generic-6.8' of https://github.com/kvm-x86/linux into HEAD Common KVM changes for 6.8: - Use memdup_array_user() to harden against overflow. - Unconditionally advertise KVM_CAP_DEVICE_CTRL for all architectures. |