linux

korg/linux

mirror of https://mirrors.bfsu.edu.cn/git/linux.git synced 2024-11-16 08:44:21 +08:00

Author	SHA1	Message	Date
Josh Poimboeuf	4208d2d798	x86/head: Mark *_start_kernel() __noreturn Now that start_kernel() is __noreturn, mark its chain of callers __noreturn. Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/c2525f96b88be98ee027ee0291d58003036d4120.1681342859.git.jpoimboe@kernel.org	2023-04-14 17:31:24 +02:00
Josh Poimboeuf	4a2c3448ed	x86/linkage: Fix padding for typed functions CFI typed functions are failing to get padded properly for CONFIG_CALL_PADDING. Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/721f0da48d2a49fe907225711b8b76a2b787f9a8.1681331135.git.jpoimboe@kernel.org	2023-04-14 16:08:30 +02:00
Joerg Roedel	e51b419839	Merge branches 'iommu/fixes', 'arm/allwinner', 'arm/exynos', 'arm/mediatek', 'arm/omap', 'arm/renesas', 'arm/rockchip', 'arm/smmu', 'ppc/pamu', 'unisoc', 'x86/vt-d', 'x86/amd', 'core' and 'platform-remove_new' into next	2023-04-14 13:45:50 +02:00
Nick Alcock	569e4d25f4	x86/mm/dump_pagetables: remove MODULE_LICENSE in non-modules Since commit `8b41fc4454` ("kbuild: create modules.builtin without Makefile.modbuiltin or tristate.conf"), MODULE_LICENSE declarations are used to identify modules. As a consequence, uses of the macro in non-modules will cause modprobe to misidentify their containing object file as a module when it is not (false positives), and modprobe might succeed rather than failing with a suitable error message. So remove it in the files in this commit, none of which can be built as modules. Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Suggested-by: Luis Chamberlain <mcgrof@kernel.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: linux-modules@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: Hitomi Hasegawa <hasegawa-hitomi@fujitsu.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: x86@kernel.org Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>	2023-04-13 13:13:54 -07:00
Nick Alcock	b71e2a69d1	crypto: blake2s: remove module_init and module.h inclusion Now this can no longer be built as a module, drop all remaining module-related code as well. Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Suggested-by: Herbert Xu <herbert@gondor.apana.org.au> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: linux-modules@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: Hitomi Hasegawa <hasegawa-hitomi@fujitsu.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: "David S. Miller" <davem@davemloft.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: x86@kernel.org Cc: linux-crypto@vger.kernel.org Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>	2023-04-13 13:13:51 -07:00
Nick Alcock	41a98c68f2	crypto: remove MODULE_LICENSE in non-modules Since commit `8b41fc4454` ("kbuild: create modules.builtin without Makefile.modbuiltin or tristate.conf"), MODULE_LICENSE declarations are used to identify modules. As a consequence, uses of the macro in non-modules will cause modprobe to misidentify their containing object file as a module when it is not (false positives), and modprobe might succeed rather than failing with a suitable error message. So remove it in the files in this commit, none of which can be built as modules. Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Suggested-by: Luis Chamberlain <mcgrof@kernel.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: linux-modules@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: Hitomi Hasegawa <hasegawa-hitomi@fujitsu.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: "David S. Miller" <davem@davemloft.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: x86@kernel.org Cc: linux-crypto@vger.kernel.org Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>	2023-04-13 13:13:51 -07:00
Matija Glavinic Pecotic	775d3c514c	x86/rtc: Remove __init for runtime functions set_rtc_noop(), get_rtc_noop() are after booting, therefore their __init annotation is wrong. A crash was observed on an x86 platform where CMOS RTC is unused and disabled via device tree. set_rtc_noop() was invoked from ntp: sync_hw_clock(), although CONFIG_RTC_SYSTOHC=n, however sync_cmos_clock() doesn't honour that. Workqueue: events_power_efficient sync_hw_clock RIP: 0010:set_rtc_noop Call Trace: update_persistent_clock64 sync_hw_clock Fix this by dropping the __init annotation from set/get_rtc_noop(). Fixes: `c311ed6183` ("x86/init: Allow DT configured systems to disable RTC at boot time") Signed-off-by: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nokia.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://lore.kernel.org/r/59f7ceb1-446b-1d3d-0bc8-1f0ee94b1e18@nokia.com	2023-04-13 14:41:04 +02:00
Saurabh Sengar	5af507bef9	x86/ioapic: Don't return 0 from arch_dynirq_lower_bound() arch_dynirq_lower_bound() is invoked by the core interrupt code to retrieve the lowest possible Linux interrupt number for dynamically allocated interrupts like MSI. The x86 implementation uses this to exclude the IO/APIC GSI space. This works correctly as long as there is an IO/APIC registered, but returns 0 if not. This has been observed in VMs where the BIOS does not advertise an IO/APIC. 0 is an invalid interrupt number except for the legacy timer interrupt on x86. The return value is unchecked in the core code, so it ends up to allocate interrupt number 0 which is subsequently considered to be invalid by the caller, e.g. the MSI allocation code. The function has already a check for 0 in the case that an IO/APIC is registered, as ioapic_dynirq_base is 0 in case of device tree setups. Consolidate this and zero check for both ioapic_dynirq_base and gsi_top, which is used in the case that no IO/APIC is registered. Fixes: `3e5bedc2c2` ("x86/apic: Fix arch_dynirq_lower_bound() bug for DT enabled machines") Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/1679988604-20308-1-git-send-email-ssengar@linux.microsoft.com	2023-04-12 17:45:50 +02:00
Linus Torvalds	e62252bc55	pci-v6.3-fixes-2 -----BEGIN PGP SIGNATURE----- iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAmQ1qFYUHGJoZWxnYWFz QGdvb2dsZS5jb20ACgkQWYigwDrT+vxRJA/+JnDrZ8Ag8sIGgI7meuqWcLe3dHXQ 7KaOGM/MP0ir8qJwiobuon2ml1lEU+iWZEcpFH9b9b5iurAfbvZkPhFAWZzJG3GW B6qEX5zaK4tNJHC8byK4pSat2pJLg8Q6xwA6c3H9264tj4Dc83dg2Y8hWWozbU2l xFaawRexCus7Xzd2VEfskwOpHQpSYkWCDtTFclIt1iZeelj6an/0FVdvYqvvIXi+ QZjS80tMeZIcKJq06023KJX36lZHf0RGzYzQTyr7asiNnXBnQ1DfYIASwx1Nf6Y3 9PpvoDVbUGnLDAB8THcOM9j9UOrhlbOJ+4b60J0b4nsT0ORn4rCrN05K0DIIXZi0 CsP3qR2l04xYZFElWrvnA30KS76G7vq0OeAGpQsE0nB1PgydXuiXk5lniQhsrOuj xRiY7YeDFwiNzpjWsVtqkANvl6tA53ZUn+AMEPfIpdYbjnbLcSSF/qY3WCeKoftx KUg6fVOoXbbSctK5rXQnj2yhwQZSkm3DrG9Ol2i3E524vcMKs28QBBuUN+gtL3TP aD6nYP2Q0gItYAQCMtJHCBzfstrO5l9JSLM6xUHR/GwiaLCC9+QRdLJFcAxZvZJi buGs09B2Xz8AbgQ3AejEw7UXk+4JP19ZLyznBhLaaQQO9YOvDYlcn3Xw1QGvJGY/ qsYKElgrLWSz+50= =2gjK -----END PGP SIGNATURE----- Merge tag 'pci-v6.3-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci Pull pci fixes from Bjorn Helgaas: - Provide pci_msix_can_alloc_dyn() stub when CONFIG_PCI_MSI unset to avoid build errors (Reinette Chatre) - Quirk AMD XHCI controller that loses MSI-X state in D3hot to avoid broken USB after hotplug or suspend/resume (Basavaraj Natikar) - Fix use-after-free in pci_bus_release_domain_nr() (Rob Herring) * tag 'pci-v6.3-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci: PCI: Fix use-after-free in pci_bus_release_domain_nr() x86/PCI: Add quirk for AMD XHCI controller that loses MSI-X state in D3hot PCI/MSI: Provide missing stub for pci_msix_can_alloc_dyn()	2023-04-11 11:59:49 -07:00
Ron Lee	606012ddde	PCI: Fix up L1SS capability for Intel Apollo Lake Root Port On Google Coral and Reef family Chromebooks with Intel Apollo Lake SoC, firmware clobbers the header of the L1 PM Substates capability and the previous capability when returning from D3cold to D0. Save those headers at enumeration-time and restore them at resume. [bhelgaas: The main benefit is to make the lspci output after resume correct. Apparently there's little or no effect on power consumption.] Link: https://lore.kernel.org/linux-pci/CAFJ_xbq0cxcH-cgpXLU4Mjk30+muWyWm1aUZGK7iG53yaLBaQg@mail.gmail.com/T/#u Link: https://lore.kernel.org/r/20230411160213.4453-1-ron.lee@intel.com Signed-off-by: Ron Lee <ron.lee@intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>	2023-04-11 13:06:06 -05:00
Sean Christopherson	55cd57b596	KVM: x86: Filter out XTILE_CFG if XTILE_DATA isn't permitted Filter out XTILE_CFG from the supported XCR0 reported to userspace if the current process doesn't have access to XTILE_DATA. Attempting to set XTILE_CFG in XCR0 will #GP if XTILE_DATA is also not set, and so keeping XTILE_CFG as supported results in explosions if userspace feeds KVM_GET_SUPPORTED_CPUID back into KVM and the guest doesn't sanity check CPUID. Fixes: `445ecdf79b` ("kvm: x86: Exclude unpermitted xfeatures at KVM_GET_SUPPORTED_CPUID") Reported-by: Aaron Lewis <aaronlewis@google.com> Reviewed-by: Aaron Lewis <aaronlewis@google.com> Tested-by: Aaron Lewis <aaronlewis@google.com> Link: https://lore.kernel.org/r/20230405004520.421768-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-04-11 10:19:03 -07:00
Aaron Lewis	6be3ae45f5	KVM: x86: Add a helper to handle filtering of unpermitted XCR0 features Add a helper, kvm_get_filtered_xcr0(), to dedup code that needs to account for XCR0 features that require explicit opt-in on a per-process basis. In addition to documenting when KVM should/shouldn't consult xstate_get_guest_group_perm(), the helper will also allow sanitizing the filtered XCR0 to avoid enumerating architecturally illegal XCR0 values, e.g. XTILE_CFG without XTILE_DATA. No functional changes intended. Signed-off-by: Aaron Lewis <aaronlewis@google.com> Reviewed-by: Mingwei Zhang <mizhang@google.com> [sean: rename helper, move to x86.h, massage changelog] Reviewed-by: Aaron Lewis <aaronlewis@google.com> Tested-by: Aaron Lewis <aaronlewis@google.com> Link: https://lore.kernel.org/r/20230405004520.421768-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-04-11 10:19:03 -07:00
Sean Christopherson	4984563823	KVM: nVMX: Emulate NOPs in L2, and PAUSE if it's not intercepted Extend VMX's nested intercept logic for emulated instructions to handle "pause" interception, in quotes because KVM's emulator doesn't filter out NOPs when checking for nested intercepts. Failure to allow emulation of NOPs results in KVM injecting a #UD into L2 on any NOP that collides with the emulator's definition of PAUSE, i.e. on all single-byte NOPs. For PAUSE itself, honor L1's PAUSE-exiting control, but ignore PLE to avoid unnecessarily injecting a #UD into L2. Per the SDM, the first execution of PAUSE after VM-Entry is treated as the beginning of a new loop, i.e. will never trigger a PLE VM-Exit, and so L1 can't expect any given execution of PAUSE to deterministically exit. ... the processor considers this execution to be the first execution of PAUSE in a loop. (It also does so for the first execution of PAUSE at CPL 0 after VM entry.) All that said, the PLE side of things is currently a moot point, as KVM doesn't expose PLE to L1. Note, vmx_check_intercept() is still wildly broken when L1 wants to intercept an instruction, as KVM injects a #UD instead of synthesizing a nested VM-Exit. That issue extends far beyond NOP/PAUSE and needs far more effort to fix, i.e. is a problem for the future. Fixes: `07721feee4` ("KVM: nVMX: Don't emulate instructions in guest mode") Cc: Mathias Krause <minipli@grsecurity.net> Cc: stable@vger.kernel.org Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20230405002359.418138-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-04-11 09:35:49 -07:00
Sean Christopherson	cf9f4c0eb1	KVM: x86/mmu: Refresh CR0.WP prior to checking for emulated permission faults Refresh the MMU's snapshot of the vCPU's CR0.WP prior to checking for permission faults when emulating a guest memory access and CR0.WP may be guest owned. If the guest toggles only CR0.WP and triggers emulation of a supervisor write, e.g. when KVM is emulating UMIP, KVM may consume a stale CR0.WP, i.e. use stale protection bits metadata. Note, KVM passes through CR0.WP if and only if EPT is enabled as CR0.WP is part of the MMU role for legacy shadow paging, and SVM (NPT) doesn't support per-bit interception controls for CR0. Don't bother checking for EPT vs. NPT as the "old == new" check will always be true under NPT, i.e. the only cost is the read of vcpu->arch.cr4 (SVM unconditionally grabs CR0 from the VMCB on VM-Exit). Reported-by: Mathias Krause <minipli@grsecurity.net> Link: https://lkml.kernel.org/r/677169b4-051f-fcae-756b-9a3e1bb9f8fe%40grsecurity.net Fixes: `fb509f76ac` ("KVM: VMX: Make CR0.WP a guest owned bit") Tested-by: Mathias Krause <minipli@grsecurity.net> Link: https://lore.kernel.org/r/20230405002608.418442-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-04-10 15:25:36 -07:00
Sean Christopherson	9ed3bf4112	KVM: x86/mmu: Move filling of Hyper-V's TLB range struct into Hyper-V code Refactor Hyper-V's range-based TLB flushing API to take a gfn+nr_pages pair instead of a struct, and bury said struct in Hyper-V specific code. Passing along two params generates much better code for the common case where KVM is _not_ running on Hyper-V, as forwarding the flush on to Hyper-V's hv_flush_remote_tlbs_range() from kvm_flush_remote_tlbs_range() becomes a tail call. Cc: David Matlack <dmatlack@google.com> Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20230405003133.419177-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-04-10 15:17:29 -07:00
Sean Christopherson	8a1300ff95	KVM: x86: Rename Hyper-V remote TLB hooks to match established scheme Rename the Hyper-V hooks for TLB flushing to match the naming scheme used by all the other TLB flushing hooks, e.g. in kvm_x86_ops, vendor code, arch hooks from common code, etc. Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20230405003133.419177-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-04-10 15:17:29 -07:00
Linus Torvalds	411eb01410	This pull request contains the following bug fix for UML: - Build regression with older gcc versions -----BEGIN PGP SIGNATURE----- iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAmQzG18WHHJpY2hhcmRA c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wQiRD/9KhMHHduPimvPF8RPpUj6aOMjZ XM8qRw1xIy+Jq7YcMchyx5Xq75vG+VSFDDsn2dZUe36469BPtzT/2vRfuhDHoLSZ g8446Omi567xwdyMbIB7UdZQQd3+HVaPTSaMhIV6aYhz2IU3/zx4g8frS39/zC1q NUdiyBTEOMl6vJi1lmEq8F9XFlb4N/7CuEGhQA9TQmPbcJ8PM1BqBtnf99gsvqgl G3ap+hZPnJ3sqXDR2HIwvs66RQMXQwOoZ0NW1FUygEsEmFkYVXSZPzbXEi3oG/Kd l5cSGIBNVPY53hD9bU5bSqxZu1jWlV5S3A7UDNQ0SZPnK3Eh8yOvHfdz0X8BSXKd +f/8EJmsHmJcKanQ4sWKUBKUA6q+8KdjDsfJnYzBvyiUbeHlfejYf2AFuo3kavtM 4erCLhgGw6hAXgNuEzygABwtMlnLaQ1dPARntPJPeG3G5bpGA05m9aqVNZnoRcWj 2FO53tFSJVJRmHfIbYkK9n22znRniA/gx63HnQKcmOCmlItDm/GdChfyJY+eSJ61 VyBm7tVg23A97oEtr0SlS+/iFp+LlxnP+g+MMxLC9P2//Alm9wkMoQ5CT7kAU9O5 bHlW7c8w1dDFsENXyrR92RC/dls4C7wNUnw3V2qPDz2rDySbQHt2ILM9lWegMxSG wNdhjmC5xRZ0anVQag== =lVvD -----END PGP SIGNATURE----- Merge tag 'uml-for-linus-6.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux Pull UML fix from Richard Weinberger: - Build regression fix for older gcc versions * tag 'uml-for-linus-6.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux: um: Only disable SSE on clang to work around old GCC bugs	2023-04-10 13:13:33 -07:00
Linus Torvalds	4ba115e269	- Add a new Intel Arrow Lake CPU model number - Fix a confusion about how to check the version of the ACPI spec which supports a "online capable" bit in the MADT table which lead to a bunch of boot breakages with Zen1 systems and VMs -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmQymF8ACgkQEsHwGGHe VUr2cQ//YjJTZpeZ0/azfqz3IvUEW9etR8NXFfBOiATjk6BfZPtjZizdwZ2GtFtk /P3XPDxa34dW4ifLBCHdGpabYF5OWUj44hB2nyf3PxDSGLt6EjLtnxQm+WtobwWh uQ8BBA6QFw0kHPqH7TI5xA/Z4m6AvIBKZpAsG6GG/K3jVLYUgNoCSxplV7OvC0/M gobjMLdoofOXvuevgX6K311YiZY4hniVUSSNdOy4oDNwDFSguqVwr3wjVC7Hod8t rl5EzNcabhMmy4Hf5OmCcU1WQdM+XS5GbulPW0PMliuvFE+2ImqVyO9G41MG+Gw0 2qwhTsCA7R/Rm08eJh1ad5DmHdXc7M1ChC5s5ZTU25/9DDvRKxI7cHmIZxOxXCgy WJbP6LRw/9M7pBHatDjUTV7DdZnOkdsUK33dKWhiNZCB08EOVhcOVDq7EslMageH YhRmjpa2qfDpGOQpzwOZLgOpN3cz713QI7MBGFu0CBtbUR6ZKagQ0R/7As/vxuUo t5S8zMc/m+ra6til3MrP9QQvA0FA6Jo6H1DAZmglShtveO0eAGJCWaVbV8xppAIG DNGxk96OLJnigH0s0nAMHRQkJYM0yyKO02YkGbjGZhtLP53vDIiBqPD2PN5UJLO3 U4ucZmmet/iExApEvbTbQ7uWfCEHLLDc3Kidh0fdoTDIC5JKu2U= =Nk8H -----END PGP SIGNATURE----- Merge tag 'x86_urgent_for_v6.3_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - Add a new Intel Arrow Lake CPU model number - Fix a confusion about how to check the version of the ACPI spec which supports a "online capable" bit in the MADT table which lead to a bunch of boot breakages with Zen1 systems and VMs * tag 'x86_urgent_for_v6.3_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/cpu: Add model number for Intel Arrow Lake processor x86/acpi/boot: Correct acpi_is_processor_usable() check x86/ACPI/boot: Use FADT version to check support for online capable	2023-04-09 10:00:16 -07:00
Bjorn Helgaas	7982722ff7	x86/kexec: remove unnecessary arch_kexec_kernel_image_load() Patch series "kexec: Remove unnecessary arch hook", v2. There are no arch-specific things in arch_kexec_kernel_image_load(), so remove it and just use the generic version. This patch (of 2): The x86 implementation of arch_kexec_kernel_image_load() is functionally identical to the generic arch_kexec_kernel_image_load(): arch_kexec_kernel_image_load # x86 if (!image->fops \|\| !image->fops->load) return ERR_PTR(-ENOEXEC); return image->fops->load(image, image->kernel_buf, ...) arch_kexec_kernel_image_load # generic kexec_image_load_default if (!image->fops \|\| !image->fops->load) return ERR_PTR(-ENOEXEC); return image->fops->load(image, image->kernel_buf, ...) Remove the x86-specific version and use the generic arch_kexec_kernel_image_load(). No functional change intended. Link: https://lkml.kernel.org/r/20230307224416.907040-1-helgaas@kernel.org Link: https://lkml.kernel.org/r/20230307224416.907040-2-helgaas@kernel.org Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Acked-by: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-04-08 13:45:38 -07:00
Alexey Dobriyan	70e79866ab	ELF: fix all "Elf" typos ELF is acronym and therefore should be spelled in all caps. I left one exception at Documentation/arm/nwfpe/nwfpe.rst which looks like being written in the first person. Link: https://lkml.kernel.org/r/Y/3wGWQviIOkyLJW@p183 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-04-08 13:45:37 -07:00
Alyssa Ross	d83806c4c0	purgatory: fix disabling debug info Since `32ef9e5054`, -Wa,-gdwarf-2 is no longer used in KBUILD_AFLAGS. Instead, it includes -g, the appropriate -gdwarf-* flag, and also the -Wa versions of both of those if building with Clang and GNU as. As a result, debug info was being generated for the purgatory objects, even though the intention was that it not be. Fixes: `32ef9e5054` ("Makefile.debug: re-enable debug info for .S files") Signed-off-by: Alyssa Ross <hi@alyssa.is> Cc: stable@vger.kernel.org Acked-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>	2023-04-08 19:36:53 +09:00
Aaron Lewis	dfdeda67ea	KVM: x86/pmu: Prevent the PMU from counting disallowed events When counting "Instructions Retired" (0xc0) in a guest, KVM will occasionally increment the PMU counter regardless of if that event is being filtered. This is because some PMU events are incremented via kvm_pmu_trigger_event(), which doesn't know about the event filter. Add the event filter to kvm_pmu_trigger_event(), so events that are disallowed do not increment their counters. Fixes: `9cd803d496` ("KVM: x86: Update vPMCs when retiring instructions") Signed-off-by: Aaron Lewis <aaronlewis@google.com> Reviewed-by: Like Xu <likexu@tencent.com> Link: https://lore.kernel.org/r/20230307141400.1486314-2-aaronlewis@google.com [sean: prepend "pmc" to the new function] Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-04-07 09:24:16 -07:00
Like Xu	4fa5843d81	KVM: x86/pmu: Fix a typo in kvm_pmu_request_counter_reprogam() Fix a "reprogam" => "reprogram" typo in kvm_pmu_request_counter_reprogam(). Fixes: `68fb4757e8` ("KVM: x86/pmu: Defer reprogram_counter() to kvm_pmu_handle_event()") Signed-off-by: Like Xu <likexu@tencent.com> Link: https://lore.kernel.org/r/20230310113349.31799-1-likexu@tencent.com [sean: trim the changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-04-07 09:07:41 -07:00
Uros Bizjak	f96fb2df3e	x86/apic: Fix atomic update of offset in reserve_eilvt_offset() The detection of atomic update failure in reserve_eilvt_offset() is not correct. The value returned by atomic_cmpxchg() should be compared to the old value from the location to be updated. If these two are the same, then atomic update succeeded and "eilvt_offsets[offset]" location is updated to "new" in an atomic way. Otherwise, the atomic update failed and it should be retried with the value from "eilvt_offsets[offset]" - exactly what atomic_try_cmpxchg() does in a correct and more optimal way. Fixes: `a68c439b19` ("apic, x86: Check if EILVT APIC registers are available (AMD only)") Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230227160917.107820-1-ubizjak@gmail.com	2023-04-07 14:34:24 +02:00
Like Xu	649bccd7fa	KVM: x86/pmu: Rewrite reprogram_counters() to improve performance A valid pmc is always tested before using pmu->reprogram_pmi. Eliminate this part of the redundancy by setting the counter's bitmask directly, and in addition, trigger KVM_REQ_PMU only once to save more cpu cycles. Signed-off-by: Like Xu <likexu@tencent.com> Link: https://lore.kernel.org/r/20230214050757.9623-4-likexu@tencent.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-04-06 16:04:31 -07:00
Sean Christopherson	8bca8c5ce4	KVM: VMX: Refactor intel_pmu_{g,}set_msr() to align with other helpers Invert the flows in intel_pmu_{g,s}et_msr()'s case statements so that they follow the kernel's preferred style of: if (<not valid>) return <error> <commit change> return <success> which is also the style used by every other {g,s}et_msr() helper (except AMD's PMU variant, which doesn't use a switch statement). Modify the "set" paths with costly side effects, i.e. that reprogram counters, to skip only the side effects, i.e. to perform reserved bits checks even if the value is unchanged. None of the reserved bits checks are expensive, so there's no strong justification for skipping them, and guarding only the side effect makes it slightly more obvious what is being skipped and why. No functional change intended (assuming no reserved bit bugs). Link: https://lkml.kernel.org/r/Y%2B6cfen%2FCpO3%2FdLO%40google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-04-06 16:03:20 -07:00
Like Xu	cdd2fbf636	KVM: x86/pmu: Rename pmc_is_enabled() to pmc_is_globally_enabled() The name of function pmc_is_enabled() is a bit misleading. A PMC can be disabled either by PERF_CLOBAL_CTRL or by its corresponding EVTSEL. Append global semantics to its name. Suggested-by: Jim Mattson <jmattson@google.com> Signed-off-by: Like Xu <likexu@tencent.com> Link: https://lore.kernel.org/r/20230214050757.9623-2-likexu@tencent.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-04-06 15:57:19 -07:00
Sean Christopherson	957d0f70e9	KVM: x86/pmu: Zero out LBR capabilities during PMU refresh Zero out the LBR capabilities during PMU refresh to avoid exposing LBRs to the guest against userspace's wishes. If userspace modifies the guest's CPUID model or invokes KVM_CAP_PMU_CAPABILITY to disable vPMU after an initial KVM_SET_CPUID2, but before the first KVM_RUN, KVM will retain the previous LBR info due to bailing before refreshing the LBR descriptor. Note, this is a very theoretical bug, there is no known use case where a VMM would deliberately enable the vPMU via KVM_SET_CPUID2, and then later disable the vPMU. Link: https://lore.kernel.org/r/20230311004618.920745-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-04-06 14:58:43 -07:00
Sean Christopherson	3a6de51a43	KVM: x86/pmu: WARN and bug the VM if PMU is refreshed after vCPU has run Now that KVM disallows changing feature MSRs, i.e. PERF_CAPABILITIES, after running a vCPU, WARN and bug the VM if the PMU is refreshed after the vCPU has run. Note, KVM has disallowed CPUID updates after running a vCPU since commit `feb627e8d6` ("KVM: x86: Forbid KVM_SET_CPUID{,2} after KVM_RUN"), i.e. PERF_CAPABILITIES was the only remaining way to trigger a PMU refresh after KVM_RUN. Cc: Like Xu <like.xu.linux@gmail.com> Link: https://lore.kernel.org/r/20230311004618.920745-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-04-06 14:58:43 -07:00
Sean Christopherson	0094f62c7e	KVM: x86: Disallow writes to immutable feature MSRs after KVM_RUN Disallow writes to feature MSRs after KVM_RUN to prevent userspace from changing the vCPU model after running the vCPU. Similar to guest CPUID, KVM uses feature MSRs to configure intercepts, determine what operations are/aren't allowed, etc. Changing the capabilities while the vCPU is active will at best yield unpredictable guest behavior, and at worst could be dangerous to KVM. Allow writing the current value, e.g. so that userspace can blindly set all MSRs when emulating RESET, and unconditionally allow writes to MSR_IA32_UCODE_REV so that userspace can emulate patch loads. Special case the VMX MSRs to keep the generic list small, i.e. so that KVM can do a linear walk of the generic list without incurring meaningful overhead. Cc: Like Xu <like.xu.linux@gmail.com> Cc: Yu Zhang <yu.c.zhang@linux.intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20230311004618.920745-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-04-06 14:57:23 -07:00
Sean Christopherson	9eb6ba31db	KVM: x86: Generate set of VMX feature MSRs using first/last definitions Add VMX MSRs to the runtime list of feature MSRs by iterating over the range of emulated MSRs instead of manually defining each MSR in the "all" list. Using the range definition reduces the cost of emulating a new VMX MSR, e.g. prevents forgetting to add an MSR to the list. Extracting the VMX MSRs from the "all" list, which is a compile-time constant, also shrinks the list to the point where the compiler can heavily optimize code that iterates over the list. No functional change intended. Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20230311004618.920745-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-04-06 14:57:23 -07:00
Sean Christopherson	5757f5b956	KVM: x86: Add macros to track first...last VMX feature MSRs Add macros to track the range of VMX feature MSRs that are emulated by KVM to reduce the maintenance cost of extending the set of emulated MSRs. Note, KVM doesn't necessarily emulate all known/consumed VMX MSRs, e.g. PROCBASED_CTLS3 is consumed by KVM to enable IPI virtualization, but is not emulated as KVM doesn't emulate/virtualize IPI virtualization for nested guests. No functional change intended. Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20230311004618.920745-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-04-06 14:57:22 -07:00
Sean Christopherson	fb3146b4dc	KVM: x86: Add a helper to query whether or not a vCPU has ever run Add a helper to query if a vCPU has run so that KVM doesn't have to open code the check on last_vmentry_cpu being set to a magic value. No functional change intended. Suggested-by: Xiaoyao Li <xiaoyao.li@intel.com> Cc: Like Xu <like.xu.linux@gmail.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20230311004618.920745-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-04-06 14:57:22 -07:00
Sean Christopherson	b1932c5c19	KVM: x86: Rename kvm_init_msr_list() to clarify it inits multiple lists Rename kvm_init_msr_list() to kvm_init_msr_lists() to clarify that it initializes multiple lists: MSRs to save, emulated MSRs, and feature MSRs. No functional change intended. Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20230311004618.920745-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-04-06 14:57:22 -07:00
Basavaraj Natikar	f195fc1e97	x86/PCI: Add quirk for AMD XHCI controller that loses MSI-X state in D3hot The AMD [1022:15b8] USB controller loses some internal functional MSI-X context when transitioning from D0 to D3hot. BIOS normally traps D0->D3hot and D3hot->D0 transitions so it can save and restore that internal context, but some firmware in the field can't do this because it fails to clear the AMD_15B8_RCC_DEV2_EPF0_STRAP2 NO_SOFT_RESET bit. Clear AMD_15B8_RCC_DEV2_EPF0_STRAP2 NO_SOFT_RESET bit before USB controller initialization during boot. Link: https://lore.kernel.org/linux-usb/Y%2Fz9GdHjPyF2rNG3@glanzmann.de/T/#u Link: https://lore.kernel.org/r/20230329172859.699743-1-Basavaraj.Natikar@amd.com Reported-by: Thomas Glanzmann <thomas@glanzmann.de> Tested-by: Thomas Glanzmann <thomas@glanzmann.de> Signed-off-by: Basavaraj Natikar <Basavaraj.Natikar@amd.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Cc: stable@vger.kernel.org	2023-04-06 16:23:59 -05:00
Kirill A. Shutemov	97740266de	x86/mm/iommu/sva: Do not allow to set FORCE_TAGGED_SVA bit from outside arch_prctl(ARCH_FORCE_TAGGED_SVA) overrides the default and allows LAM and SVA to co-exist in the process. It is expected by called by the process when it knows what it is doing. arch_prctl() operates on the current process, but the same code is reachable from ptrace where it can be called on arbitrary task. Make it strict and only allow to set MM_CONTEXT_FORCE_TAGGED_SVA for the current process. Fixes: `23e5d9ec2b` ("x86/mm/iommu/sva: Make LAM and SVA mutually exclusive") Suggested-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Link: https://lore.kernel.org/all/20230403111020.3136-3-kirill.shutemov%40linux.intel.com	2023-04-06 13:45:06 -07:00
Kirill A. Shutemov	fca1fdd2b0	x86/mm/iommu/sva: Fix error code for LAM enabling failure due to SVA Normally, LAM and SVA are mutually exclusive. LAM enabling will fail if SVA is already in use. Correct error code for the failure. EINTR is nonsensical there. Fixes: `23e5d9ec2b` ("x86/mm/iommu/sva: Make LAM and SVA mutually exclusive") Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Link: https://lore.kernel.org/all/CACT4Y+YfqSMsZArhh25TESmG-U4jO5Hjphz87wKSnTiaw2Wrfw@mail.gmail.com Link: https://lore.kernel.org/all/20230403111020.3136-2-kirill.shutemov%40linux.intel.com	2023-04-06 13:44:58 -07:00
Sean Christopherson	400d213228	KVM: SVM: Return the local "r" variable from svm_set_msr() Rename "r" to "ret" and actually return it from svm_set_msr() to reduce the probability of repeating the mistake of commit `723d5fb0ff` ("kvm: svm: Add IA32_FLUSH_CMD guest support"), which set "r" thinking that it would be propagated to the caller. Alternatively, the declaration of "r" could be moved into the handling of MSR_TSC_AUX, but that risks variable shadowing in the future. A wrapper for kvm_set_user_return_msr() would allow eliding a local variable, but that feels like delaying the inevitable. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230322011440.2195485-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-04-06 13:37:37 -04:00
Sean Christopherson	da3db168fb	KVM: x86: Virtualize FLUSH_L1D and passthrough MSR_IA32_FLUSH_CMD Virtualize FLUSH_L1D so that the guest can use the performant L1D flush if one of the many mitigations might require a flush in the guest, e.g. Linux provides an option to flush the L1D when switching mms. Passthrough MSR_IA32_FLUSH_CMD for write when it's supported in hardware and exposed to the guest, i.e. always let the guest write it directly if FLUSH_L1D is fully supported. Forward writes to hardware in host context on the off chance that KVM ends up emulating a WRMSR, or in the really unlikely scenario where userspace wants to force a flush. Restrict these forwarded WRMSRs to the known command out of an abundance of caution. Passing through the MSR means the guest can throw any and all values at hardware, but doing so in host context is arguably a bit more dangerous. Link: https://lkml.kernel.org/r/CALMp9eTt3xzAEoQ038bJQ9LN0ZOXrSWsN7xnNUD%2B0SS%3DWwF7Pg%40mail.gmail.com Link: https://lore.kernel.org/all/20230201132905.549148-2-eesposit@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230322011440.2195485-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-04-06 13:37:37 -04:00
Sean Christopherson	903358c7ed	KVM: x86: Move MSR_IA32_PRED_CMD WRMSR emulation to common code Dedup the handling of MSR_IA32_PRED_CMD across VMX and SVM by moving the logic to kvm_set_msr_common(). Now that the MSR interception toggling is handled as part of setting guest CPUID, the VMX and SVM paths are identical. Opportunistically massage the code to make it a wee bit denser. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-Id: <20230322011440.2195485-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-04-06 13:37:36 -04:00
Sean Christopherson	bff903e8cd	KVM: SVM: Passthrough MSR_IA32_PRED_CMD based purely on host+guest CPUID Passthrough MSR_IA32_PRED_CMD based purely on whether or not the MSR is supported and enabled, i.e. don't wait until the first write. There's no benefit to deferred passthrough, and the extra logic only adds complexity. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230322011440.2195485-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-04-06 13:37:36 -04:00
Sean Christopherson	9a4c485013	KVM: VMX: Passthrough MSR_IA32_PRED_CMD based purely on host+guest CPUID Passthrough MSR_IA32_PRED_CMD based purely on whether or not the MSR is supported and enabled, i.e. don't wait until the first write. There's no benefit to deferred passthrough, and the extra logic only adds complexity. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-Id: <20230322011440.2195485-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-04-06 13:37:36 -04:00
Sean Christopherson	52887af565	KVM: x86: Revert MSR_IA32_FLUSH_CMD.FLUSH_L1D enabling Revert the recently added virtualizing of MSR_IA32_FLUSH_CMD, as both the VMX and SVM are fatally buggy to guests that use MSR_IA32_FLUSH_CMD or MSR_IA32_PRED_CMD, and because the entire foundation of the logic is flawed. The most immediate problem is an inverted check on @cmd that results in rejecting legal values. SVM doubles down on bugs and drops the error, i.e. silently breaks all guest mitigations based on the command MSRs. The next issue is that neither VMX nor SVM was updated to mark MSR_IA32_FLUSH_CMD as being a possible passthrough MSR, which isn't hugely problematic, but does break MSR filtering and triggers a WARN on VMX designed to catch this exact bug. The foundational issues stem from the MSR_IA32_FLUSH_CMD code reusing logic from MSR_IA32_PRED_CMD, which in turn was likely copied from KVM's support for MSR_IA32_SPEC_CTRL. The copy+paste from MSR_IA32_SPEC_CTRL was misguided as MSR_IA32_PRED_CMD (and MSR_IA32_FLUSH_CMD) is a write-only MSR, i.e. doesn't need the same "deferred passthrough" shenanigans as MSR_IA32_SPEC_CTRL. Revert all MSR_IA32_FLUSH_CMD enabling in one fell swoop so that there is no point where KVM advertises, but does not support, L1D_FLUSH. This reverts commits `45cf86f261`, `723d5fb0ff`, and `a807b78ad0`. Reported-by: Nathan Chancellor <nathan@kernel.org> Link: https://lkml.kernel.org/r/20230317190432.GA863767%40dev-arch.thelio-3990X Cc: Emanuele Giuseppe Esposito <eesposit@redhat.com> Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Cc: Jim Mattson <jmattson@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Mathias Krause <minipli@grsecurity.net> Message-Id: <20230322011440.2195485-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-04-06 13:37:35 -04:00
Suren Baghdasaryan	0bff0aaea0	x86/mm: try VMA lock-based page fault handling first Attempt VMA lock-based page fault handling first, and fall back to the existing mmap_lock-based handling if that fails. Link: https://lkml.kernel.org/r/20230227173632.3292573-30-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-04-05 20:03:01 -07:00
Sean Christopherson	098f4c061e	KVM: x86/pmu: Disallow legacy LBRs if architectural LBRs are available Disallow enabling LBR support if the CPU supports architectural LBRs. Traditional LBR support is absent on CPU models that have architectural LBRs, and KVM doesn't yet support arch LBRs, i.e. KVM will pass through non-existent MSRs if userspace enables LBRs for the guest. Cc: stable@vger.kernel.org Cc: Yang Weijiang <weijiang.yang@intel.com> Cc: Like Xu <like.xu.linux@gmail.com> Reported-by: Paolo Bonzini <pbonzini@redhat.com> Fixes: `be635e34c2` ("KVM: vmx/pmu: Expose LBR_FMT in the MSR_IA32_PERF_CAPABILITIES") Tested-by: Like Xu <likexu@tencent.com> Link: https://lore.kernel.org/r/20230128001427.2548858-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-04-05 17:00:17 -07:00
Tom Rix	944a8dad8b	KVM: x86: set "mitigate_smt_rsb" storage-class-specifier to static smatch reports arch/x86/kvm/x86.c:199:20: warning: symbol 'mitigate_smt_rsb' was not declared. Should it be static? This variable is only used in one file so it should be static. Signed-off-by: Tom Rix <trix@redhat.com> Link: https://lore.kernel.org/r/20230404010141.1913667-1-trix@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-04-05 16:47:35 -07:00
Like Xu	7e768ce827	KVM: x86/pmu: Zero out pmu->all_valid_pmc_idx each time it's refreshed The kvm_pmu_refresh() may be called repeatedly (e.g. configure guest CPUID repeatedly or update MSR_IA32_PERF_CAPABILITIES) and each call will use the last pmu->all_valid_pmc_idx value, with the residual bits introducing additional overhead later in the vPMU emulation. Fixes: `b35e5548b4` ("KVM: x86/vPMU: Add lazy mechanism to release perf_event per vPMC") Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Like Xu <likexu@tencent.com> Link: https://lore.kernel.org/r/20230404071759.75376-1-likexu@tencent.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-04-05 16:33:10 -07:00
Niklas Schnelle	fcbfe8121a	Kconfig: introduce HAS_IOPORT option and select it as necessary We introduce a new HAS_IOPORT Kconfig option to indicate support for I/O Port access. In a future patch HAS_IOPORT=n will disable compilation of the I/O accessor functions inb()/outb() and friends on architectures which can not meaningfully support legacy I/O spaces such as s390. The following architectures do not select HAS_IOPORT: * ARC * C-SKY * Hexagon * Nios II * OpenRISC * s390 * User-Mode Linux * Xtensa All other architectures select HAS_IOPORT at least conditionally. The "depends on" relations on HAS_IOPORT in drivers as well as ifdefs for HAS_IOPORT specific sections will be added in subsequent patches on a per subsystem basis. Co-developed-by: Arnd Bergmann <arnd@kernel.org> Signed-off-by: Arnd Bergmann <arnd@kernel.org> Acked-by: Johannes Berg <johannes@sipsolutions.net> # for ARCH=um Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>	2023-04-05 22:15:19 +02:00
Binbin Wu	548bd27428	KVM: VMX: Use is_64_bit_mode() to check 64-bit mode in SGX handler sgx_get_encls_gva() uses is_long_mode() to check 64-bit mode, however, SGX system leaf instructions are valid in compatibility mode, should use is_64_bit_mode() instead. Fixes: `70210c044b` ("KVM: VMX: Add SGX ENCLS[ECREATE] handler to enforce CPUID restrictions") Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20230404032502.27798-1-binbin.wu@linux.intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-04-05 11:52:19 -07:00
Tony Luck	36168bc061	x86/cpu: Add Xeon Emerald Rapids to list of CPUs that support PPIN This should be the last addition to this table. Future CPUs will enumerate PPIN support using CPUID. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230404212124.428118-1-tony.luck@intel.com	2023-04-05 20:01:52 +02:00
Paul E. McKenney	d276134ed4	arch/x86: Remove "select SRCU" Now that the SRCU Kconfig option is unconditionally selected, there is no longer any point in selecting it. Therefore, remove the "select SRCU" Kconfig statements. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: <x86@kernel.org> Reviewed-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>	2023-04-05 13:47:42 +00:00
Paul E. McKenney	79cf833be6	kvm: Remove "select SRCU" Now that the SRCU Kconfig option is unconditionally selected, there is no longer any point in selecting it. Therefore, remove the "select SRCU" Kconfig statements from the various KVM Kconfig files. Acked-by: Sean Christopherson <seanjc@google.com> (x86) Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Aleksandar Markovic <aleksandar.qemu.devel@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Sean Christopherson <seanjc@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: <kvm@vger.kernel.org> Acked-by: Marc Zyngier <maz@kernel.org> (arm64) Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Acked-by: Anup Patel <anup@brainfault.org> (riscv) Acked-by: Heiko Carstens <hca@linux.ibm.com> (s390) Reviewed-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>	2023-04-05 13:47:42 +00:00
Tony Luck	81515ecf15	x86/cpu: Add model number for Intel Arrow Lake processor Successor to Lunar Lake. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230404174641.426593-1-tony.luck@intel.com	2023-04-05 13:36:26 +02:00
Oliver Upton	e65733b5c5	KVM: x86: Redefine 'longmode' as a flag for KVM_EXIT_HYPERCALL The 'longmode' field is a bit annoying as it blows an entire __u32 to represent a boolean value. Since other architectures are looking to add support for KVM_EXIT_HYPERCALL, now is probably a good time to clean it up. Redefine the field (and the remaining padding) as a set of flags. Preserve the existing ABI by using bit 0 to indicate if the guest was in long mode and requiring that the remaining 31 bits must be zero. Cc: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230404154050.2270077-2-oliver.upton@linux.dev	2023-04-05 12:07:41 +01:00
Kirill A. Shutemov	5462ade687	x86/boot: Centralize __pa()/__va() definitions Replace multiple __pa()/__va() definitions with a single one in misc.h. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20230330114956.20342-2-kirill.shutemov%40linux.intel.com	2023-04-04 13:42:37 -07:00
Vipin Sharma	40fa907e5a	KVM: x86/mmu: Merge all handle_changed_pte*() functions Merge __handle_changed_pte() and handle_changed_spte_acc_track() into a single function, handle_changed_pte(), as the two are always used together. Remove the existing handle_changed_pte(), as it's just a wrapper that calls __handle_changed_pte() and handle_changed_spte_acc_track(). Signed-off-by: Vipin Sharma <vipinsh@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Reviewed-by: David Matlack <dmatlack@google.com> [sean: massage changelog] Link: https://lore.kernel.org/r/20230321220021.2119033-14-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-04-04 12:37:31 -07:00
Vipin Sharma	1f9973456e	KVM: x86/mmu: Remove handle_changed_spte_dirty_log() Remove handle_changed_spte_dirty_log() as there is no code flow which sets 4KiB SPTE writable and hit this path. This function marks the page dirty in a memslot only if new SPTE is 4KiB in size and writable. Current users of handle_changed_spte_dirty_log() are: 1. set_spte_gfn() - Create only non writable SPTEs. 2. write_protect_gfn() - Change an SPTE to non writable. 3. zap leaf and roots APIs - Everything is 0. 4. handle_removed_pt() - Sets SPTEs to REMOVED_SPTE 5. tdp_mmu_link_sp() - Makes non leaf SPTEs. There is also no path which creates a writable 4KiB without going through make_spte() and this functions takes care of marking SPTE dirty in the memslot if it is PT_WRITABLE. Signed-off-by: Vipin Sharma <vipinsh@google.com> Reviewed-by: David Matlack <dmatlack@google.com> [sean: add blurb to __handle_changed_spte()'s comment] Link: https://lore.kernel.org/r/20230321220021.2119033-13-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-04-04 12:37:30 -07:00
Vipin Sharma	0b7cc2547d	KVM: x86/mmu: Remove "record_acc_track" in __tdp_mmu_set_spte() Remove bool parameter "record_acc_track" from __tdp_mmu_set_spte() and refactor the code. This variable is always set to true by its caller. Remove single and double underscore prefix from tdp_mmu_set_spte() related APIs: 1. Change __tdp_mmu_set_spte() to tdp_mmu_set_spte() 2. Change _tdp_mmu_set_spte() to tdp_mmu_iter_set_spte() Signed-off-by: Vipin Sharma <vipinsh@google.com> Reviewed-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20230321220021.2119033-12-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-04-04 12:37:30 -07:00
Vipin Sharma	891f115960	KVM: x86/mmu: Bypass __handle_changed_spte() when aging TDP MMU SPTEs Drop everything except the "tdp_mmu_spte_changed" tracepoint part of __handle_changed_spte() when aging SPTEs in the TDP MMU, as clearing the accessed status doesn't affect the SPTE's shadow-present status, whether or not the SPTE is a leaf, or change the PFN. I.e. none of the functional updates handled by __handle_changed_spte() are relevant. Losing __handle_changed_spte()'s sanity checks does mean that a bug could theoretical go unnoticed, but that scenario is extremely unlikely, e.g. would effectively require a misconfigured MMU or a locking bug elsewhere. Link: https://lore.kernel.org/all/Y9HcHRBShQgjxsQb@google.com Signed-off-by: Vipin Sharma <vipinsh@google.com> Reviewed-by: David Matlack <dmatlack@google.com> [sean: massage changelog] Link: https://lore.kernel.org/r/20230321220021.2119033-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-04-04 12:37:30 -07:00
Vipin Sharma	6141df067d	KVM: x86/mmu: Drop unnecessary dirty log checks when aging TDP MMU SPTEs Drop the unnecessary call to handle dirty log updates when aging TDP MMU SPTEs, as neither clearing the Accessed bit nor marking a SPTE for access tracking can _set_ the Writable bit, i.e. can't trigger marking a gfn dirty in its memslot. The access tracking path can _clear_ the Writable bit, e.g. if the XCHG races with fast_page_fault() and writes the stale value without the Writable bit set, but clearing the Writable bit outside of mmu_lock is not allowed, i.e. access tracking can't spuriously set the Writable bit. Signed-off-by: Vipin Sharma <vipinsh@google.com> [sean: split to separate patch, apply to dirty path, write changelog] Link: https://lore.kernel.org/r/20230321220021.2119033-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-04-04 12:37:30 -07:00
Vipin Sharma	7ee131e3a3	KVM: x86/mmu: Clear only A-bit (if enabled) when aging TDP MMU SPTEs Use tdp_mmu_clear_spte_bits() when clearing the Accessed bit in TDP MMU SPTEs so as to use an atomic-AND instead of XCHG to clear the A-bit. Similar to the D-bit story, this will allow KVM to bypass __handle_changed_spte() by ensuring only the A-bit is modified. Link: https://lore.kernel.org/all/Y9HcHRBShQgjxsQb@google.com Signed-off-by: Vipin Sharma <vipinsh@google.com> Reviewed-by: David Matlack <dmatlack@google.com> [sean: massage changelog] Link: https://lore.kernel.org/r/20230321220021.2119033-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-04-04 12:37:30 -07:00
Vipin Sharma	e73008705d	KVM: x86/mmu: Remove "record_dirty_log" in __tdp_mmu_set_spte() Remove bool parameter "record_dirty_log" from __tdp_mmu_set_spte() and refactor the code as this variable is always set to true by its caller. Signed-off-by: Vipin Sharma <vipinsh@google.com> Reviewed-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20230321220021.2119033-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-04-04 12:37:30 -07:00
Vipin Sharma	1e0f42985f	KVM: x86/mmu: Bypass __handle_changed_spte() when clearing TDP MMU dirty bits Drop everything except marking the PFN dirty and the relevant tracepoint parts of __handle_changed_spte() when clearing the dirty status of gfns in the TDP MMU. Clearing only the Dirty (or Writable) bit doesn't affect the SPTEs shadow-present status, whether or not the SPTE is a leaf, or change the SPTE's PFN. I.e. other than marking the PFN dirty, none of the functional updates handled by __handle_changed_spte() are relevant. Losing __handle_changed_spte()'s sanity checks does mean that a bug could theoretical go unnoticed, but that scenario is extremely unlikely, e.g. would effectively require a misconfigured or a locking bug elsewhere. Opportunistically remove a comment blurb from __handle_changed_spte() about all modifications to TDP MMU SPTEs needing to invoke said function, that "rule" hasn't been true since fast page fault support was added for the TDP MMU (and perhaps even before). Tested on a VM (160 vCPUs, 160 GB memory) and found that performance of clear dirty log stage improved by ~40% in dirty_log_perf_test (with the full optimization applied). Before optimization: -------------------- Iteration 1 clear dirty log time: 3.638543593s Iteration 2 clear dirty log time: 3.145032742s Iteration 3 clear dirty log time: 3.142340358s Clear dirty log over 3 iterations took 9.925916693s. (Avg 3.308638897s/iteration) After optimization: ------------------- Iteration 1 clear dirty log time: 2.318988110s Iteration 2 clear dirty log time: 1.794470164s Iteration 3 clear dirty log time: 1.791668628s Clear dirty log over 3 iterations took 5.905126902s. (Avg 1.968375634s/iteration) Link: https://lore.kernel.org/all/Y9hXmz%2FnDOr1hQal@google.com Signed-off-by: Vipin Sharma <vipinsh@google.com> Reviewed-by: David Matlack <dmatlack@google.com> [sean: split the switch to atomic-AND to a separate patch] Link: https://lore.kernel.org/r/20230321220021.2119033-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-04-04 12:37:30 -07:00
Vipin Sharma	cf05e8c732	KVM: x86/mmu: Drop access tracking checks when clearing TDP MMU dirty bits Drop the unnecessary call to handle access-tracking changes when clearing the dirty status of TDP MMU SPTEs. Neither the Dirty bit nor the Writable bit has any impact on the accessed state of a page, i.e. clearing only the aforementioned bits doesn't make an accessed SPTE suddently not accessed. Signed-off-by: Vipin Sharma <vipinsh@google.com> [sean: split to separate patch, write changelog] Link: https://lore.kernel.org/r/20230321220021.2119033-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-04-04 12:37:30 -07:00
Vipin Sharma	89c313f20c	KVM: x86/mmu: Atomically clear SPTE dirty state in the clear-dirty-log flow Optimize the clearing of dirty state in TDP MMU SPTEs by doing an atomic-AND (on SPTEs that have volatile bits) instead of the full XCHG that currently ends up being invoked (see kvm_tdp_mmu_write_spte()). Clearing _only_ the bit in question will allow KVM to skip the many irrelevant checks in __handle_changed_spte() by avoiding any collateral damage due to the XCHG writing all SPTE bits, e.g. the XCHG could race with fast_page_fault() setting the W-bit and the CPU setting the D-bit, and thus incorrectly drop the CPU's D-bit update. Link: https://lore.kernel.org/all/Y9hXmz%2FnDOr1hQal@google.com Signed-off-by: Vipin Sharma <vipinsh@google.com> Reviewed-by: David Matlack <dmatlack@google.com> [sean: split the switch to atomic-AND to a separate patch] Link: https://lore.kernel.org/r/20230321220021.2119033-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-04-04 12:37:30 -07:00
Vipin Sharma	697c89bed9	KVM: x86/mmu: Consolidate Dirty vs. Writable clearing logic in TDP MMU Deduplicate the guts of the TDP MMU's clearing of dirty status by snapshotting whether to check+clear the Dirty bit vs. the Writable bit, which is the only difference between the two flavors of dirty tracking. Note, kvm_ad_enabled() is just a wrapper for shadow_accessed_mask, i.e. is constant after kvm-{intel,amd}.ko is loaded. Link: https://lore.kernel.org/all/Yz4Qi7cn7TWTWQjj@google.com Signed-off-by: Vipin Sharma <vipinsh@google.com> [sean: split to separate patch, apply to dirty log, write changelog] Link: https://lore.kernel.org/r/20230321220021.2119033-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-04-04 12:37:30 -07:00
Vipin Sharma	5982a53926	KVM: x86/mmu: Use kvm_ad_enabled() to determine if TDP MMU SPTEs need wrprot Use the constant-after-module-load kvm_ad_enabled() to check if SPTEs in the TDP MMU need to be write-protected when clearing accessed/dirty status instead of manually checking every SPTE. The per-SPTE A/D enabling is specific to nested EPT MMUs, i.e. when KVM is using EPT A/D bits but L1 is not, and so cannot happen in the TDP MMU (which is non-nested only). Keep the original code as sanity checks buried under MMU_WARN_ON(). MMU_WARN_ON() is more or less useless at the moment, but there are plans to change that. Link: https://lore.kernel.org/all/Yz4Qi7cn7TWTWQjj@google.com Signed-off-by: Vipin Sharma <vipinsh@google.com> [sean: split to separate patch, apply to dirty path, write changelog] Link: https://lore.kernel.org/r/20230321220021.2119033-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-04-04 12:37:30 -07:00
Vipin Sharma	41e07665f1	KVM: x86/mmu: Add a helper function to check if an SPTE needs atomic write Move conditions in kvm_tdp_mmu_write_spte() to check if an SPTE should be written atomically or not to a separate function. This new function, kvm_tdp_mmu_spte_need_atomic_write(), will be used in future commits to optimize clearing bits in SPTEs. Signed-off-by: Vipin Sharma <vipinsh@google.com> Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Link: https://lore.kernel.org/r/20230321220021.2119033-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-04-04 12:37:30 -07:00
Linus Torvalds	76f598ba7d	PPC: * Hide KVM_CAP_IRQFD_RESAMPLE if XIVE is enabled s390: * Fix handling of external interrupts in protected guests x86: * Resample the pending state of IOAPIC interrupts when unmasking them * Fix usage of Hyper-V "enlightened TLB" on AMD * Small fixes to real mode exceptions * Suppress pending MMIO write exits if emulator detects exception Documentation: * Fix rST syntax -----BEGIN PGP SIGNATURE----- iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmQsXMQUHHBib256aW5p QHJlZGhhdC5jb20ACgkQv/vSX3jHroOc6Qf/YMDcYxpPuYFZ69t9k4s0ubeGjGpr Qdzo6ZgKbfs4+nu89u2F5zvAftRSrEVQRpfLwmTXauM8el3s+EAEPlK/MLH6sSSo LG3eWoNxs7MUQ6gBWeQypY6opvz3W9hf8KYd+GY6TV7jwV5/YgH6674Yyd7T/bCm 0/tUiil5Q7Da0w4GUY5m9gEqlbA9yjDBFySqqTZemo18NzLaPNpesKnm1hvv0QJG JRvOMmHwqAZgS0G5cyGEPZKY6Ga0v9QsAXc8mFC/Jtn9GMD2f47PYDRfd/7NSiVz ulgsk8zZkZIZ6nMbQ1t10txmGIbFmJ4iPPaATrMII/xHq+idgAlHU7zY1A== =QvFj -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm fixes from Paolo Bonzini: "PPC: - Hide KVM_CAP_IRQFD_RESAMPLE if XIVE is enabled s390: - Fix handling of external interrupts in protected guests x86: - Resample the pending state of IOAPIC interrupts when unmasking them - Fix usage of Hyper-V "enlightened TLB" on AMD - Small fixes to real mode exceptions - Suppress pending MMIO write exits if emulator detects exception Documentation: - Fix rST syntax" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: docs: kvm: x86: Fix broken field list KVM: PPC: Make KVM_CAP_IRQFD_RESAMPLE platform dependent KVM: s390: pv: fix external interruption loop not always detected KVM: nVMX: Do not report error code when synthesizing VM-Exit from Real Mode KVM: x86: Clear "has_error_code", not "error_code", for RM exception injection KVM: x86: Suppress pending MMIO write exits if emulator detects exception KVM: x86/ioapic: Resample the pending state of an IRQ when unmasking KVM: irqfd: Make resampler_list an RCU list KVM: SVM: Flush Hyper-V TLB when required	2023-04-04 11:29:37 -07:00
Xinghui Li	c0d0ce9b5a	KVM: SVM: Remove a duplicate definition of VMCB_AVIC_APIC_BAR_MASK VMCB_AVIC_APIC_BAR_MASK is defined twice with the same value in svm.h, which is meaningless. Delete the duplicate one. Fixes: `3915035282` ("KVM: x86: SVM: move avic definitions from AMD's spec to svm.h") Signed-off-by: Xinghui Li <korantli@tencent.com> Reviewed-by: Like Xu <likexu@tencent.com> Link: https://lore.kernel.org/r/20230403095200.1391782-1-korantwork@gmail.com [sean: tweak shortlog] Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-04-04 11:08:12 -07:00
David Gow	a3046a618a	um: Only disable SSE on clang to work around old GCC bugs As part of the Rust support for UML, we disable SSE (and similar flags) to match the normal x86 builds. This both makes sense (we ideally want a similar configuration to x86), and works around a crash bug with SSE generation under Rust with LLVM. However, this breaks compiling stdlib.h under gcc < 11, as the x86_64 ABI requires floating-point return values be stored in an SSE register. gcc 11 fixes this by only doing register allocation when a function is actually used, and since we never use atof(), it shouldn't be a problem: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99652 Nevertheless, only disable SSE on clang setups, as that's a simple way of working around everyone's bugs. Fixes: `8849818679` ("rust: arch/um: Disable FP/SIMD instruction to match x86") Reported-by: Roberto Sassu <roberto.sassu@huaweicloud.com> Link: https://lore.kernel.org/linux-um/6df2ecef9011d85654a82acd607fdcbc93ad593c.camel@huaweicloud.com/ Tested-by: Roberto Sassu <roberto.sassu@huaweicloud.com> Tested-by: SeongJae Park <sj@kernel.org> Signed-off-by: David Gow <davidgow@google.com> Reviewed-by: Vincenzo Palazzo <vincenzopalazzodev@gmail.com> Tested-by: Arthur Grillo <arthurgrillo@riseup.net> Signed-off-by: Richard Weinberger <richard@nod.at>	2023-04-04 09:57:05 +02:00
Linus Torvalds	2d72ab2449	hyperv-fixes for 6.3-rc6 -----BEGIN PGP SIGNATURE----- iQFHBAABCgAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAmQqIpUTHHdlaS5saXVA a2VybmVsLm9yZwAKCRB2FHBfkEGgXjtwCACaG8LkrLOa4EWwdVLOutxc/VSHhPzS FaCyzxaSNtFciSl/kOPsl2pmwy+c9QAri3wO9uyJ41R1oUfjy/+pX8TxYc1imOrh 6vIMUntYW7t9ISoUbi7hDU1Nj3CX4KOXruOliLP3WM9mtGvaNL5INEDh9PV6bxIz xlP8JEoKTk0ecChOWZDWyDIE95MwgqRin8uEI0JUyE2mdegIrDC7SFvqT7XjV23O 0gntPdoZCgBzWohaiRMKJHHNUbAC+1O2+1tzY0bONwHdpmRj5/V28e02iARF3bAE 4TvTt3qrZU02epzMhkZPnTztyvp1vPzmpHaBHQD4pdNZP/D1b8ejm4mz =o+VM -----END PGP SIGNATURE----- Merge tag 'hyperv-fixes-signed-20230402' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux Pull hyperv fixes from Wei Liu: - Fix a bug in channel allocation for VMbus (Mohammed Gamal) - Do not allow root partition functionality in CVM (Michael Kelley) * tag 'hyperv-fixes-signed-20230402' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: x86/hyperv: Block root partition functionality in a Confidential VM Drivers: vmbus: Check for channel allocation before looking up relids	2023-04-03 09:34:08 -07:00
Greg Kroah-Hartman	cd8fe5b6db	Merge 6.3-rc5 into driver-core-next We need the fixes in here for testing, as well as the driver core changes for documentation updates to build on. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-04-03 09:33:30 +02:00
Alexey Kardashevskiy	52882b9c7a	KVM: PPC: Make KVM_CAP_IRQFD_RESAMPLE platform dependent When introduced, IRQFD resampling worked on POWER8 with XICS. However KVM on POWER9 has never implemented it - the compatibility mode code ("XICS-on-XIVE") misses the kvm_notify_acked_irq() call and the native XIVE mode does not handle INTx in KVM at all. This moved the capability support advertising to platforms and stops advertising it on XIVE, i.e. POWER9 and later. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Acked-by: Anup Patel <anup@brainfault.org> Acked-by: Nicholas Piggin <npiggin@gmail.com> Message-Id: <20220504074807.3616813-1-aik@ozlabs.ru> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-03-31 11:19:05 -04:00
Paolo Bonzini	85b475a450	A small fix that repairs the external loop detection code for PV guests. -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEwGNS88vfc9+v45Yq41TmuOI4ufgFAmQilhMACgkQ41TmuOI4 ufhydRAAgldhL63tLa9PChGUkRtQLRdWKC8qJXIZUriF5d3fvQD8nfTRSvndwPEj ELg+MO63xhZEqyje4JVjNISc8sHXTZqFk43L+L2uamX2tlbw8ELAtiCTLZHKuCzz T3+vNB6Q103C94ZxsIhxnlhXAw69K0yv7H3P/SytBx6iDCvQ2k+BdAuSKylTytcU lfOGK3ADuvaL9PJSyM1ICCKqgdveEBI+tHo9h/eqgM8SmSNF1OahhjyDiiWtWTT6 EpaTkce0fJypvCShCOMBnSHdPS6QZ4JclNdcDR6uz4VMT28acV6lNGRDkPYgBzta Wnx6+CETyePJt6NU5Nh8XLXstKQ7OUvCUSWNligkRBYYl0pr7jSA8u8xuGCaCCDs 8ZOkMWcYLYKqH3Cvqb2rMR2MM9psrHwtLflC+38hGMV23Iy8YBnAQv/AixSy5yoP J50O1IlnNVhjsRuNxVAEwaybZP3pY11/iEwrhOVDyjdKlZGmbA0lK705bTh5z5YE XXGYDe97sPvztfrpPwDWdqoJPPDvxRvzrXROxhNK4vo5D/qGzobrvxkEkDcrAVqQ yoOj46zk/cieSCkiySURcQ+3qLYi/8+HeOtQcVMApK91GsselzM6vwSuKqFUmicq XrgsS7Gj2VNk4IMLLgevqQWfwvRRhNs/FiUI6lDd+wqqhX09Vl0= =Ysma -----END PGP SIGNATURE----- Merge tag 'kvm-s390-master-6.3-1' of https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD A small fix that repairs the external loop detection code for PV guests.	2023-03-31 11:15:09 -04:00
Jacob Pan	fffaed1e24	iommu/ioasid: Rename INVALID_IOASID INVALID_IOASID and IOMMU_PASID_INVALID are duplicated. Rename INVALID_IOASID and consolidate since we are moving away from IOASID infrastructure. Reviewed-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Link: https://lore.kernel.org/r/20230322200803.869130-7-jacob.jun.pan@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>	2023-03-31 10:03:27 +02:00
Jonathan Corbet	ff61f0791c	docs: move x86 documentation into Documentation/arch/ Move the x86 documentation under Documentation/arch/ as a way of cleaning up the top-level directory and making the structure of our docs more closely match the structure of the source directories it describes. All in-kernel references to the old paths have been updated. Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: linux-arch@vger.kernel.org Cc: x86@kernel.org Cc: Borislav Petkov <bp@alien8.de> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/lkml/20230315211523.108836-1-corbet@lwn.net/ Signed-off-by: Jonathan Corbet <corbet@lwn.net>	2023-03-30 12:58:51 -06:00
Thomas Gleixner	5b422b9bb7	Merge branch 'x86/cc' into x86/apic Pick up the cc_vendor changes.	2023-03-30 14:10:31 +02:00
Thomas Gleixner	87efe38410	Merge branch 'x86/cc' into x86/sev Pick up the cc_vendor changes.	2023-03-30 14:07:05 +02:00
Borislav Petkov (AMD)	3d91c53729	x86/coco: Export cc_vendor It will be used in different checks in future changes. Export it directly and provide accessor functions and stubs so this can be used in general code when CONFIG_ARCH_HAS_CC_PLATFORM is not set. No functional changes. [ tglx: Add accessor functions ] Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20230318115634.9392-2-bp@alien8.de	2023-03-30 14:06:28 +02:00
Eric DeVolder	fed8d8773b	x86/acpi/boot: Correct acpi_is_processor_usable() check The logic in acpi_is_processor_usable() requires the online capable bit be set for hotpluggable CPUs. The online capable bit has been introduced in ACPI 6.3. However, for ACPI revisions < 6.3 which do not support that bit, CPUs should be reported as usable, not the other way around. Reverse the check. [ bp: Rewrite commit message. ] Fixes: `e2869bd7af` ("x86/acpi/boot: Do not register processors that cannot be onlined for x2APIC") Suggested-by: Miguel Luis <miguel.luis@oracle.com> Suggested-by: Boris Ostrovsky <boris.ovstrosky@oracle.com> Signed-off-by: Eric DeVolder <eric.devolder@oracle.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: David R <david@unsolicited.net> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/20230327191026.3454-2-eric.devolder@oracle.com	2023-03-30 11:07:30 +02:00
Mario Limonciello	a74fabfbd1	x86/ACPI/boot: Use FADT version to check support for online capable ACPI 6.3 introduced the online capable bit, and also introduced MADT version 5. Latter was used to distinguish whether the offset storing online capable could be used. However ACPI 6.2b has MADT version "45" which is for an errata version of the ACPI 6.2 spec. This means that the Linux code for detecting availability of MADT will mistakenly flag ACPI 6.2b as supporting online capable which is inaccurate as it's an ACPI 6.3 feature. Instead use the FADT major and minor revision fields to distinguish this. [ bp: Massage. ] Fixes: `aa06e20f1b` ("x86/ACPI: Don't add CPUs that are not online capable") Reported-by: Eric DeVolder <eric.devolder@oracle.com> Reported-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/943d2445-84df-d939-f578-5d8240d342cc@unsolicited.net	2023-03-30 10:50:30 +02:00
Gerald Schaefer	99c2913363	mm: add PTE pointer parameter to flush_tlb_fix_spurious_fault() s390 can do more fine-grained handling of spurious TLB protection faults, when there also is the PTE pointer available. Therefore, pass on the PTE pointer to flush_tlb_fix_spurious_fault() as an additional parameter. This will add no functional change to other architectures, but those with private flush_tlb_fix_spurious_fault() implementations need to be made aware of the new parameter. Link: https://lkml.kernel.org/r/20230306161548.661740-1-gerald.schaefer@linux.ibm.com Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64] Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Acked-by: David Hildenbrand <david@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:12 -07:00
Alexander Potapenko	27f644dc5a	x86: kmsan: use C versions of memset16/memset32/memset64 KMSAN must see as many memory accesses as possible to prevent false positive reports. Fall back to versions of memset16()/memset32()/memset64() implemented in lib/string.c instead of those written in assembly. Link: https://lkml.kernel.org/r/20230303141433.3422671-3-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Suggested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Reviewed-by: Marco Elver <elver@google.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Helge Deller <deller@gmx.de> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:11 -07:00
Alexander Potapenko	6dc4bd4e2f	x86: kmsan: don't rename memintrinsics in uninstrumented files clang -fsanitize=kernel-memory already replaces calls to memset/memcpy/memmove and their __builtin_ versions with __msan_memset/__msan_memcpy/__msan_memmove in instrumented files, so there is no need to override them. In non-instrumented versions we are now required to leave memset() and friends intact, so we cannot replace them with __msan_XXX() functions. Link: https://lkml.kernel.org/r/20230303141433.3422671-1-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Suggested-by: Marco Elver <elver@google.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Helge Deller <deller@gmx.de> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:11 -07:00
Ma Wupeng	d155df53f3	x86/mm/pat: clear VM_PAT if copy_p4d_range failed Syzbot reports a warning in untrack_pfn(). Digging into the root we found that this is due to memory allocation failure in pmd_alloc_one. And this failure is produced due to failslab. In copy_page_range(), memory alloaction for pmd failed. During the error handling process in copy_page_range(), mmput() is called to remove all vmas. While untrack_pfn this empty pfn, warning happens. Here's a simplified flow: dup_mm dup_mmap copy_page_range copy_p4d_range copy_pud_range copy_pmd_range pmd_alloc __pmd_alloc pmd_alloc_one page = alloc_pages(gfp, 0); if (!page) return NULL; mmput exit_mmap unmap_vmas unmap_single_vma untrack_pfn follow_phys WARN_ON_ONCE(1); Since this vma is not generate successfully, we can clear flag VM_PAT. In this case, untrack_pfn() will not be called while cleaning this vma. Function untrack_pfn_moved() has also been renamed to fit the new logic. Link: https://lkml.kernel.org/r/20230217025615.1595558-1-mawupeng1@huawei.com Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Reported-by: <syzbot+5f488e922d047d8f00cc@syzkaller.appspotmail.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@suse.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Toshi Kani <toshi.kani@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:07 -07:00
Sean Christopherson	80962ec912	KVM: nVMX: Do not report error code when synthesizing VM-Exit from Real Mode Don't report an error code to L1 when synthesizing a nested VM-Exit and L2 is in Real Mode. Per Intel's SDM, regarding the error code valid bit: This bit is always 0 if the VM exit occurred while the logical processor was in real-address mode (CR0.PE=0). The bug was introduced by a recent fix for AMD's Paged Real Mode, which moved the error code suppression from the common "queue exception" path to the "inject exception" path, but missed VMX's "synthesize VM-Exit" path. Fixes: `b97f074583` ("KVM: x86: determine if an exception has an error code only when injecting it.") Cc: stable@vger.kernel.org Cc: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230322143300.2209476-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-03-27 10:15:11 -04:00
Sean Christopherson	6c41468c7c	KVM: x86: Clear "has_error_code", not "error_code", for RM exception injection When injecting an exception into a vCPU in Real Mode, suppress the error code by clearing the flag that tracks whether the error code is valid, not by clearing the error code itself. The "typo" was introduced by recent fix for SVM's funky Paged Real Mode. Opportunistically hoist the logic above the tracepoint so that the trace is coherent with respect to what is actually injected (this was also the behavior prior to the buggy commit). Fixes: `b97f074583` ("KVM: x86: determine if an exception has an error code only when injecting it.") Cc: stable@vger.kernel.org Cc: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230322143300.2209476-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-03-27 10:15:10 -04:00
Sean Christopherson	0dc902267c	KVM: x86: Suppress pending MMIO write exits if emulator detects exception Clear vcpu->mmio_needed when injecting an exception from the emulator to squash a (legitimate) warning about vcpu->mmio_needed being true at the start of KVM_RUN without a callback being registered to complete the userspace MMIO exit. Suppressing the MMIO write exit is inarguably wrong from an architectural perspective, but it is the least awful hack-a-fix due to shortcomings in KVM's uAPI, not to mention that KVM already suppresses MMIO writes in this scenario. Outside of REP string instructions, KVM doesn't provide a way to resume an instruction at the exact point where it was "interrupted" if said instruction partially completed before encountering an MMIO access. For MMIO reads, KVM immediately exits to userspace upon detecting MMIO as userspace provides the to-be-read value in a buffer, and so KVM can safely (more or less) restart the instruction from the beginning. When the emulator re-encounters the MMIO read, KVM will service the MMIO by getting the value from the buffer instead of exiting to userspace, i.e. KVM won't put the vCPU into an infinite loop. On an emulated MMIO write, KVM finishes the instruction before exiting to userspace, as exiting immediately would ultimately hang the vCPU due to the aforementioned shortcoming of KVM not being able to resume emulation in the middle of an instruction. For the vast majority of _emulated_ instructions, deferring the userspace exit doesn't cause problems as very few x86 instructions (again ignoring string operations) generate multiple writes. But for instructions that generate multiple writes, e.g. PUSHA (multiple pushes onto the stack), deferring the exit effectively results in only the final write triggering an exit to userspace. KVM does support multiple MMIO "fragments", but only for page splits; if an instruction performs multiple distinct MMIO writes, the number of fragments gets reset when the next MMIO write comes along and any previous MMIO writes are dropped. Circling back to the warning, if a deferred MMIO write coincides with an exception, e.g. in this case a #SS due to PUSHA underflowing the stack after queueing a write to an MMIO page on a previous push, KVM injects the exceptions and leaves the deferred MMIO pending without registering a callback, thus triggering the splat. Sweep the problem under the proverbial rug as dropping MMIO writes is not unique to the exception scenario (see above), i.e. instructions like PUSHA are fundamentally broken with respect to MMIO, and have been since KVM's inception. Reported-by: zhangjianguo <zhangjianguo18@huawei.com> Reported-by: syzbot+760a73552f47a8cd0fd9@syzkaller.appspotmail.com Reported-by: syzbot+8accb43ddc6bd1f5713a@syzkaller.appspotmail.com Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230322141220.2206241-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-03-27 10:13:53 -04:00
Dmytro Maluka	fef8f2b90e	KVM: x86/ioapic: Resample the pending state of an IRQ when unmasking KVM irqfd based emulation of level-triggered interrupts doesn't work quite correctly in some cases, particularly in the case of interrupts that are handled in a Linux guest as oneshot interrupts (IRQF_ONESHOT). Such an interrupt is acked to the device in its threaded irq handler, i.e. later than it is acked to the interrupt controller (EOI at the end of hardirq), not earlier. Linux keeps such interrupt masked until its threaded handler finishes, to prevent the EOI from re-asserting an unacknowledged interrupt. However, with KVM + vfio (or whatever is listening on the resamplefd) we always notify resamplefd at the EOI, so vfio prematurely unmasks the host physical IRQ, thus a new physical interrupt is fired in the host. This extra interrupt in the host is not a problem per se. The problem is that it is unconditionally queued for injection into the guest, so the guest sees an extra bogus interrupt. [] There are observed at least 2 user-visible issues caused by those extra erroneous interrupts for a oneshot irq in the guest: 1. System suspend aborted due to a pending wakeup interrupt from ChromeOS EC (drivers/platform/chrome/cros_ec.c). 2. Annoying "invalid report id data" errors from ELAN0000 touchpad (drivers/input/mouse/elan_i2c_core.c), flooding the guest dmesg every time the touchpad is touched. The core issue here is that by the time when the guest unmasks the IRQ, the physical IRQ line is no longer asserted (since the guest has acked the interrupt to the device in the meantime), yet we unconditionally inject the interrupt queued into the guest by the previous resampling. So to fix the issue, we need a way to detect that the IRQ is no longer pending, and cancel the queued interrupt in this case. With IOAPIC we are not able to probe the physical IRQ line state directly (at least not if the underlying physical interrupt controller is an IOAPIC too), so in this patch we use irqfd resampler for that. Namely, instead of injecting the queued interrupt, we just notify the resampler that this interrupt is done. If the IRQ line is actually already deasserted, we are done. If it is still asserted, a new interrupt will be shortly triggered through irqfd and injected into the guest. In the case if there is no irqfd resampler registered for this IRQ, we cannot fix the issue, so we keep the existing behavior: immediately unconditionally inject the queued interrupt. This patch fixes the issue for x86 IOAPIC only. In the long run, we can fix it for other irqchips and other architectures too, possibly taking advantage of reading the physical state of the IRQ line, which is possible with some other irqchips (e.g. with arm64 GIC, maybe even with the legacy x86 PIC). [] In this description we assume that the interrupt is a physical host interrupt forwarded to the guest e.g. by vfio. Potentially the same issue may occur also with a purely virtual interrupt from an emulated device, e.g. if the guest handles this interrupt, again, as a oneshot interrupt. Signed-off-by: Dmytro Maluka <dmy@semihalf.com> Link: https://lore.kernel.org/kvm/31420943-8c5f-125c-a5ee-d2fde2700083@semihalf.com/ Link: https://lore.kernel.org/lkml/87o7wrug0w.wl-maz@kernel.org/ Message-Id: <20230322204344.50138-3-dmy@semihalf.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-03-27 10:13:28 -04:00
Jeremi Piotrowski	e5c972c1fa	KVM: SVM: Flush Hyper-V TLB when required The Hyper-V "EnlightenedNptTlb" enlightenment is always enabled when KVM is running on top of Hyper-V and Hyper-V exposes support for it (which is always). On AMD CPUs this enlightenment results in ASID invalidations not flushing TLB entries derived from the NPT. To force the underlying (L0) hypervisor to rebuild its shadow page tables, an explicit hypercall is needed. The original KVM implementation of Hyper-V's "EnlightenedNptTlb" on SVM only added remote TLB flush hooks. This worked out fine for a while, as sufficient remote TLB flushes where being issued in KVM to mask the problem. Since v5.17, changes in the TDP code reduced the number of flushes and the out-of-sync TLB prevents guests from booting successfully. Split svm_flush_tlb_current() into separate callbacks for the 3 cases (guest/all/current), and issue the required Hyper-V hypercall when a Hyper-V TLB flush is needed. The most important case where the TLB flush was missing is when loading a new PGD, which is followed by what is now svm_flush_tlb_current(). Cc: stable@vger.kernel.org # v5.17+ Fixes: `1e0c7d4075` ("KVM: SVM: hyper-v: Remote TLB flush for SVM") Link: https://lore.kernel.org/lkml/43980946-7bbf-dcef-7e40-af904c456250@linux.microsoft.com/ Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20230324145233.4585-1-jpiotrowski@linux.microsoft.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-03-27 10:10:26 -04:00
Jithu Joseph	c68e3d4739	x86/include/asm/msr-index.h: Add IFS Array test bits Define MSR bitfields for enumerating support for Array BIST test. Signed-off-by: Jithu Joseph <jithu.joseph@intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20230322003359.213046-5-jithu.joseph@intel.com Signed-off-by: Hans de Goede <hdegoede@redhat.com>	2023-03-27 16:10:20 +02:00
Michael Kelley	812b0597fb	x86/hyperv: Change vTOM handling to use standard coco mechanisms Hyper-V guests on AMD SEV-SNP hardware have the option of using the "virtual Top Of Memory" (vTOM) feature specified by the SEV-SNP architecture. With vTOM, shared vs. private memory accesses are controlled by splitting the guest physical address space into two halves. vTOM is the dividing line where the uppermost bit of the physical address space is set; e.g., with 47 bits of guest physical address space, vTOM is 0x400000000000 (bit 46 is set). Guest physical memory is accessible at two parallel physical addresses -- one below vTOM and one above vTOM. Accesses below vTOM are private (encrypted) while accesses above vTOM are shared (decrypted). In this sense, vTOM is like the GPA.SHARED bit in Intel TDX. Support for Hyper-V guests using vTOM was added to the Linux kernel in two patch sets[1][2]. This support treats the vTOM bit as part of the physical address. For accessing shared (decrypted) memory, these patch sets create a second kernel virtual mapping that maps to physical addresses above vTOM. A better approach is to treat the vTOM bit as a protection flag, not as part of the physical address. This new approach is like the approach for the GPA.SHARED bit in Intel TDX. Rather than creating a second kernel virtual mapping, the existing mapping is updated using recently added coco mechanisms. When memory is changed between private and shared using set_memory_decrypted() and set_memory_encrypted(), the PTEs for the existing kernel mapping are changed to add or remove the vTOM bit in the guest physical address, just as with TDX. The hypercalls to change the memory status on the host side are made using the existing callback mechanism. Everything just works, with a minor tweak to map the IO-APIC to use private accesses. To accomplish the switch in approach, the following must be done: * Update Hyper-V initialization to set the cc_mask based on vTOM and do other coco initialization. * Update physical_mask so the vTOM bit is no longer treated as part of the physical address * Remove CC_VENDOR_HYPERV and merge the associated vTOM functionality under CC_VENDOR_AMD. Update cc_mkenc() and cc_mkdec() to set/clear the vTOM bit as a protection flag. * Code already exists to make hypercalls to inform Hyper-V about pages changing between shared and private. Update this code to run as a callback from __set_memory_enc_pgtable(). * Remove the Hyper-V special case from __set_memory_enc_dec() * Remove the Hyper-V specific call to swiotlb_update_mem_attributes() since mem_encrypt_init() will now do it. * Add a Hyper-V specific implementation of the is_private_mmio() callback that returns true for the IO-APIC and vTPM MMIO addresses [1] https://lore.kernel.org/all/20211025122116.264793-1-ltykernel@gmail.com/ [2] https://lore.kernel.org/all/20211213071407.314309-1-ltykernel@gmail.com/ [ bp: Touchups. ] Signed-off-by: Michael Kelley <mikelley@microsoft.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/1679838727-87310-7-git-send-email-mikelley@microsoft.com	2023-03-27 09:31:43 +02:00
Michael Kelley	c7b5254bd8	x86/mm: Handle decryption/re-encryption of bss_decrypted consistently sme_postprocess_startup() decrypts the bss_decrypted section when sme_me_mask is non-zero. mem_encrypt_free_decrypted_mem() re-encrypts the unused portion based on CC_ATTR_MEM_ENCRYPT. In a Hyper-V guest VM using vTOM, these conditions are not equivalent as sme_me_mask is always zero when using vTOM. Consequently, mem_encrypt_free_decrypted_mem() attempts to re-encrypt memory that was never decrypted. So check sme_me_mask in mem_encrypt_free_decrypted_mem() too. Hyper-V guests using vTOM don't need the bss_decrypted section to be decrypted, so skipping the decryption/re-encryption doesn't cause a problem. Signed-off-by: Michael Kelley <mikelley@microsoft.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/1678329614-3482-5-git-send-email-mikelley@microsoft.com	2023-03-27 09:23:21 +02:00
Michael Kelley	d33ddc92db	Drivers: hv: Explicitly request decrypted in vmap_pfn() calls Update vmap_pfn() calls to explicitly request that the mapping be for decrypted access to the memory. There's no change in functionality since the PFNs passed to vmap_pfn() are above the shared_gpa_boundary, implicitly producing a decrypted mapping. But explicitly requesting "decrypted" allows the code to work before and after changes that cause vmap_pfn() to mask the PFNs to being below the shared_gpa_boundary. Signed-off-by: Michael Kelley <mikelley@microsoft.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tianyu Lan <Tianyu.Lan@microsoft.com> Link: https://lore.kernel.org/r/1679838727-87310-4-git-send-email-mikelley@microsoft.com	2023-03-27 08:46:43 +02:00
Michael Kelley	71290be18f	x86/hyperv: Reorder code to facilitate future work Reorder some code to facilitate future work. No functional change. Signed-off-by: Michael Kelley <mikelley@microsoft.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tianyu Lan <Tianyu.Lan@microsoft.com> Link: https://lore.kernel.org/r/1679838727-87310-3-git-send-email-mikelley@microsoft.com	2023-03-27 07:56:40 +02:00
Michael Kelley	88e378d400	x86/ioremap: Add hypervisor callback for private MMIO mapping in coco VM Current code always maps MMIO devices as shared (decrypted) in a confidential computing VM. But Hyper-V guest VMs on AMD SEV-SNP with vTOM use a paravisor running in VMPL0 to emulate some devices, such as the IO-APIC and TPM. In such a case, the device must be accessed as private (encrypted) because the paravisor emulates the device at an address below vTOM, where all accesses are encrypted. Add a new hypervisor callback to determine if an MMIO address should be mapped private. The callback allows hypervisor-specific code to handle any quirks, the use of a paravisor, etc. in determining whether a mapping must be private. If the callback is not used by a hypervisor, default to returning "false", which is consistent with normal coco VM behavior. Use this callback as another special case to check for when doing ioremap(). Just checking the starting address is sufficient as an ioremap range must be all private or all shared. Also make the callback in early boot IO-APIC mapping code that uses the fixmap. [ bp: Touchups. ] Signed-off-by: Michael Kelley <mikelley@microsoft.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/1678329614-3482-2-git-send-email-mikelley@microsoft.com	2023-03-26 23:42:40 +02:00
Linus Torvalds	974fc94336	- Properly clear perf event status tracking in the AMD perf event overflow handler -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmQgQDAACgkQEsHwGGHe VUo1OA//YfWLXnc9CBN1S8EefXj0oHHogj3jkVMQLfWOoYmaknaXIHsOg8g4Kihs liUZO0tGWD4WdiUe/GpWmC0KIMtuZ8I1o0ML7LLdxpfexZbMLdlaqUHO68rlHDXt /nJ1mmcA3MJTkdDePIjpI1b8TM+4L8XWzi4XRxMXVbhquuQXGkcqWacq0/rh/0et /5iqzp88varuAwSHTrmX12keYEtRLsfL8+fy5WdQX7L9i1oiSU41IsZ95/wMmjbE +Y7t/EFKiIVH9+JdeM78HLAem+hIXZByNay6E6p+SveNtcjW1QRVviQ+QZ+c71hh 0PQgeuJzArd0NRG7TWLUktCuONj6LZZ7E3apQyzXTPcg6H/IX9JeQv1dR9UGyjx7 mAyrCSS9KzPiydiQLGlOQJeXCRwlNbo7fJKfX/uiCYErEhokQXztbLdHV3G/J/4v nuBQ2t50ZYLWbQoTZ8IOncrwTgDw7lSU8opNaE6G3P8Ut7BXy3f55YMRwKLg0WO7 9X9AYhCMCtUeLwrTHX1SwPZToDI0huCJNTPy94ioSpGSU951CvgXCzX/ZPOsAFjs ndw3TQlI3xDGJETa2Q1fev08UjSHOieFlX5MQ+zLSIf6cSy4s7DeeOOBLXZouKnc jTx+31U1lhGTsqb43ac18CNjpslu67GnyCpzMjy3UymohQzV/nE= =dK66 -----END PGP SIGNATURE----- Merge tag 'perf_urgent_for_v6.3_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fix from Borislav Petkov: - Properly clear perf event status tracking in the AMD perf event overflow handler * tag 'perf_urgent_for_v6.3_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/amd/core: Always clear status for idx	2023-03-26 09:13:35 -07:00
Linus Torvalds	986c63741d	- Add a AMX ptrace self test - Prevent a false-positive warning when retrieving the (invalid) address of dynamic FPU features in their init state which are not saved in init_fpstate at all - Randomize per-CPU entry areas only when KASLR is enabled -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmQgPFAACgkQEsHwGGHe VUrAfA//QyZE5JnH0Ber3upRlZ/dPSNKIaOX6DMLshGj7QDqs2utTnjc4pwaqGWD OWpPuAJvOo2+NsN4nfB12venasIzseXDBBhEw6a5kYx73QmFbZ4XswFBLl2Eh8we cFbqU4B8SQvFQaahZ4kRRHpsmNGEPYRvgh2lBjcKUJBUaCuu6KoqE9+I3t173Obc sPfkXmhintDjYIjKfllN78rsBq4uCCaOVu5u299ZFMdBakRtx0M7U3547+4hwoE3 txP+VK+TPs8e64XJtCTem1br8HXNt/W5pC4IoQPnH8V+FLhUp1iIz6FpVHnJ7VMD 9c8VL7e8BNXhKkQn8sSkSVUZV3xNP7n4MbKKbba3f6EWPZnI28WQ3w09LUte/1aa hHEHyjMVyJfUiAcfuE1gZflG1+TqT8GkQJ+hqG9+/iSCWftOMuhfsKCROCLGhltJ yYBoyR2ZC1ErSLIOvgYAEUIeZ9FkzreOU0Pit6P/5qaPu+EXw3uDzoZB0WQH40Z5 PQwz04/s3idPwbfCZDOyNc7QZwxbGu1ESkdiTtCJmbBLW0MkWiBCnf/qZsK7PdD1 Q2qmx86ewIo6QipJpGK9pqWuzwFYNEJJHn3P7T1CcYQnQb+61m+b6WeYozQCgyMF 0dII6JulW98/WzjVgH6zUA0a0dicO7FM9H6iEGqlIcvxv0PuM7M= =eZTj -----END PGP SIGNATURE----- Merge tag 'x86_urgent_for_v6.3_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - Add a AMX ptrace self test - Prevent a false-positive warning when retrieving the (invalid) address of dynamic FPU features in their init state which are not saved in init_fpstate at all - Randomize per-CPU entry areas only when KASLR is enabled * tag 'x86_urgent_for_v6.3_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: selftests/x86/amx: Add a ptrace test x86/fpu/xstate: Prevent false-positive warning in __copy_xstate_uabi_buf() x86/mm: Do not shuffle CPU entry areas without KASLR	2023-03-26 09:01:24 -07:00
Linus Torvalds	2495697422	xen: branch for v6.3-rc4 -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCZB2KdAAKCRCAXGG7T9hj vseqAP9OlHO4qqfTWlSmYSPisfWwDT6CM7+K+4vWpMXFh3ZGuAEAhER0mNM1ikoB ZF7Ash778XPt2CaapQLsHtFZqJUn5gw= =ouzt -----END PGP SIGNATURE----- Merge tag 'for-linus-6.3-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen fixes from Juergen Gross: - fix build warning - avoid concurrent accesses to the Xen PV console ring page * tag 'for-linus-6.3-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: x86/PVH: avoid 32-bit build warning when obtaining VGA console info hvc/xen: prevent concurrent accesses to the shared ring	2023-03-24 09:44:43 -07:00
Valentin Schneider	4c8c3c7f70	treewide: Trace IPIs sent via smp_send_reschedule() To be able to trace invocations of smp_send_reschedule(), rename the arch-specific definitions of it to arch_smp_send_reschedule() and wrap it into an smp_send_reschedule() that contains a tracepoint. Changes to include the declaration of the tracepoint were driven by the following coccinelle script: @func_use@ @@ smp_send_reschedule(...); @include@ @@ #include <trace/events/ipi.h> @no_include depends on func_use && !include@ @@ #include <...> + + #include <trace/events/ipi.h> [csky bits] [riscv bits] Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Guo Ren <guoren@kernel.org> Acked-by: Palmer Dabbelt <palmer@rivosinc.com> Link: https://lore.kernel.org/r/20230307143558.294354-6-vschneid@redhat.com	2023-03-24 11:01:28 +01:00
Mathias Krause	12aad91647	KVM: x86: Shrink struct kvm_pmu Move the 'version' member to the beginning of the structure to reuse an existing hole instead of introducing another one. This allows us to save 8 bytes for 64 bit builds. Signed-off-by: Mathias Krause <minipli@grsecurity.net> Link: https://lore.kernel.org/r/20230217193336.15278-2-minipli@grsecurity.net Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-23 16:10:01 -07:00
Robert Hoo	99b3086980	KVM: x86: Remove a redundant guest cpuid check in kvm_set_cr4() If !guest_cpuid_has(vcpu, X86_FEATURE_PCID), CR4.PCIDE would have been in vcpu->arch.cr4_guest_rsvd_bits and failed earlier kvm_is_valid_cr4() check. Remove this meaningless check. Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Robert Hoo <robert.hu@linux.intel.com> Fixes: `4683d758f4` ("KVM: x86: Supplement __cr4_reserved_bits() with X86_FEATURE_PCID check") Link: https://lore.kernel.org/r/20230308072936.1293101-1-robert.hu@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-23 16:08:29 -07:00
Sean Christopherson	65966aaca1	KVM: x86: Assert that the emulator doesn't load CS with garbage in !RM Yell loudly if KVM attempts to load CS outside of Real Mode without an accompanying control transfer type, i.e. on X86_TRANSFER_NONE. KVM uses X86_TRANSFER_NONE when emulating IRET and exceptions/interrupts for Real Mode, but IRET emulation for Protected Mode is non-existent. WARN instead of trying to pass in a less-wrong type, e.g. X86_TRANSFER_RET, as emulating IRET goes even beyond emulating FAR RET (which KVM also doesn't fully support). Reported-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Link: https://lore.kernel.org/r/20230216202254.671772-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-23 16:07:52 -07:00
Sean Christopherson	3d8f61bf8b	x86: KVM: Add common feature flag for AMD's PSFD Use a common X86_FEATURE_* flag for AMD's PSFD, and suppress it from /proc/cpuinfo via the standard method of an empty string instead of hacking in a one-off "private" #define in KVM. The request that led to KVM defining its own flag was really just that the feature not show up in /proc/cpuinfo, and additional patches+discussions in the interim have clarified that defining flags in cpufeatures.h purely so that KVM can advertise features to userspace is ok so long as the kernel already uses a word to track the associated CPUID leaf. No functional change intended. Link: https://lore.kernel.org/all/d1b1e0da-29f0-c443-6c86-9549bbe1c79d@redhat.como Link: https://lore.kernel.org/all/YxGZH7aOXQF7Pu5q@nazgul.tnic Link: https://lore.kernel.org/all/Y3O7UYWfOLfJkwM%2F@zn.tnic Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20230124194519.2893234-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-23 16:07:29 -07:00
Josh Poimboeuf	fb799447ae	x86,objtool: Split UNWIND_HINT_EMPTY in two Mark reported that the ORC unwinder incorrectly marks an unwind as reliable when the unwind terminates prematurely in the dark corners of return_to_handler() due to lack of information about the next frame. The problem is UNWIND_HINT_EMPTY is used in two different situations: 1) The end of the kernel stack unwind before hitting user entry, boot code, or fork entry 2) A blind spot in ORC coverage where the unwinder has to bail due to lack of information about the next frame The ORC unwinder has no way to tell the difference between the two. When it encounters an undefined stack state with 'end=1', it blindly marks the stack reliable, which can break the livepatch consistency model. Fix it by splitting UNWIND_HINT_EMPTY into UNWIND_HINT_UNDEFINED and UNWIND_HINT_END_OF_STACK. Reported-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/fd6212c8b450d3564b855e1cb48404d6277b4d9f.1677683419.git.jpoimboe@kernel.org	2023-03-23 23:18:58 +01:00
Josh Poimboeuf	4708ea14be	x86,objtool: Separate unret validation from unwind hints The ENTRY unwind hint type is serving double duty as both an empty unwind hint and an unret validation annotation. Unret validation is unrelated to unwinding. Separate it out into its own annotation. Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/ff7448d492ea21b86d8a90264b105fbd0d751077.1677683419.git.jpoimboe@kernel.org	2023-03-23 23:18:58 +01:00
Josh Poimboeuf	f902cfdd46	x86,objtool: Introduce ORC_TYPE_* Unwind hints and ORC entry types are two distinct things. Separate them out more explicitly. Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/cc879d38fff8a43f8f7beb2fd56e35a5a384d7cd.1677683419.git.jpoimboe@kernel.org	2023-03-23 23:18:57 +01:00
Josh Poimboeuf	d88ebba45d	objtool: Change UNWIND_HINT() argument order The most important argument is 'type', make that one first. Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/d994f8c29376c5618c75698df28fc03b52d3a868.1677683419.git.jpoimboe@kernel.org	2023-03-23 23:18:57 +01:00
Josh Poimboeuf	1c0c1faf56	objtool: Use relative pointers for annotations They produce the needed relocations while using half the space. Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/bed05c64e28200220c9b1754a2f3ce71f73076ea.1677683419.git.jpoimboe@kernel.org	2023-03-23 23:18:56 +01:00
Santosh Shukla	0977cfac6e	KVM: nSVM: Implement support for nested VNMI Allow L1 to use vNMI to accelerate its injection of NMI to L2 by propagating vNMI int_ctl bits from/to vmcb12 to/from vmcb02. To handle both the case where vNMI is enabled for L1 and L2, and where vNMI is enabled for L1 but _not_ L2, move pending L1 vNMIs to nmi_pending on nested VM-Entry and raise KVM_REQ_EVENT, i.e. rely on existing code to route the NMI to the correct domain. On nested VM-Exit, reverse the process and set/clear V_NMI_PENDING for L1 based one whether nmi_pending is zero or non-zero. There is no need to consider vmcb02 in this case, as V_NMI_PENDING can be set in vmcb02 if vNMI is disabled for L2, and if vNMI is enabled for L2, then L1 and L2 have different NMI contexts. Co-developed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Santosh Shukla <santosh.shukla@amd.com> Link: https://lore.kernel.org/r/20230227084016.3368-12-santosh.shukla@amd.com [sean: massage changelog to match the code] Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-22 17:43:45 -07:00
Santosh Shukla	fa4c027a79	KVM: x86: Add support for SVM's Virtual NMI Add support for SVM's Virtual NMIs implementation, which adds proper tracking of virtual NMI blocking, and an intr_ctrl flag that software can set to mark a virtual NMI as pending. Pending virtual NMIs are serviced by hardware if/when virtual NMIs become unblocked, i.e. act more or less like real NMIs. Introduce two new kvm_x86_ops callbacks so to support SVM's vNMI, as KVM needs to treat a pending vNMI as partially injected. Specifically, if two NMIs (for L1) arrive concurrently in KVM's software model, KVM's ABI is to inject one and pend the other. Without vNMI, KVM manually tracks the pending NMI and uses NMI windows to detect when the NMI should be injected. With vNMI, the pending NMI is simply stuffed into the VMCB and handed off to hardware. This means that KVM needs to be able to set a vNMI pending on-demand, and also query if a vNMI is pending, e.g. to honor the "at most one NMI pending" rule and to preserve all NMIs across save and restore. Warn if KVM attempts to open an NMI window when vNMI is fully enabled, as the above logic should prevent KVM from ever getting to kvm_check_and_inject_events() with two NMIs pending _in software_, and the "at most one NMI pending" logic should prevent having an NMI pending in hardware and an NMI pending in software if NMIs are also blocked, i.e. if KVM can't immediately inject the second NMI. Signed-off-by: Santosh Shukla <Santosh.Shukla@amd.com> Co-developed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20230227084016.3368-11-santosh.shukla@amd.com [sean: rewrite shortlog and changelog, massage code comments] Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-22 17:43:22 -07:00
Sean Christopherson	bdedff2631	KVM: x86: Route pending NMIs from userspace through process_nmi() Use the asynchronous NMI queue to handle pending NMIs coming in from userspace during KVM_SET_VCPU_EVENTS so that all of KVM's logic for handling multiple NMIs goes through process_nmi(). This will simplify supporting SVM's upcoming "virtual NMI" functionality, which will need changes KVM manages pending NMIs. Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-22 17:40:16 -07:00
Santosh Shukla	1c4522ab13	KVM: SVM: Add definitions for new bits in VMCB::int_ctrl related to vNMI Add defines for three new bits in VMVC::int_ctrl that are part of SVM's Virtual NMI (vNMI) support: V_NMI_PENDING_MASK(11) - Virtual NMI is pending V_NMI_BLOCKING_MASK(12) - Virtual NMI is masked V_NMI_ENABLE_MASK(26) - Enable NMI virtualization To "inject" an NMI, the hypervisor (KVM) sets V_NMI_PENDING. When the CPU services the pending vNMI, hardware clears V_NMI_PENDING and sets V_NMI_BLOCKING, e.g. to indicate that the vCPU is handling an NMI. Hardware clears V_NMI_BLOCKING upon successful execution of IRET, or if a VM-Exit occurs while delivering the virtual NMI. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Santosh Shukla <santosh.shukla@amd.com> Link: https://lore.kernel.org/r/20230227084016.3368-10-santosh.shukla@amd.com [sean: massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-22 17:37:44 -07:00
Binbin Wu	68f7c82ab1	KVM: x86: Change return type of is_long_mode() to bool Change return type of is_long_mode() to bool to avoid implicit cast, as literally every user of is_long_mode() treats its return value as a boolean. Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://lore.kernel.org/r/20230322045824.22970-5-binbin.wu@linux.intel.com Reviewed-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-22 15:44:33 -07:00
Chang S. Bae	a03c376eba	x86/arch_prctl: Add AMX feature numbers as ABI constants Each distinct XSAVE feature has a number assigned to it. Among other things, the number determines the ordering of features in the XSAVE buffer and is also used to generate XSAVE bitmasks like the value for XCR0. AMX state is dynamically enabled by the architecture-specific prctl(). This prctl() takes one XSAVE feature number as an argument. However, the feature numbers are not defined in any readily available userspace headers. The means that each userspace app trying to use dynamic feature prctl()s will likely end up defining their own constants for each feature. Since these feature numbers are a part of the uabi, expose them in the prctl() uabi header. Save everyone the trouble of looking them up and defining their own. [ dhansen: expand changelog a bit ] Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/all/20230121001900.14900-3-chang.seok.bae%40intel.com	2023-03-22 13:02:33 -07:00
Sean Christopherson	3763bf5802	x86/cpufeatures: Redefine synthetic virtual NMI bit as AMD's "real" vNMI The existing X86_FEATURE_VNMI is a synthetic feature flag that exists purely to maintain /proc/cpuinfo's ABI, the "real" Intel vNMI feature flag is tracked as VMX_FEATURE_VIRTUAL_NMIS, as the feature is enumerated through VMX MSRs, not CPUID. AMD is also gaining virtual NMI support, but in true VMX vs. SVM form, enumerates support through CPUID, i.e. wants to add real feature flag for vNMI. Redefine the syntheic X86_FEATURE_VNMI to AMD's real CPUID bit to avoid having both X86_FEATURE_VNMI and e.g. X86_FEATURE_AMD_VNMI. Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-22 12:34:34 -07:00
Sean Christopherson	ab2ee212a5	KVM: x86: Save/restore all NMIs when multiple NMIs are pending Save all pending NMIs in KVM_GET_VCPU_EVENTS, and queue KVM_REQ_NMI if one or more NMIs are pending after KVM_SET_VCPU_EVENTS in order to re-evaluate pending NMIs with respect to NMI blocking. KVM allows multiple NMIs to be pending in order to faithfully emulate bare metal handling of simultaneous NMIs (on bare metal, truly simultaneous NMIs are impossible, i.e. one will always arrive first and be consumed). Support for simultaneous NMIs botched the save/restore though. KVM only saves one pending NMI, but allows userspace to restore 255 pending NMIs as kvm_vcpu_events.nmi.pending is a u8, and KVM's internal state is stored in an unsigned int. Fixes: `7460fb4a34` ("KVM: Fix simultaneous NMIs") Signed-off-by: Santosh Shukla <Santosh.Shukla@amd.com> Link: https://lore.kernel.org/r/20230227084016.3368-8-santosh.shukla@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-22 12:34:34 -07:00
Sean Christopherson	400fee8c9b	KVM: x86: Tweak the code and comment related to handling concurrent NMIs Tweak the code and comment that deals with concurrent NMIs to explicitly call out that x86 allows exactly one pending NMI, but that KVM needs to temporarily allow two pending NMIs in order to workaround the fact that the target vCPU cannot immediately recognize an incoming NMI, unlike bare metal. No functional change intended. Signed-off-by: Santosh Shukla <Santosh.Shukla@amd.com> Link: https://lore.kernel.org/r/20230227084016.3368-7-santosh.shukla@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-22 12:34:33 -07:00
Sean Christopherson	2cb9317377	KVM: x86: Raise an event request when processing NMIs if an NMI is pending Don't raise KVM_REQ_EVENT if no NMIs are pending at the end of process_nmi(). Finishing process_nmi() without a pending NMI will become much more likely when KVM gains support for AMD's vNMI, which allows pending vNMIs in hardware, i.e. doesn't require explicit injection. Signed-off-by: Santosh Shukla <Santosh.Shukla@amd.com> Link: https://lore.kernel.org/r/20230227084016.3368-6-santosh.shukla@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-22 12:34:32 -07:00
Maxim Levitsky	772f254d4d	KVM: SVM: add wrappers to enable/disable IRET interception SEV-ES guests don't use IRET interception for the detection of an end of a NMI. Therefore it makes sense to create a wrapper to avoid repeating the check for the SEV-ES. No functional change is intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> [Renamed iret intercept API of style svm_{clr,set}_iret_intercept()] Signed-off-by: Santosh Shukla <Santosh.Shukla@amd.com> Link: https://lore.kernel.org/r/20230227084016.3368-5-santosh.shukla@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-22 12:34:32 -07:00
Maxim Levitsky	5d1ec45652	KVM: nSVM: Raise event on nested VM exit if L1 doesn't intercept IRQs If L1 doesn't intercept interrupts, then KVM will use vmcb02's V_IRQ to detect an interrupt window for L1 IRQs. On a subsequent nested VM-Exit, KVM might need to copy the current V_IRQ from vmcb02 to vmcb01 to continue waiting for an interrupt window, i.e. if there is still a pending IRQ for L1. Raise KVM_REQ_EVENT on nested exit if L1 isn't intercepting IRQs to ensure that KVM will re-enable interrupt window detection if needed. Note that this is a theoretical bug because KVM already raises KVM_REQ_EVENT on each nested VM exit, because the nested VM exit resets RFLAGS and kvm_set_rflags() raises the KVM_REQ_EVENT unconditionally. Explicitly raise KVM_REQ_EVENT for the interrupt window case to avoid having an unnecessary dependency on kvm_set_rflags(), and to document the scenario. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> [santosh: reworded description as per Sean's v2 comment] Signed-off-by: Santosh Shukla <Santosh.Shukla@amd.com> Link: https://lore.kernel.org/r/20230227084016.3368-4-santosh.shukla@amd.com [sean: further massage changelog and comment] Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-22 12:33:58 -07:00
Santosh Shukla	7334ede457	KVM: nSVM: Disable intercept of VINTR if saved L1 host RFLAGS.IF is 0 Disable intercept of virtual interrupts (used to detect interrupt windows) if the saved host (L1) RFLAGS.IF is '0', as the effective RFLAGS.IF for L1 interrupts will never be set while L2 is running (L2's RFLAGS.IF doesn't affect L1 IRQs when virtual interrupts are enabled). Suggested-by: Sean Christopherson <seanjc@google.com> Link: https://lkml.kernel.org/r/Y9hybI65So5X2LFg%40google.com Signed-off-by: Santosh Shukla <Santosh.Shukla@amd.com> Link: https://lore.kernel.org/r/20230227084016.3368-3-santosh.shukla@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-22 12:19:40 -07:00
Santosh Shukla	5faaffab5b	KVM: nSVM: Don't sync vmcb02 V_IRQ back to vmcb12 if KVM (L0) is intercepting VINTR Don't sync vmcb02 V_IRQ back to vmcb12 if KVM (L0) is intercepting virtual interrupts in order to request an interrupt window, as KVM has usurped vmcb02's int_ctl. If an interrupt window opens before the next VM-Exit, svm_clear_vintr() will restore vmcb12's int_ctl. If no window opens, V_IRQ will be correctly preserved in vmcb12's int_ctl (because it was never recognized while L2 was running). Suggested-by: Sean Christopherson <seanjc@google.com> Link: https://lkml.kernel.org/r/Y9hybI65So5X2LFg%40google.com Signed-off-by: Santosh Shukla <Santosh.Shukla@amd.com> Link: https://lore.kernel.org/r/20230227084016.3368-2-santosh.shukla@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-22 12:19:10 -07:00
Luis Chamberlain	89d7971eb2	x86: Simplify one-level sysctl registration for itmt_kern_table There is no need to declare an extra tables to just create directory, this can be easily be done with a prefix path with register_sysctl(). Simplify this registration. Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20230310233248.3965389-3-mcgrof%40kernel.org	2023-03-22 11:47:21 -07:00
Luis Chamberlain	3f6cc47f5e	x86: Simplify one-level sysctl registration for abi_table2 There is no need to declare an extra tables to just create directory, this can be easily be done with a prefix path with register_sysctl(). Simplify this registration. Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20230310233248.3965389-2-mcgrof%40kernel.org	2023-03-22 11:47:21 -07:00
Kirill A. Shutemov	7a3a401874	x86/tdx: Drop flags from __tdx_hypercall() After TDX_HCALL_ISSUE_STI got dropped, the only flag left is TDX_HCALL_HAS_OUTPUT. The flag indicates if the caller wants to see tdx_hypercall_args updated based on the hypercall output. Drop the flags and provide __tdx_hypercall_ret() that matches TDX_HCALL_HAS_OUTPUT semantics. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/20230321003511.9469-1-kirill.shutemov%40linux.intel.com	2023-03-22 11:36:05 -07:00
Uros Bizjak	22767544e9	x86/ACPI/boot: Improve __acpi_acquire_global_lock Improve __acpi_acquire_global_lock by using a temporary variable. This enables compiler to perform if-conversion and improves generated code from: ... 72a: d1 ea shr %edx 72c: 83 e1 fc and $0xfffffffc,%ecx 72f: 83 e2 01 and $0x1,%edx 732: 09 ca or %ecx,%edx 734: 83 c2 02 add $0x2,%edx 737: f0 0f b1 17 lock cmpxchg %edx,(%rdi) 73b: 75 e9 jne 726 <__acpi_acquire_global_lock+0x6> 73d: 83 e2 03 and $0x3,%edx 740: 31 c0 xor %eax,%eax 742: 83 fa 03 cmp $0x3,%edx 745: 0f 95 c0 setne %al 748: f7 d8 neg %eax to: ... 72a: d1 e9 shr %ecx 72c: 83 e2 fc and $0xfffffffc,%edx 72f: 83 e1 01 and $0x1,%ecx 732: 09 ca or %ecx,%edx 734: 83 c2 02 add $0x2,%edx 737: f0 0f b1 17 lock cmpxchg %edx,(%rdi) 73b: 75 e9 jne 726 <__acpi_acquire_global_lock+0x6> 73d: 8d 41 ff lea -0x1(%rcx),%eax BTW: the compiler could generate: lea 0x2(%rcx,%rdx,1),%edx instead of: or %ecx,%edx add $0x2,%edx but unwated conversion from add to or when bits are known to be zero prevents this improvement. This is GCC PR108477. No functional change intended. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lore.kernel.org/all/20230320212012.12704-1-ubizjak%40gmail.com	2023-03-22 11:32:39 -07:00
Andy Shevchenko	b72d6965ee	x86/platform/intel-mid: Remove unused definitions from intel-mid.h After a few rounds of removal and refactoring Intel MID related code some artifacts are left untouched. However, they are not used anywhere. Remove them. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20230216193958.2971-1-andriy.shevchenko%40linux.intel.com	2023-03-22 11:08:40 -07:00
Chang S. Bae	b158888402	x86/fpu/xstate: Prevent false-positive warning in __copy_xstate_uabi_buf() __copy_xstate_to_uabi_buf() copies either from the tasks XSAVE buffer or from init_fpstate into the ptrace buffer. Dynamic features, like XTILEDATA, have an all zeroes init state and are not saved in init_fpstate, which means the corresponding bit is not set in the xfeatures bitmap of the init_fpstate header. But __copy_xstate_to_uabi_buf() retrieves addresses for both the tasks xstate and init_fpstate unconditionally via __raw_xsave_addr(). So if the tasks XSAVE buffer has a dynamic feature set, then the address retrieval for init_fpstate triggers the warning in __raw_xsave_addr() which checks the feature bit in the init_fpstate header. Remove the address retrieval from init_fpstate for extended features. They have an all zeroes init state so init_fpstate has zeros for them. Then zeroing the user buffer for the init state is the same as copying them from init_fpstate. Fixes: `2308ee57d9` ("x86/fpu/amx: Enable the AMX feature in 64-bit mode") Reported-by: Mingwei Zhang <mizhang@google.com> Link: https://lore.kernel.org/kvm/20230221163655.920289-2-mizhang@google.com/ Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Mingwei Zhang <mizhang@google.com> Link: https://lore.kernel.org/all/20230227210504.18520-2-chang.seok.bae%40intel.com Cc: stable@vger.kernel.org	2023-03-22 10:59:13 -07:00
Michal Koutný	a3f547addc	x86/mm: Do not shuffle CPU entry areas without KASLR The commit `97e3d26b5e` ("x86/mm: Randomize per-cpu entry area") fixed an omission of KASLR on CPU entry areas. It doesn't take into account KASLR switches though, which may result in unintended non-determinism when a user wants to avoid it (e.g. debugging, benchmarking). Generate only a single combination of CPU entry areas offsets -- the linear array that existed prior randomization when KASLR is turned off. Since we have `3f148f3318` ("x86/kasan: Map shadow for percpu pages on demand") and followups, we can use the more relaxed guard kasrl_enabled() (in contrast to kaslr_memory_enabled()). Fixes: `97e3d26b5e` ("x86/mm: Randomize per-cpu entry area") Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20230306193144.24605-1-mkoutny%40suse.com	2023-03-22 10:42:47 -07:00
Binbin Wu	627778bfcf	KVM: SVM: Use kvm_is_cr4_bit_set() to query SMAP/SMEP in "can emulate" Use kvm_is_cr4_bit_set() to query SMAP and SMEP when determining whether or not AMD's SMAP+SEV errata prevents KVM from emulating an instruction. This eliminates an implicit cast from ulong to bool and makes the code slightly more readable. Note, any overhead from making multiple calls to kvm_read_cr4_bits() is negligible, not to mention the code is question is encountered only in rare situations, i.e. is not a remotely hot path. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://lore.kernel.org/r/20230322045824.22970-4-binbin.wu@linux.intel.com [sean: keep local smap/smep variables, massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-22 10:18:18 -07:00
Binbin Wu	bede6eb4db	KVM: x86: Use boolean return value for is_{pae,pse,paging}() Convert is_{pae,pse,paging}() to use kvm_is_cr{0,4}_bit_set() and return bools. Returning an "int" requires not one, but two implicit casts, first from "unsigned long" to "int", and then again to a "bool". Both casts are more than a bit dangerous; the ulong=>int casts would drop a bit on 64-bit kernels _if_ the bits in question weren't in the lower 32 bits, and the int=>bool cast can result in false negatives/positives, e.g. see commit `0c928ff26b` ("KVM: SVM: Fix benign "bool vs. int" comparison in svm_set_cr0()"). Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://lore.kernel.org/r/20230322045824.22970-3-binbin.wu@linux.intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-22 10:18:11 -07:00
Binbin Wu	607475cfa0	KVM: x86: Add helpers to query individual CR0/CR4 bits Add helpers to check if a specific CR0/CR4 bit is set to avoid a plethora of implicit casts from the "unsigned long" return of kvm_read_cr*_bits(), and to make each caller's intent more obvious. Defer converting helpers that do truly ugly casts from "unsigned long" to "int", e.g. is_pse(), to a future commit so that their conversion is more isolated. Opportunistically drop the superfluous pcid_enabled from kvm_set_cr3(); the local variable is used only once, immediately after its declaration. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://lore.kernel.org/r/20230322045824.22970-2-binbin.wu@linux.intel.com [sean: move "obvious" conversions to this commit, massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-22 10:10:53 -07:00
Sean Christopherson	0c928ff26b	KVM: SVM: Fix benign "bool vs. int" comparison in svm_set_cr0() Explicitly convert the return from is_paging() to a bool when comparing against old_paging, which is also a boolean. is_paging() sneakily uses kvm_read_cr0_bits() and returns an int, i.e. returns X86_CR0_PG or 0, not 1 or 0. Luckily, the bug is benign as it only results in a false positive, not a false negative, i.e. only causes a spurious refresh of CR4 when paging is enabled in both the old and new. Cc: Maxim Levitsky <mlevitsk@redhat.com> Fixes: `c53bbe2145` ("KVM: x86: SVM: don't passthrough SMAP/SMEP/PKE bits in !NPT && !gCR0.PG case") Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-22 10:10:37 -07:00
Jan Beulich	aadbd07ff8	x86/PVH: avoid 32-bit build warning when obtaining VGA console info In the commit referenced below I failed to pay attention to this code also being buildable as 32-bit. Adjust the type of "ret" - there's no real need for it to be wider than 32 bits. Fixes: `934ef33ee7` ("x86/PVH: obtain VGA console info in Dom0") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/2d2193ff-670b-0a27-e12d-2c5c4c121c79@suse.com Signed-off-by: Juergen Gross <jgross@suse.com>	2023-03-22 16:59:46 +01:00
Mathias Krause	fb509f76ac	KVM: VMX: Make CR0.WP a guest owned bit Guests like grsecurity that make heavy use of CR0.WP to implement kernel level W^X will suffer from the implied VMEXITs. With EPT there is no need to intercept a guest change of CR0.WP, so simply make it a guest owned bit if we can do so. This implies that a read of a guest's CR0.WP bit might need a VMREAD. However, the only potentially affected user seems to be kvm_init_mmu() which is a heavy operation to begin with. But also most callers already cache the full value of CR0 anyway, so no additional VMREAD is needed. The only exception is nested_vmx_load_cr3(). This change is VMX-specific, as SVM has no such fine grained control register intercept control. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Mathias Krause <minipli@grsecurity.net> Link: https://lore.kernel.org/r/20230322013731.102955-7-minipli@grsecurity.net Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-22 07:47:26 -07:00
Mathias Krause	74cdc83691	KVM: x86: Make use of kvm_read_cr*_bits() when testing bits Make use of the kvm_read_cr{0,4}_bits() helper functions when we only want to know the state of certain bits instead of the whole register. This not only makes the intent cleaner, it also avoids a potential VMREAD in case the tested bits aren't guest owned. Signed-off-by: Mathias Krause <minipli@grsecurity.net> Link: https://lore.kernel.org/r/20230322013731.102955-5-minipli@grsecurity.net Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-22 07:47:25 -07:00
Mathias Krause	e40bcf9f3a	KVM: x86: Ignore CR0.WP toggles in non-paging mode If paging is disabled, there are no permission bits to emulate. Micro-optimize this case to avoid unnecessary work. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Mathias Krause <minipli@grsecurity.net> Link: https://lore.kernel.org/r/20230322013731.102955-4-minipli@grsecurity.net Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-22 07:47:24 -07:00
Mathias Krause	01b31714bd	KVM: x86: Do not unload MMU roots when only toggling CR0.WP with TDP enabled There is no need to unload the MMU roots with TDP enabled when only CR0.WP has changed -- the paging structures are still valid, only the permission bitmap needs to be updated. One heavy user of toggling CR0.WP is grsecurity's KERNEXEC feature to implement kernel W^X. The optimization brings a huge performance gain for this case as the following micro-benchmark running 'ssdd 10 50000' from rt-tests[1] on a grsecurity L1 VM shows (runtime in seconds, lower is better): legacy TDP shadow kvm-x86/next@d8708b 8.43s 9.45s 70.3s +patch 5.39s 5.63s 70.2s For legacy MMU this is ~36% faster, for TDP MMU even ~40% faster. Also TDP and legacy MMU now both have a similar runtime which vanishes the need to disable TDP MMU for grsecurity. Shadow MMU sees no measurable difference and is still slow, as expected. [1] https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git Signed-off-by: Mathias Krause <minipli@grsecurity.net> Link: https://lore.kernel.org/r/20230322013731.102955-3-minipli@grsecurity.net Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-22 07:47:24 -07:00
Mathias Krause	50f1399845	KVM: x86/mmu: Fix comment typo Fix a small comment typo in make_spte(). Signed-off-by: Mathias Krause <minipli@grsecurity.net> Link: https://lore.kernel.org/r/20230322013731.102955-6-minipli@grsecurity.net Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-22 07:46:53 -07:00
Paolo Bonzini	2fdcc1b324	KVM: x86/mmu: Avoid indirect call for get_cr3 Most of the time, calls to get_guest_pgd result in calling kvm_read_cr3 (the exception is only nested TDP). Hardcode the default instead of using the get_cr3 function, avoiding a retpoline if they are enabled. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Mathias Krause <minipli@grsecurity.net> Link: https://lore.kernel.org/r/20230322013731.102955-2-minipli@grsecurity.net Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-22 07:46:42 -07:00
Fangrui Song	aff69273af	vdso: Improve cmd_vdso_check to check all dynamic relocations The actual intention is that no dynamic relocation exists in the VDSO. For this the VDSO build validates that the resulting .so file does not have any relocations which are specified via $(ARCH_REL_TYPE_ABS) per architecture, which is fragile as e.g. ARM64 lacks an entry for R_AARCH64_RELATIVE. Aside of that ARCH_REL_TYPE_ABS is a misnomer as it checks for relative relocations too. However, some GNU ld ports produce unneeded R__NONE relocation entries. If a port fails to determine the exact .rel[a].dyn size, the trailing zeros become R__NONE relocations. E.g. ld's powerpc port recently fixed https://sourceware.org/bugzilla/show_bug.cgi?id=29540). R__NONE are generally a no-op in the dynamic loaders. So just ignore them. Remove the ARCH_REL_TYPE_ABS defines and just validate that the resulting .so file does not contain any R_ relocation entries except R_*_NONE. Signed-off-by: Fangrui Song <maskray@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Vincenzo Frascino <vincenzo.frascino@arm.com> # for aarch64 Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com> # for vDSO, aarch64 Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Link: https://lore.kernel.org/r/20230310190750.3323802-1-maskray@google.com	2023-03-21 21:15:34 +01:00
Mark Rutland	fee86a4ed5	ftrace: selftest: remove broken trace_direct_tramp The ftrace selftest code has a trace_direct_tramp() function which it uses as a direct call trampoline. This happens to work on x86, since the direct call's return address is in the usual place, and can be returned to via a RET, but in general the calling convention for direct calls is different from regular function calls, and requires a trampoline written in assembly. On s390, regular function calls place the return address in %r14, and an ftrace patch-site in an instrumented function places the trampoline's return address (which is within the instrumented function) in %r0, preserving the original %r14 value in-place. As a regular C function will return to the address in %r14, using a C function as the trampoline results in the trampoline returning to the caller of the instrumented function, skipping the body of the instrumented function. Note that the s390 issue is not detcted by the ftrace selftest code, as the instrumented function is trivial, and returning back into the caller happens to be equivalent. On arm64, regular function calls place the return address in x30, and an ftrace patch-site in an instrumented function saves this into r9 and places the trampoline's return address (within the instrumented function) in x30. A regular C function will return to the address in x30, but will not restore x9 into x30. Consequently, using a C function as the trampoline results in returning to the trampoline's return address having corrupted x30, such that when the instrumented function returns, it will return back into itself. To avoid future issues in this area, remove the trace_direct_tramp() function, and require that each architecture with direct calls provides a stub trampoline, named ftrace_stub_direct_tramp. This can be written to handle the architecture's trampoline calling convention, and in future could be used elsewhere (e.g. in the ftrace ops sample, to measure the overhead of direct calls), so we may as well always build it in. Link: https://lkml.kernel.org/r/20230321140424.345218-8-revest@chromium.org Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Li Huafei <lihuafei1@huawei.com> Cc: Xu Kuohai <xukuohai@huawei.com> Signed-off-by: Florent Revest <revest@chromium.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2023-03-21 13:59:29 -04:00
Yu Zhang	f6cde92083	KVM: nVMX: Add helpers to setup VMX control msr configs nested_vmx_setup_ctls_msrs() is used to set up the various VMX MSR controls for nested VMX. But it is a bit lengthy, just add helpers to setup the configuration of VMX MSRs. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> Link: https://lore.kernel.org/r/20230119141946.585610-2-yu.c.zhang@linux.intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-21 10:20:28 -07:00
Yu Zhang	ad36aab37a	KVM: nVMX: Remove outdated comments in nested_vmx_setup_ctls_msrs() nested_vmx_setup_ctls_msrs() initializes the vmcs_conf.nested, which stores the global VMX MSR configurations when nested is supported, regardless of any particular CPUID settings for one VM. Commit `6defc59184` ("KVM: nVMX: include conditional controls in /dev/kvm KVM_GET_MSRS") added the some feature flags for secondary proc-based controls, so that those features can be available in KVM_GET_MSRS. Yet this commit did not remove the obsolete comments in nested_vmx_setup_ctls_msrs(). Just fix the comments, and no functional change intended. Fixes: `6defc59184` ("KVM: nVMX: include conditional controls in /dev/kvm KVM_GET_MSRS") Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> Link: https://lore.kernel.org/r/20230119141946.585610-1-yu.c.zhang@linux.intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-21 10:11:57 -07:00
Dionna Glaze	0144e3b85d	x86/sev: Change snp_guest_issue_request()'s fw_err argument The GHCB specification declares that the firmware error value for a guest request will be stored in the lower 32 bits of EXIT_INFO_2. The upper 32 bits are for the VMM's own error code. The fw_err argument to snp_guest_issue_request() is thus a misnomer, and callers will need access to all 64 bits. The type of unsigned long also causes problems, since sw_exit_info2 is u64 (unsigned long long) vs the argument's unsigned long*. Change this type for issuing the guest request. Pass the ioctl command struct's error field directly instead of in a local variable, since an incomplete guest request may not set the error code, and uninitialized stack memory would be written back to user space. The firmware might not even be called, so bookend the call with the no firmware call error and clear the error. Since the "fw_err" field is really exitinfo2 split into the upper bits' vmm error code and lower bits' firmware error code, convert the 64 bit value to a union. [ bp: - Massage commit message - adjust code - Fix a build issue as Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/oe-kbuild-all/202303070609.vX6wp2Af-lkp@intel.com - print exitinfo2 in hex Tom: - Correct -EIO exit case. ] Signed-off-by: Dionna Glaze <dionnaglaze@google.com> Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230214164638.1189804-5-dionnaglaze@google.com Link: https://lore.kernel.org/r/20230307192449.24732-12-bp@alien8.de	2023-03-21 15:43:19 +01:00
Artem Bityutskiy	872d28001b	perf/x86/cstate: Add Granite Rapids support Granite Rapids Xeon is successor or Emerald Rapids Xeon, and it will use the same C-state residency counters as Emerald Rapids (and previous Xeons, all the way back to Ice Lake Xeon). Add Granite Rapids Xeon support. Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230314170041.2967712-3-kan.liang@linux.intel.com	2023-03-21 14:43:08 +01:00
Kan Liang	5a796d5cb5	perf/x86/msr: Add Granite Rapids The same as Sapphire Rapids, the SMI_COUNT MSR is also supported on Granite Rapids. Add Granite Rapids model. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230314170041.2967712-2-kan.liang@linux.intel.com	2023-03-21 14:43:08 +01:00
Kan Liang	bc4000fdb0	perf/x86/intel: Add Granite Rapids From core PMU's perspective, Granite Rapids is similar to the Sapphire Rapids. The key differences include: - Doesn't need the AUX event workaround for the mem load event. (Implement in this patch). - Support Retire Latency (Has been implemented in the commit `c87a31093c` ("perf/x86: Support Retire Latency")) - The event list, which will be supported in the perf tool later. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230314170041.2967712-1-kan.liang@linux.intel.com	2023-03-21 14:43:08 +01:00

1 2 3 4 5 ...

43912 Commits