Patch series "mm: consolidate definitions of page table accessors", v2.
The low level page table accessors (pXY_index(), pXY_offset()) are
duplicated across all architectures and sometimes more than once. For
instance, we have 31 definition of pgd_offset() for 25 supported
architectures.
Most of these definitions are actually identical and typically it boils
down to, e.g.
static inline unsigned long pmd_index(unsigned long address)
{
return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
}
static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
{
return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
}
These definitions can be shared among 90% of the arches provided
XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.
For architectures that really need a custom version there is always
possibility to override the generic version with the usual ifdefs magic.
These patches introduce include/linux/pgtable.h that replaces
include/asm-generic/pgtable.h and add the definitions of the page table
accessors to the new header.
This patch (of 12):
The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the
functions involving page table manipulations, e.g. pte_alloc() and
pmd_alloc(). So, there is no point to explicitly include <asm/pgtable.h>
in the files that include <linux/mm.h>.
The include statements in such cases are remove with a simple loop:
for f in $(git grep -l "include <linux/mm.h>") ; do
sed -i -e '/include <asm\/pgtable.h>/ d' $f
done
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now the last users of show_stack() got converted to use an explicit log
level, show_stack_loglvl() can drop it's redundant suffix and become once
again well known show_stack().
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200418201944.482088-51-dima@arista.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It's under CONFIG_IOMMU_LEAK option which is enabled by debug config.
Likely the backtrace is worth to be seen - so aligning with log level of
error message in iommu_full().
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20200418201944.482088-46-dima@arista.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, the log-level of show_stack() depends on a platform
realization. It creates situations where the headers are printed with
lower log level or higher than the stacktrace (depending on a platform or
user).
Furthermore, it forces the logic decision from user to an architecture
side. In result, some users as sysrq/kdb/etc are doing tricks with
temporary rising console_loglevel while printing their messages. And in
result it not only may print unwanted messages from other CPUs, but also
omit printing at all in the unlucky case where the printk() was deferred.
Introducing log-level parameter and KERN_UNSUPPRESSED [1] seems an easier
approach than introducing more printk buffers. Also, it will consolidate
printings with headers.
Introduce show_stack_loglvl(), that eventually will substitute
show_stack().
[1]: https://lore.kernel.org/lkml/20190528002412.1625-1-dima@arista.com/T/#u
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20200418201944.482088-42-dima@arista.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, the log-level of show_stack() depends on a platform
realization. It creates situations where the headers are printed with
lower log level or higher than the stacktrace (depending on a platform or
user).
Furthermore, it forces the logic decision from user to an architecture
side. In result, some users as sysrq/kdb/etc are doing tricks with
temporary rising console_loglevel while printing their messages. And in
result it not only may print unwanted messages from other CPUs, but also
omit printing at all in the unlucky case where the printk() was deferred.
Introducing log-level parameter and KERN_UNSUPPRESSED [1] seems an easier
approach than introducing more printk buffers. Also, it will consolidate
printings with headers.
Keep log_lvl const show_trace_log_lvl() and printk_stack_address() as the
new generic show_stack_loglvl() wants to have a proper const qualifier.
And gcc rightfully produces warnings in case it's not keept:
arch/x86/kernel/dumpstack.c: In function `show_stack':
arch/x86/kernel/dumpstack.c:294:37: warning: passing argument 4 of `show_trace_log_lv ' discards `const' qualifier from pointer target type [-Wdiscarded-qualifiers]
294 | show_trace_log_lvl(task, NULL, sp, loglvl);
| ^~~~~~
arch/x86/kernel/dumpstack.c:163:32: note: expected `char *' but argument is of type `const char *'
163 | unsigned long *stack, char *log_lvl)
| ~~~~~~^~~~~~~
[1]: https://lore.kernel.org/lkml/20190528002412.1625-1-dima@arista.com/T/#u
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20200418201944.482088-41-dima@arista.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull x86 srbds fixes from Thomas Gleixner:
"The 9th episode of the dime novel "The performance killer" with the
subtitle "Slow Randomizing Boosts Denial of Service".
SRBDS is an MDS-like speculative side channel that can leak bits from
the random number generator (RNG) across cores and threads. New
microcode serializes the processor access during the execution of
RDRAND and RDSEED. This ensures that the shared buffer is overwritten
before it is released for reuse. This is equivalent to a full bus
lock, which means that many threads running the RNG instructions in
parallel have the same effect as the same amount of threads issuing a
locked instruction targeting an address which requires locking of two
cachelines at once.
The mitigation support comes with the usual pile of unpleasant
ingredients:
- command line options
- sysfs file
- microcode checks
- a list of vulnerable CPUs identified by model and stepping this
time which requires stepping match support for the cpu match logic.
- the inevitable slowdown of affected CPUs"
* branch 'x86/srbds' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/speculation: Add Ivy Bridge to affected list
x86/speculation: Add SRBDS vulnerability and mitigation documentation
x86/speculation: Add Special Register Buffer Data Sampling (SRBDS) mitigation
x86/cpu: Add 'table' argument to cpu_matches()
The conversion of x86 VDSO to the generic clock mode storage broke the
paravirt and hyperv clocksource logic. These clock sources have their own
internal sequence counter to validate the clocksource at the point of
reading it. This is necessary because the hypervisor can invalidate the
clocksource asynchronously so a check during the VDSO data update is not
sufficient. If the internal check during read invalidates the clocksource
the read return U64_MAX. The original code checked this efficiently by
testing whether the result (casted to signed) is negative, i.e. bit 63 is
set. This was done that way because an extra indicator for the validity had
more overhead.
The conversion broke this check because the check was replaced by a check
for a valid VDSO clock mode.
The wreckage manifests itself when the paravirt clock is installed as a
valid VDSO clock and during runtime invalidated by the hypervisor,
e.g. after a host suspend/resume cycle. After the invalidation the read
function returns U64_MAX which is used as cycles and makes the clock jump
by ~2200 seconds, and become stale until the 2200 seconds have elapsed
where it starts to jump again. The period of this effect depends on the
shift/mult pair of the clocksource and the jumps and staleness are an
artifact of undefined but reproducible behaviour of math overflow.
Implement an x86 version of the new vdso_cycles_ok() inline which adds this
check back and a variant of vdso_clocksource_ok() which lets the compiler
optimize it out to avoid the extra conditional. That's suboptimal when the
system does not have a VDSO capable clocksource, but that's not the case
which is optimized for.
Fixes: 5d51bee725 ("clocksource: Add common vdso clock mode storage")
Reported-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20200606221532.080560273@linutronix.de
Make x86_fpu_cache static now that FPU allocation and destruction is
handled entirely by common x86 code.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200608180218.20946-1-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
'jiffies' and 'jiffies_64' are meant to alias (two different symbols that
share the same address). Most architectures make the symbols alias to the
same address via a linker script assignment in their
arch/<arch>/kernel/vmlinux.lds.S:
jiffies = jiffies_64;
which is effectively a definition of jiffies.
jiffies and jiffies_64 are both forward declared for all architectures in
include/linux/jiffies.h. jiffies_64 is defined in kernel/time/timer.c.
x86_64 was peculiar in that it wasn't doing the above linker script
assignment, but rather was:
1. defining jiffies in arch/x86/kernel/time.c instead via the linker script.
2. overriding the symbol jiffies_64 from kernel/time/timer.c in
arch/x86/kernel/vmlinux.lds.s via 'jiffies_64 = jiffies;'.
As Fangrui notes:
In LLD, symbol assignments in linker scripts override definitions in
object files. GNU ld appears to have the same behavior. It would
probably make sense for LLD to error "duplicate symbol" but GNU ld
is unlikely to adopt for compatibility reasons.
This results in an ODR violation (UB), which seems to have survived
thus far. Where it becomes harmful is when;
1. -fno-semantic-interposition is used:
As Fangrui notes:
Clang after LLVM commit 5b22bcc2b70d
("[X86][ELF] Prefer to lower MC_GlobalAddress operands to .Lfoo$local")
defaults to -fno-semantic-interposition similar semantics which help
-fpic/-fPIC code avoid GOT/PLT when the referenced symbol is defined
within the same translation unit. Unlike GCC
-fno-semantic-interposition, Clang emits such relocations referencing
local symbols for non-pic code as well.
This causes references to jiffies to refer to '.Ljiffies$local' when
jiffies is defined in the same translation unit. Likewise, references to
jiffies_64 become references to '.Ljiffies_64$local' in translation units
that define jiffies_64. Because these differ from the names used in the
linker script, they will not be rewritten to alias one another.
2. Full LTO
Full LTO effectively treats all source files as one translation
unit, causing these local references to be produced everywhere. When
the linker processes the linker script, there are no longer any
references to jiffies_64' anywhere to replace with 'jiffies'. And
thus '.Ljiffies$local' and '.Ljiffies_64$local' no longer alias
at all.
In the process of porting patches enabling Full LTO from arm64 to x86_64,
spooky bugs have been observed where the kernel appeared to boot, but init
doesn't get scheduled.
Avoid the ODR violation by matching other architectures and define jiffies
only by linker script. For -fno-semantic-interposition + Full LTO, there
is no longer a global definition of jiffies for the compiler to produce a
local symbol which the linker script won't ensure aliases to jiffies_64.
Fixes: 40747ffa5a ("asmlinkage: Make jiffies visible")
Reported-by: Nathan Chancellor <natechancellor@gmail.com>
Reported-by: Alistair Delva <adelva@google.com>
Debugged-by: Nick Desaulniers <ndesaulniers@google.com>
Debugged-by: Sami Tolvanen <samitolvanen@google.com>
Suggested-by: Fangrui Song <maskray@google.com>
Signed-off-by: Bob Haarman <inglorion@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com> # build+boot on
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: stable@vger.kernel.org
Link: https://github.com/ClangBuiltLinux/linux/issues/852
Link: https://lkml.kernel.org/r/20200602193100.229287-1-inglorion@google.com
Currently, it is possible to enable indirect branch speculation even after
it was force-disabled using the PR_SPEC_FORCE_DISABLE option. Moreover, the
PR_GET_SPECULATION_CTRL command gives afterwards an incorrect result
(force-disabled when it is in fact enabled). This also is inconsistent
vs. STIBP and the documention which cleary states that
PR_SPEC_FORCE_DISABLE cannot be undone.
Fix this by actually enforcing force-disabled indirect branch
speculation. PR_SPEC_ENABLE called after PR_SPEC_FORCE_DISABLE now fails
with -EPERM as described in the documentation.
Fixes: 9137bb27e6 ("x86/speculation: Add prctl() control for indirect branch speculation")
Signed-off-by: Anthony Steinhauser <asteinhauser@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
On context switch the change of TIF_SSBD and TIF_SPEC_IB are evaluated
to adjust the mitigations accordingly. This is optimized to avoid the
expensive MSR write if not needed.
This optimization is buggy and allows an attacker to shutdown the SSBD
protection of a victim process.
The update logic reads the cached base value for the speculation control
MSR which has neither the SSBD nor the STIBP bit set. It then OR's the
SSBD bit only when TIF_SSBD is different and requests the MSR update.
That means if TIF_SSBD of the previous and next task are the same, then
the base value is not updated, even if TIF_SSBD is set. The MSR write is
not requested.
Subsequently if the TIF_STIBP bit differs then the STIBP bit is updated
in the base value and the MSR is written with a wrong SSBD value.
This was introduced when the per task/process conditional STIPB
switching was added on top of the existing SSBD switching.
It is exploitable if the attacker creates a process which enforces SSBD
and has the contrary value of STIBP than the victim process (i.e. if the
victim process enforces STIBP, the attacker process must not enforce it;
if the victim process does not enforce STIBP, the attacker process must
enforce it) and schedule it on the same core as the victim process. If
the victim runs after the attacker the victim becomes vulnerable to
Spectre V4.
To fix this, update the MSR value independent of the TIF_SSBD difference
and dependent on the SSBD mitigation method available. This ensures that
a subsequent STIPB initiated MSR write has the correct state of SSBD.
[ tglx: Handle X86_FEATURE_VIRT_SSBD & X86_FEATURE_VIRT_SSBD correctly
and massaged changelog ]
Fixes: 5bfbe3ad58 ("x86/speculation: Prepare for per task indirect branch speculation control")
Signed-off-by: Anthony Steinhauser <asteinhauser@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
When STIBP is unavailable or enhanced IBRS is available, Linux
force-disables the IBPB mitigation of Spectre-BTB even when simultaneous
multithreading is disabled. While attempts to enable IBPB using
prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, ...) fail with
EPERM, the seccomp syscall (or its prctl(PR_SET_SECCOMP, ...) equivalent)
which are used e.g. by Chromium or OpenSSH succeed with no errors but the
application remains silently vulnerable to cross-process Spectre v2 attacks
(classical BTB poisoning). At the same time the SYSFS reporting
(/sys/devices/system/cpu/vulnerabilities/spectre_v2) displays that IBPB is
conditionally enabled when in fact it is unconditionally disabled.
STIBP is useful only when SMT is enabled. When SMT is disabled and STIBP is
unavailable, it makes no sense to force-disable also IBPB, because IBPB
protects against cross-process Spectre-BTB attacks regardless of the SMT
state. At the same time since missing STIBP was only observed on AMD CPUs,
AMD does not recommend using STIBP, but recommends using IBPB, so disabling
IBPB because of missing STIBP goes directly against AMD's advice:
https://developer.amd.com/wp-content/resources/Architecture_Guidelines_Update_Indirect_Branch_Control.pdf
Similarly, enhanced IBRS is designed to protect cross-core BTB poisoning
and BTB-poisoning attacks from user space against kernel (and
BTB-poisoning attacks from guest against hypervisor), it is not designed
to prevent cross-process (or cross-VM) BTB poisoning between processes (or
VMs) running on the same core. Therefore, even with enhanced IBRS it is
necessary to flush the BTB during context-switches, so there is no reason
to force disable IBPB when enhanced IBRS is available.
Enable the prctl control of IBPB even when STIBP is unavailable or enhanced
IBRS is available.
Fixes: 7cc765a67d ("x86/speculation: Enable prctl mode for spectre_v2_user")
Signed-off-by: Anthony Steinhauser <asteinhauser@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
This seems to lead to some crazy include loops when using
asm-generic/cacheflush.h on more architectures, so leave it to the arch
header for now.
[hch@lst.de: fix warning]
Link: http://lkml.kernel.org/r/20200520173520.GA11199@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Will Deacon <will@kernel.org>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Link: http://lkml.kernel.org/r/20200515143646.3857579-7-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit b1394e745b ("KVM: x86: fix APIC page invalidation") tried
to fix inappropriate APIC page invalidation by re-introducing arch
specific kvm_arch_mmu_notifier_invalidate_range() and calling it from
kvm_mmu_notifier_invalidate_range_start. However, the patch left a
possible race where the VMCS APIC address cache is updated *before*
it is unmapped:
(Invalidator) kvm_mmu_notifier_invalidate_range_start()
(Invalidator) kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD)
(KVM VCPU) vcpu_enter_guest()
(KVM VCPU) kvm_vcpu_reload_apic_access_page()
(Invalidator) actually unmap page
Because of the above race, there can be a mismatch between the
host physical address stored in the APIC_ACCESS_PAGE VMCS field and
the host physical address stored in the EPT entry for the APIC GPA
(0xfee0000). When this happens, the processor will not trap APIC
accesses, and will instead show the raw contents of the APIC-access page.
Because Windows OS periodically checks for unexpected modifications to
the LAPIC register, this will show up as a BSOD crash with BugCheck
CRITICAL_STRUCTURE_CORRUPTION (109) we are currently seeing in
https://bugzilla.redhat.com/show_bug.cgi?id=1751017.
The root cause of the issue is that kvm_arch_mmu_notifier_invalidate_range()
cannot guarantee that no additional references are taken to the pages in
the range before kvm_mmu_notifier_invalidate_range_end(). Fortunately,
this case is supported by the MMU notifier API, as documented in
include/linux/mmu_notifier.h:
* If the subsystem
* can't guarantee that no additional references are taken to
* the pages in the range, it has to implement the
* invalidate_range() notifier to remove any references taken
* after invalidate_range_start().
The fix therefore is to reload the APIC-access page field in the VMCS
from kvm_mmu_notifier_invalidate_range() instead of ..._range_start().
Cc: stable@vger.kernel.org
Fixes: b1394e745b ("KVM: x86: fix APIC page invalidation")
Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=197951
Signed-off-by: Eiichi Tsukata <eiichi.tsukata@nutanix.com>
Message-Id: <20200606042627.61070-1-eiichi.tsukata@nutanix.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
is_intercept takes an INTERCEPT_* constant, not SVM_EXIT_*; because
of this, the compiler was removing the body of the conditionals,
as if is_intercept returned 0.
This unveils a latent bug: when clearing the VINTR intercept,
int_ctl must also be changed in the L1 VMCB (svm->nested.hsave),
just like the intercept itself is also changed in the L1 VMCB.
Otherwise V_IRQ remains set and, due to the VINTR intercept being clear,
we get a spurious injection of a vector 0 interrupt on the next
L2->L1 vmexit.
Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
handle_vmptrst()/handle_vmread() stopped injecting #PF unconditionally
and switched to nested_vmx_handle_memory_failure() which just kills the
guest with KVM_EXIT_INTERNAL_ERROR in case of MMIO access, zeroing
'exception' in kvm_write_guest_virt_system() is not needed anymore.
This reverts commit 541ab2aeb2.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20200605115906.532682-2-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Syzbot reports the following issue:
WARNING: CPU: 0 PID: 6819 at arch/x86/kvm/x86.c:618
kvm_inject_emulated_page_fault+0x210/0x290 arch/x86/kvm/x86.c:618
...
Call Trace:
...
RIP: 0010:kvm_inject_emulated_page_fault+0x210/0x290 arch/x86/kvm/x86.c:618
...
nested_vmx_get_vmptr+0x1f9/0x2a0 arch/x86/kvm/vmx/nested.c:4638
handle_vmon arch/x86/kvm/vmx/nested.c:4767 [inline]
handle_vmon+0x168/0x3a0 arch/x86/kvm/vmx/nested.c:4728
vmx_handle_exit+0x29c/0x1260 arch/x86/kvm/vmx/vmx.c:6067
'exception' we're trying to inject with kvm_inject_emulated_page_fault()
comes from:
nested_vmx_get_vmptr()
kvm_read_guest_virt()
kvm_read_guest_virt_helper()
vcpu->arch.walk_mmu->gva_to_gpa()
but it is only set when GVA to GPA conversion fails. In case it doesn't but
we still fail kvm_vcpu_read_guest_page(), X86EMUL_IO_NEEDED is returned and
nested_vmx_get_vmptr() calls kvm_inject_emulated_page_fault() with zeroed
'exception'. This happen when the argument is MMIO.
Paolo also noticed that nested_vmx_get_vmptr() is not the only place in
KVM code where kvm_read/write_guest_virt*() return result is mishandled.
VMX instructions along with INVPCID have the same issue. This was already
noticed before, e.g. see commit 541ab2aeb2 ("KVM: x86: work around
leak of uninitialized stack contents") but was never fully fixed.
KVM could've handled the request correctly by going to userspace and
performing I/O but there doesn't seem to be a good need for such requests
in the first place.
Introduce vmx_handle_memory_failure() as an interim solution.
Note, nested_vmx_get_vmptr() now has three possible outcomes: OK, PF,
KVM_EXIT_INTERNAL_ERROR and callers need to know if userspace exit is
needed (for KVM_EXIT_INTERNAL_ERROR) in case of failure. We don't seem
to have a good enum describing this tristate, just add "int *ret" to
nested_vmx_get_vmptr() interface to pass the information.
Reported-by: syzbot+2a7156e11dc199bdbd8a@syzkaller.appspotmail.com
Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20200605115906.532682-1-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- enhance the dma pool to allow atomic allocation on x86 with AMD SEV
(David Rientjes)
- two small cleanups (Jason Yan and Peter Collingbourne)
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAl7bvTULHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYMJVhAAgTiWNzxPJhM6RTeRooM6W0NvcZGTJT6ExyJghaau
aJvHUjXPrRmeBM8Zjwbbu5dioncd8c7npfRjBvATaEL74pa1u9gH3jnUTxh6L4WQ
/FTNYryZVbprXJsdFuDZvCsO/CChqfZL8PWz+NFgIpICOyyXdorQELMhCaeOhnfU
/goq6SvKmPlmXdb4eM2fXRD7udt1qlp+Oq2EZUdT3Xb4CBFsWUYbOMde22VY390Z
2E9mEztOaKjNgAM/TfCoXo7iRUSwxcpO5aSliDhJJ/7uWaxyWTzFlaoIlwIkkNKb
TcguNJbIZtjIXwBMv9gS6CqVEgFymmWqX5Tr23+vbb7S/235HqKtN1dPmV2h4R0H
QOpvYXfm6kc4tpH4J32NMp+IqfQmwgMbNtUsiXWk5Lxl27cb8K2Q5eqEwxRWMbG+
HObO7Kzb8oCygWwozZ+3QcWSr+9QAgzsb4Jl4jg6adjd8LDcbmKo4B9TKptGpVnL
xjDleKdb/P4Vq55q9KHFLjqFUesuQIv2mKl2s+zr2BqROxjZ562kM9QHwsoCqc4Q
tFuVed+XOoT7yhdKdtwEK7lwcQBtZgP5l/HgsoosmuJ975holsQ4pbKSf4A2Y4yo
XwHYonSwOAEbi4nPxnvKIm4aUNq+PC44TH0VJcXud3tmQ/DGipdlLW8/nyw9ecfa
qaQ=
=GT3J
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-5.8' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping updates from Christoph Hellwig:
- enhance the dma pool to allow atomic allocation on x86 with AMD SEV
(David Rientjes)
- two small cleanups (Jason Yan and Peter Collingbourne)
* tag 'dma-mapping-5.8' of git://git.infradead.org/users/hch/dma-mapping:
dma-contiguous: fix comment for dma_release_from_contiguous
dma-pool: scale the default DMA coherent pool size with memory capacity
x86/mm: unencrypted non-blocking DMA allocations use coherent pools
dma-pool: add pool sizes to debugfs
dma-direct: atomic allocations must come from atomic coherent pools
dma-pool: dynamically expanding atomic pools
dma-pool: add additional coherent pools to map to gfp mask
dma-remap: separate DMA atomic pools from direct remap code
dma-debug: make __dma_entry_alloc_check_leak() static
now that toolchains long support PT_GNU_STACK marking and there's no
need anymore to force modern programs into having all its user mappings
executable instead of only the stack and the PROT_EXEC ones. Disable
that automatic READ_IMPLIES_EXEC forcing on x86-64 and arm64. Add tables
documenting how READ_IMPLIES_EXEC is handled on x86-64, arm and arm64.
By Kees Cook.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl7YFDIACgkQEsHwGGHe
VUpnzxAAmXdODNOb1gGQvt+KJthkfkWh2A2R+tWxCRmFtjFTcS/eRxFfvGu2KmFY
2b2AcJzuJeGjs7WIvQU0pkR2p6STyzuSBBLj5J/OJR9FonQ4pPah38df4A0fOgI6
GJyJV9Ie7O2Ph1w2iLOeWBdmR90CnYuabxsfipgOL+sjHlEI0RqLSDgARRQsxTEj
KM+JVAFD472KcUJnQKBVBOD1I1DOVBGu12r3y6chgsOtwshLNW/cO15cDgYrgnJZ
OlR3EIUukCEEc1KQzUCihsypLuGfrmdq1MyPN8CME8gLfmOBsJyGRDhvmdbS+Wxh
kAMYQ9BuNP/jMVtN950qV0qUtnZCeIPlj1sDb9STWz5fInLsXDSCS0eYi32yBFi+
7yviVU95ml6Mda1Qd5axItTHFAjKIn0qfMZszkLOtUszIzNinCgH7t3ThoXeV223
BqrpntRwiGZVpXDdcp0QFYBsWSMchR47yuhL8pB4SWxQzgNzXqAEg2KFQU0XMDKp
pdia9IzUozg/BrjG5cnRfZhq2lBra7fy3Dn6fw5+NR5vqhka0Wr8L6dyM1Rj74EU
HPk5bRXgt0OIiIFPi4139ApY7k+8j2nbf12qUchue1ZVVKzbvK996FDXbrGgW3zD
Wis1wglxB9urSUTmC1bMOeyOd+gebo3i/ACAjgSo+EbDN7qW0Qw=
=2L7y
-----END PGP SIGNATURE-----
Merge tag 'core_core_updates_for_5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull READ_IMPLIES_EXEC changes from Borislav Petkov:
"Split the old READ_IMPLIES_EXEC workaround from executable
PT_GNU_STACK now that toolchains long support PT_GNU_STACK marking and
there's no need anymore to force modern programs into having all its
user mappings executable instead of only the stack and the PROT_EXEC
ones.
Disable that automatic READ_IMPLIES_EXEC forcing on x86-64 and
arm64.
Add tables documenting how READ_IMPLIES_EXEC is handled on x86-64, arm
and arm64.
By Kees Cook"
* tag 'core_core_updates_for_5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
arm64/elf: Disable automatic READ_IMPLIES_EXEC for 64-bit address spaces
arm32/64/elf: Split READ_IMPLIES_EXEC from executable PT_GNU_STACK
arm32/64/elf: Add tables to document READ_IMPLIES_EXEC
x86/elf: Disable automatic READ_IMPLIES_EXEC on 64-bit
x86/elf: Split READ_IMPLIES_EXEC from executable PT_GNU_STACK
x86/elf: Add table to document READ_IMPLIES_EXEC
- Unexport various PAT primitives
- Unexport per-CPU tlbstate
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl7Z+3cRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1jgyxAAjPoXEzi9rqGHY6Eus37DNbzHtdQj4fqN
68h8T2tSnOMzETe3L/c4puxI50YFpMA0sFbzm8BfjCtucs0K7Tj4Sv8Aoap2b99A
/bP+ySgHh2BMoI/tu9TiD8et+vttAGGwkXQhIOgeakZcYzpAY7oUNwc+CogkytbQ
DaC8s9FL7RjCXCL91fvZ33C0ksg5J9ynFbRozEHOacHPrE3CbrqUwu+75PmS7nJC
13vatOxjdqNPQhVMg7waN1nHv7K06kph1wxWxYHoD0QwAPy1ecE84wLvg9gv5AqK
BfUBmB34qRW21qbB5tQrMlGDS9tuV0vUB1fxUV7/iOKXQUH6viEG/7J7jm+YwXji
U9S54UPj/TOp8fvYdS18sp6vI1gS3HKjd3LO3pPHWsyZVMJBoGuMConZRs3C31Cp
WuwBU1gY+mFB5l4prt8WU8ocPvEnZkP00cCYNyzPk21tblfUwFbrmu3wcZxOkx3s
ZhRO4KrhxtL7l/wDLuNtWShBL2c6Rz2tts58tr/fj/M+UscJK2MPKxPLCAb20QYZ
qSkMa36+r8LkuMCyjpegEEmo4sw9yC6aLXFKfYu2ABki5o9AR4tavk+lwO+dad6T
k0DJjGXLsG9sReR6hrfaNTk5h7ImiRFDVntnWAhgKhARRoloJJS4/RkzW+ylPbac
mTuNNJDChUQ=
=RXKK
-----END PGP SIGNATURE-----
Merge tag 'x86-mm-2020-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm updates from Ingo Molnar:
"Misc changes:
- Unexport various PAT primitives
- Unexport per-CPU tlbstate and uninline TLB helpers"
* tag 'x86-mm-2020-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
x86/tlb/uv: Add a forward declaration for struct flush_tlb_info
x86/cpu: Export native_write_cr4() only when CONFIG_LKTDM=m
x86/tlb: Restrict access to tlbstate
xen/privcmd: Remove unneeded asm/tlb.h include
x86/tlb: Move PCID helpers where they are used
x86/tlb: Uninline nmi_uaccess_okay()
x86/tlb: Move cr4_set_bits_and_update_boot() to the usage site
x86/tlb: Move paravirt_tlb_remove_table() to the usage site
x86/tlb: Move __flush_tlb_all() out of line
x86/tlb: Move flush_tlb_others() out of line
x86/tlb: Move __flush_tlb_one_kernel() out of line
x86/tlb: Move __flush_tlb_one_user() out of line
x86/tlb: Move __flush_tlb_global() out of line
x86/tlb: Move __flush_tlb() out of line
x86/alternatives: Move temporary_mm helpers into C
x86/cr4: Sanitize CR4.PCE update
x86/cpu: Uninline CR4 accessors
x86/tlb: Uninline __get_current_cr3_fast()
x86/mm: Use pgprotval_t in protval_4k_2_large() and protval_large_2_4k()
x86/mm: Unexport __cachemode2pte_tbl
...
Instructions starting with 0f18 up to 0f1f are reserved nops, except those
that were assigned to MPX. These include the endbr markers used by CET.
List them correctly in the opcode table.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Merge yet more updates from Andrew Morton:
- More MM work. 100ish more to go. Mike Rapoport's "mm: remove
__ARCH_HAS_5LEVEL_HACK" series should fix the current ppc issue
- Various other little subsystems
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (127 commits)
lib/ubsan.c: fix gcc-10 warnings
tools/testing/selftests/vm: remove duplicate headers
selftests: vm: pkeys: fix multilib builds for x86
selftests: vm: pkeys: use the correct page size on powerpc
selftests/vm/pkeys: override access right definitions on powerpc
selftests/vm/pkeys: test correct behaviour of pkey-0
selftests/vm/pkeys: introduce a sub-page allocator
selftests/vm/pkeys: detect write violation on a mapped access-denied-key page
selftests/vm/pkeys: associate key on a mapped page and detect write violation
selftests/vm/pkeys: associate key on a mapped page and detect access violation
selftests/vm/pkeys: improve checks to determine pkey support
selftests/vm/pkeys: fix assertion in test_pkey_alloc_exhaust()
selftests/vm/pkeys: fix number of reserved powerpc pkeys
selftests/vm/pkeys: introduce powerpc support
selftests/vm/pkeys: introduce generic pkey abstractions
selftests: vm: pkeys: use the correct huge page size
selftests/vm/pkeys: fix alloc_random_pkey() to make it really random
selftests/vm/pkeys: fix assertion in pkey_disable_set/clear()
selftests/vm/pkeys: fix pkey_disable_clear()
selftests: vm: pkeys: add helpers for pkey bits
...
Most architectures define kmap_prot to be PAGE_KERNEL.
Let sparc and xtensa define there own and define PAGE_KERNEL as the
default if not overridden.
[akpm@linux-foundation.org: coding style fixes]
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christian König <christian.koenig@amd.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Helge Deller <deller@gmx.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20200507150004.1423069-16-ira.weiny@intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To support kmap_atomic_prot(), all architectures need to support
protections passed to their kmap_atomic_high() function. Pass protections
into kmap_atomic_high() and change the name to kmap_atomic_high_prot() to
match.
Then define kmap_atomic_prot() as a core function which calls
kmap_atomic_high_prot() when needed.
Finally, redefine kmap_atomic() as a wrapper of kmap_atomic_prot() with
the default kmap_prot exported by the architectures.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christian König <christian.koenig@amd.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Helge Deller <deller@gmx.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20200507150004.1423069-11-ira.weiny@intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Every single architecture (including !CONFIG_HIGHMEM) calls...
pagefault_enable();
preempt_enable();
... before returning from __kunmap_atomic(). Lift this code into the
kunmap_atomic() macro.
While we are at it rename __kunmap_atomic() to kunmap_atomic_high() to
be consistent.
[ira.weiny@intel.com: don't enable pagefault/preempt twice]
Link: http://lkml.kernel.org/r/20200518184843.3029640-1-ira.weiny@intel.com
[akpm@linux-foundation.org: coding style fixes]
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christian König <christian.koenig@amd.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Helge Deller <deller@gmx.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Guenter Roeck <linux@roeck-us.net>
Link: http://lkml.kernel.org/r/20200507150004.1423069-8-ira.weiny@intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Every arch has the same code to ensure atomic operations and a check for
!HIGHMEM page.
Remove the duplicate code by defining a core kmap_atomic() which only
calls the arch specific kmap_atomic_high() when the page is high memory.
[akpm@linux-foundation.org: coding style fixes]
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christian König <christian.koenig@amd.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Helge Deller <deller@gmx.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20200507150004.1423069-7-ira.weiny@intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
During this kmap() conversion series we must maintain bisect-ability. To
do this, kmap_atomic_prot() in x86, powerpc, and microblaze need to remain
functional.
Create a temporary inline version of kmap_atomic_prot within these
architectures so we can rework their kmap_atomic() calls and then lift
kmap_atomic_prot() to the core.
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christian König <christian.koenig@amd.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Helge Deller <deller@gmx.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20200507150004.1423069-6-ira.weiny@intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All architectures do exactly the same thing for kunmap(); remove all the
duplicate definitions and lift the call to the core.
This also has the benefit of changing kmap_unmap() on a number of
architectures to be an inline call rather than an actual function.
[akpm@linux-foundation.org: fix CONFIG_HIGHMEM=n build on various architectures]
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christian König <christian.koenig@amd.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Helge Deller <deller@gmx.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20200507150004.1423069-5-ira.weiny@intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kmap code for all the architectures is almost 100% identical.
Lift the common code to the core. Use ARCH_HAS_KMAP_FLUSH_TLB to indicate
if an arch defines kmap_flush_tlb() and call if if needed.
This also has the benefit of changing kmap() on a number of architectures
to be an inline call rather than an actual function.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christian König <christian.koenig@amd.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Helge Deller <deller@gmx.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20200507150004.1423069-4-ira.weiny@intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Remove duplicated kmap code", v3.
The kmap infrastructure has been copied almost verbatim to every
architecture. This series consolidates obvious duplicated code by
defining core functions which call into the architectures only when
needed.
Some of the k[un]map_atomic() implementations have some similarities but
the similarities were not sufficient to warrant further changes.
In addition we remove a duplicate implementation of kmap() in DRM.
This patch (of 15):
Replace the use of BUG_ON(in_interrupt()) in the kmap() and kunmap() in
favor of might_sleep().
Besides the benefits of might_sleep(), this normalizes the implementations
such that they can be made generic in subsequent patches.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian König <christian.koenig@amd.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Chris Zankel <chris@zankel.net>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Link: http://lkml.kernel.org/r/20200507150004.1423069-1-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20200507150004.1423069-2-ira.weiny@intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds tests which will validate architecture page table helpers and
other accessors in their compliance with expected generic MM semantics.
This will help various architectures in validating changes to existing
page table helpers or addition of new ones.
This test covers basic page table entry transformations including but not
limited to old, young, dirty, clean, write, write protect etc at various
level along with populating intermediate entries with next page table page
and validating them.
Test page table pages are allocated from system memory with required size
and alignments. The mapped pfns at page table levels are derived from a
real pfn representing a valid kernel text symbol. This test gets called
via late_initcall().
This test gets built and run when CONFIG_DEBUG_VM_PGTABLE is selected.
Any architecture, which is willing to subscribe this test will need to
select ARCH_HAS_DEBUG_VM_PGTABLE. For now this is limited to arc, arm64,
x86, s390 and powerpc platforms where the test is known to build and run
successfully Going forward, other architectures too can subscribe the test
after fixing any build or runtime problems with their page table helpers.
Folks interested in making sure that a given platform's page table helpers
conform to expected generic MM semantics should enable the above config
which will just trigger this test during boot. Any non conformity here
will be reported as an warning which would need to be fixed. This test
will help catch any changes to the agreed upon semantics expected from
generic MM and enable platforms to accommodate it thereafter.
[anshuman.khandual@arm.com: v17]
Link: http://lkml.kernel.org/r/1587436495-22033-3-git-send-email-anshuman.khandual@arm.com
[anshuman.khandual@arm.com: v18]
Link: http://lkml.kernel.org/r/1588564865-31160-3-git-send-email-anshuman.khandual@arm.com
Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> [s390]
Tested-by: Christophe Leroy <christophe.leroy@c-s.fr> [ppc32]
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Link: http://lkml.kernel.org/r/1583919272-24178-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm/debug: Add tests validating architecture page table
helpers", v18.
This adds a test validation for architecture exported page table helpers.
Patch adds basic transformation tests at various levels of the page table.
This test was originally suggested by Catalin during arm64 THP migration
RFC discussion earlier. Going forward it can include more specific tests
with respect to various generic MM functions like THP, HugeTLB etc and
platform specific tests.
https://lore.kernel.org/linux-mm/20190628102003.GA56463@arrakis.emea.arm.com/
This patch (of 2):
This just defines mm_p4d_folded() to check whether P4D page table level is
folded at runtime.
Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Link: http://lkml.kernel.org/r/1587436495-22033-2-git-send-email-anshuman.khandual@arm.com
Link: http://lkml.kernel.org/r/1588564865-31160-2-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The original code in mm/mremap.c checks huge pmd by:
if (is_swap_pmd(*old_pmd) || pmd_trans_huge(*old_pmd)) {
However, a DAX mapped nvdimm is mapped as huge page (by default) but it
is not transparent huge page (_PAGE_PSE | PAGE_DEVMAP). This commit
changes the condition to include the case.
This addresses CVE-2020-10757.
Fixes: 5c7fb56e5e ("mm, dax: dax-pmd vs thp-pmd vs hugetlbfs-pmd")
Cc: <stable@vger.kernel.org>
Reported-by: Fan Yang <Fan_Yang@sjtu.edu.cn>
Signed-off-by: Fan Yang <Fan_Yang@sjtu.edu.cn>
Tested-by: Fan Yang <Fan_Yang@sjtu.edu.cn>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull execve updates from Eric Biederman:
"Last cycle for the Nth time I ran into bugs and quality of
implementation issues related to exec that could not be easily be
fixed because of the way exec is implemented. So I have been digging
into exec and cleanup up what I can.
I don't think I have exec sorted out enough to fix the issues I
started with but I have made some headway this cycle with 4 sets of
changes.
- promised cleanups after introducing exec_update_mutex
- trivial cleanups for exec
- control flow simplifications
- remove the recomputation of bprm->cred
The net result is code that is a bit easier to understand and work
with and a decrease in the number of lines of code (if you don't count
the added tests)"
* 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (24 commits)
exec: Compute file based creds only once
exec: Add a per bprm->file version of per_clear
binfmt_elf_fdpic: fix execfd build regression
selftests/exec: Add binfmt_script regression test
exec: Remove recursion from search_binary_handler
exec: Generic execfd support
exec/binfmt_script: Don't modify bprm->buf and then return -ENOEXEC
exec: Move the call of prepare_binprm into search_binary_handler
exec: Allow load_misc_binary to call prepare_binprm unconditionally
exec: Convert security_bprm_set_creds into security_bprm_repopulate_creds
exec: Factor security_bprm_creds_for_exec out of security_bprm_set_creds
exec: Teach prepare_exec_creds how exec treats uids & gids
exec: Set the point of no return sooner
exec: Move handling of the point of no return to the top level
exec: Run sync_mm_rss before taking exec_update_mutex
exec: Fix spelling of search_binary_handler in a comment
exec: Move the comment from above de_thread to above unshare_sighand
exec: Rename flush_old_exec begin_new_exec
exec: Move most of setup_new_exec into flush_old_exec
exec: In setup_new_exec cache current in the local variable me
...
Consolidate the code and correct the comments to show that the actions
taken to update existing mappings to disable or enable dirty logging
are not necessary when creating, moving, or deleting a memslot.
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
Message-Id: <1591128450-11977-4-git-send-email-anthony.yznaga@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
On large memory guests it has been observed that creating a memslot
for a very large range can take noticeable amount of time.
Investigation showed that the time is spent walking the rmaps to update
existing sptes to remove write access or set/clear dirty bits to support
dirty logging. These rmap walks are unnecessary when creating or moving
a memslot. A newly created memslot will not have any existing mappings,
and the existing mappings of a moved memslot will have been invalidated
and flushed. Any mappings established once the new/moved memslot becomes
visible will be set using the properties of the new slot.
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
Message-Id: <1591128450-11977-3-git-send-email-anthony.yznaga@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There's no write access to remove. An existing memslot cannot be updated
to set or clear KVM_MEM_READONLY, and any mappings established in a newly
created or moved read-only memslot will already be read-only.
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
Message-Id: <1591128450-11977-2-git-send-email-anthony.yznaga@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pull livepatching updates from Jiri Kosina:
- simplifications and improvements for issues Peter Ziljstra found
during his previous work on W^X cleanups.
This allows us to remove livepatch arch-specific .klp.arch sections
and add proper support for jump labels in patched code.
Also, this patchset removes the last module_disable_ro() usage in the
tree.
Patches from Josh Poimboeuf and Peter Zijlstra
- a few other minor cleanups
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching:
MAINTAINERS: add lib/livepatch to LIVE PATCHING
livepatch: add arch-specific headers to MAINTAINERS
livepatch: Make klp_apply_object_relocs static
MAINTAINERS: adjust to livepatch .klp.arch removal
module: Make module_enable_ro() static again
x86/module: Use text_mutex in apply_relocate_add()
module: Remove module_disable_ro()
livepatch: Remove module_disable_ro() usage
x86/module: Use text_poke() for late relocations
s390/module: Use s390_kernel_write() for late relocations
s390: Change s390_kernel_write() return type to match memcpy()
livepatch: Prevent module-specific KLP rela sections from referencing vmlinux symbols
livepatch: Remove .klp.arch
livepatch: Apply vmlinux-specific KLP relocations early
livepatch: Disallow vmlinux.ko
Remove KVM_DEBUG_FS, which can easily be misconstrued as controlling
KVM-as-a-host. The sole user of CONFIG_KVM_DEBUG_FS was removed by
commit cfd8983f03 ("x86, locking/spinlocks: Remove ticket (spin)lock
implementation").
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200528031121.28904-1-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Both Intel and AMD support (MPK) Memory Protection Key feature.
Move the feature detection from VMX to the common code. It should
work for both the platforms now.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Message-Id: <158932795627.44260.15144185478040178638.stgit@naples-babu.amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Delay the assignment of array.maxnent to use correct value for the case
cpuid->nent > KVM_MAX_CPUID_ENTRIES.
Fixes: e53c95e8d4 ("KVM: x86: Encapsulate CPUID entries and metadata in struct")
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-Id: <20200604041636.1187-1-xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Unconditionally return true when querying the validity of
MSR_IA32_PERF_CAPABILITIES so as to defer the validity check to
intel_pmu_{get,set}_msr(), which can properly give the MSR a pass when
the access is initiated from host userspace. The MSR is emulated so
there is no underlying hardware dependency to worry about.
Fixes: 27461da310 ("KVM: x86/pmu: Support full width counting")
Cc: Like Xu <like.xu@linux.intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200603203303.28545-1-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
After commit 63d0434 ("KVM: x86: move kvm_create_vcpu_debugfs after
last failure point") we are creating the pre-vCPU debugfs files
after the creation of the vCPU file descriptor. This makes it
possible for userspace to reach kvm_vcpu_release before
kvm_create_vcpu_debugfs has finished. The vcpu->debugfs_dentry
then does not have any associated inode anymore, and this causes
a NULL-pointer dereference in debugfs_create_file.
The solution is simply to avoid removing the files; they are
cleaned up when the VM file descriptor is closed (and that must be
after KVM_CREATE_VCPU returns). We can stop storing the dentry
in struct kvm_vcpu too, because it is not needed anywhere after
kvm_create_vcpu_debugfs returns.
Reported-by: syzbot+705f4401d5a93a59b87d@syzkaller.appspotmail.com
Fixes: 63d0434837 ("KVM: x86: move kvm_create_vcpu_debugfs after last failure point")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Merge more updates from Andrew Morton:
"More mm/ work, plenty more to come
Subsystems affected by this patch series: slub, memcg, gup, kasan,
pagealloc, hugetlb, vmscan, tools, mempolicy, memblock, hugetlbfs,
thp, mmap, kconfig"
* akpm: (131 commits)
arm64: mm: use ARCH_HAS_DEBUG_WX instead of arch defined
x86: mm: use ARCH_HAS_DEBUG_WX instead of arch defined
riscv: support DEBUG_WX
mm: add DEBUG_WX support
drivers/base/memory.c: cache memory blocks in xarray to accelerate lookup
mm/thp: rename pmd_mknotpresent() as pmd_mkinvalid()
powerpc/mm: drop platform defined pmd_mknotpresent()
mm: thp: don't need to drain lru cache when splitting and mlocking THP
hugetlbfs: get unmapped area below TASK_UNMAPPED_BASE for hugetlbfs
sparc32: register memory occupied by kernel as memblock.memory
include/linux/memblock.h: fix minor typo and unclear comment
mm, mempolicy: fix up gup usage in lookup_node
tools/vm/page_owner_sort.c: filter out unneeded line
mm: swap: memcg: fix memcg stats for huge pages
mm: swap: fix vmstats for huge pages
mm: vmscan: limit the range of LRU type balancing
mm: vmscan: reclaim writepage is IO cost
mm: vmscan: determine anon/file pressure balance at the reclaim root
mm: balance LRU lists based on relative thrashing
mm: only count actual rotations as LRU reclaim cost
...
Extract DEBUG_WX to mm/Kconfig.debug for shared use. Change to use
ARCH_HAS_DEBUG_WX instead of DEBUG_WX defined by arch port.
Signed-off-by: Zong Li <zong.li@sifive.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Link: http://lkml.kernel.org/r/430736828d149df3f5b462d291e845ec690e0141.1587455584.git.zong.li@sifive.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
pmd_present() is expected to test positive after pmdp_mknotpresent() as
the PMD entry still points to a valid huge page in memory.
pmdp_mknotpresent() implies that given PMD entry is just invalidated from
MMU perspective while still holding on to pmd_page() referred valid huge
page thus also clearing pmd_present() test. This creates the following
situation which is counter intuitive.
[pmd_present(pmd_mknotpresent(pmd)) = true]
This renames pmd_mknotpresent() as pmd_mkinvalid() reflecting the helper's
functionality more accurately while changing the above mentioned situation
as follows. This does not create any functional change.
[pmd_present(pmd_mkinvalid(pmd)) = true]
This is not applicable for platforms that define own pmdp_invalidate() via
__HAVE_ARCH_PMDP_INVALIDATE. Suggestion for renaming came during a
previous discussion here.
https://patchwork.kernel.org/patch/11019637/
[anshuman.khandual@arm.com: change pmd_mknotvalid() to pmd_mkinvalid() per Will]
Link: http://lkml.kernel.org/r/1587520326-10099-3-git-send-email-anshuman.khandual@arm.com
Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Will Deacon <will@kernel.org>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Link: http://lkml.kernel.org/r/1584680057-13753-3-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are multiple similar definitions for arch_clear_hugepage_flags() on
various platforms. Lets just add it's generic fallback definition for
platforms that do not override. This help reduce code duplication.
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: http://lkml.kernel.org/r/1588907271-11920-4-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are multiple similar definitions for is_hugepage_only_range() on
various platforms. Lets just add it's generic fallback definition for
platforms that do not override. This help reduce code duplication.
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: http://lkml.kernel.org/r/1588907271-11920-3-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
hugetlb_add_hstate() prints a warning if the hstate already exists. This
was originally done as part of kernel command line parsing. If
'hugepagesz=' was specified more than once, the warning
pr_warn("hugepagesz= specified twice, ignoring\n");
would be printed.
Some architectures want to enable all huge page sizes. They would call
hugetlb_add_hstate for all supported sizes. However, this was done after
command line processing and as a result hstates could have already been
created for some sizes. To make sure no warning were printed, there would
often be code like:
if (!size_to_hstate(size)
hugetlb_add_hstate(ilog2(size) - PAGE_SHIFT)
The only time we want to print the warning is as the result of command
line processing. So, remove the warning from hugetlb_add_hstate and add
it to the single arch independent routine processing "hugepagesz=". After
this, calls to size_to_hstate() in arch specific code can be removed and
hugetlb_add_hstate can be called without worrying about warning messages.
[mike.kravetz@oracle.com: fix hugetlb initialization]
Link: http://lkml.kernel.org/r/4c36c6ce-3774-78fa-abc4-b7346bf24348@oracle.com
Link: http://lkml.kernel.org/r/20200428205614.246260-5-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Anders Roxell <anders.roxell@linaro.org>
Acked-by: Mina Almasry <almasrymina@google.com>
Acked-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> [s390]
Acked-by: Will Deacon <will@kernel.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Longpeng <longpeng2@huawei.com>
Cc: Nitesh Narayan Lal <nitesh@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200417185049.275845-4-mike.kravetz@oracle.com
Link: http://lkml.kernel.org/r/20200428205614.246260-4-mike.kravetz@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that architectures provide arch_hugetlb_valid_size(), parsing of
"hugepagesz=" can be done in architecture independent code. Create a
single routine to handle hugepagesz= parsing and remove all arch specific
routines. We can also remove the interface hugetlb_bad_size() as this is
no longer used outside arch independent code.
This also provides consistent behavior of hugetlbfs command line options.
The hugepagesz= option should only be specified once for a specific size,
but some architectures allow multiple instances. This appears to be more
of an oversight when code was added by some architectures to set up ALL
huge pages sizes.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Sandipan Das <sandipan@linux.ibm.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Acked-by: Mina Almasry <almasrymina@google.com>
Acked-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> [s390]
Acked-by: Will Deacon <will@kernel.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Longpeng <longpeng2@huawei.com>
Cc: Nitesh Narayan Lal <nitesh@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Anders Roxell <anders.roxell@linaro.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200417185049.275845-3-mike.kravetz@oracle.com
Link: http://lkml.kernel.org/r/20200428205614.246260-3-mike.kravetz@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Clean up hugetlb boot command line processing", v4.
Longpeng(Mike) reported a weird message from hugetlb command line
processing and proposed a solution [1]. While the proposed patch does
address the specific issue, there are other related issues in command line
processing. As hugetlbfs evolved, updates to command line processing have
been made to meet immediate needs and not necessarily in a coordinated
manner. The result is that some processing is done in arch specific code,
some is done in arch independent code and coordination is problematic.
Semantics can vary between architectures.
The patch series does the following:
- Define arch specific arch_hugetlb_valid_size routine used to validate
passed huge page sizes.
- Move hugepagesz= command line parsing out of arch specific code and into
an arch independent routine.
- Clean up command line processing to follow desired semantics and
document those semantics.
[1] https://lore.kernel.org/linux-mm/20200305033014.1152-1-longpeng2@huawei.com
This patch (of 3):
The architecture independent routine hugetlb_default_setup sets up the
default huge pages size. It has no way to verify if the passed value is
valid, so it accepts it and attempts to validate at a later time. This
requires undocumented cooperation between the arch specific and arch
independent code.
For architectures that support more than one huge page size, provide a
routine arch_hugetlb_valid_size to validate a huge page size.
hugetlb_default_setup can use this to validate passed values.
arch_hugetlb_valid_size will also be used in a subsequent patch to move
processing of the "hugepagesz=" in arch specific code to a common routine
in arch independent code.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> [s390]
Acked-by: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Longpeng <longpeng2@huawei.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Nitesh Narayan Lal <nitesh@redhat.com>
Cc: Anders Roxell <anders.roxell@linaro.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200428205614.246260-1-mike.kravetz@oracle.com
Link: http://lkml.kernel.org/r/20200428205614.246260-2-mike.kravetz@oracle.com
Link: http://lkml.kernel.org/r/20200417185049.275845-1-mike.kravetz@oracle.com
Link: http://lkml.kernel.org/r/20200417185049.275845-2-mike.kravetz@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Using padata during deferred init has only been tested on x86, so for now
limit it to this architecture.
If another arch wants this, it can find the max thread limit that's best
for it and override deferred_page_init_max_threads().
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Josh Triplett <josh@joshtriplett.org>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Robert Elliott <elliott@hpe.com>
Cc: Shile Zhang <shile.zhang@linux.alibaba.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Link: http://lkml.kernel.org/r/20200527173608.2885243-8-daniel.m.jordan@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
free_area_init_node() is only used by x86 to initialize a memory-less
nodes. Make its name reflect this and drop all the function parameters
except node ID as they are anyway zero.
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Hoan Tran <hoan@os.amperecomputing.com> [arm64]
Cc: Baoquan He <bhe@redhat.com>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200412194859.12663-19-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The memmap_init() function was made to iterate over memblock regions and
as the result the early_pfn_in_nid() function became obsolete. Since
CONFIG_NODES_SPAN_OTHER_NODES is only used to pick a stub or a real
implementation of early_pfn_in_nid(), it is also not needed anymore.
Remove both early_pfn_in_nid() and the CONFIG_NODES_SPAN_OTHER_NODES.
Co-developed-by: Hoan Tran <Hoan@os.amperecomputing.com>
Signed-off-by: Hoan Tran <Hoan@os.amperecomputing.com>
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Hoan Tran <hoan@os.amperecomputing.com> [arm64]
Cc: Baoquan He <bhe@redhat.com>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200412194859.12663-17-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
free_area_init() has effectively became a wrapper for
free_area_init_nodes() and there is no point of keeping it. Still
free_area_init() name is shorter and more general as it does not imply
necessity to initialize multiple nodes.
Rename free_area_init_nodes() to free_area_init(), update the callers and
drop old version of free_area_init().
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Hoan Tran <hoan@os.amperecomputing.com> [arm64]
Reviewed-by: Baoquan He <bhe@redhat.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200412194859.12663-6-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
CONFIG_HAVE_MEMBLOCK_NODE_MAP is used to differentiate initialization of
nodes and zones structures between the systems that have region to node
mapping in memblock and those that don't.
Currently all the NUMA architectures enable this option and for the
non-NUMA systems we can presume that all the memory belongs to node 0 and
therefore the compile time configuration option is not required.
The remaining few architectures that use DISCONTIGMEM without NUMA are
easily updated to use memblock_add_node() instead of memblock_add() and
thus have proper correspondence of memblock regions to NUMA nodes.
Still, free_area_init_node() must have a backward compatible version
because its semantics with and without CONFIG_HAVE_MEMBLOCK_NODE_MAP is
different. Once all the architectures will use the new semantics, the
entire compatibility layer can be dropped.
To avoid addition of extra run time memory to store node id for
architectures that keep memblock but have only a single node, the node id
field of the memblock_region is guarded by CONFIG_NEED_MULTIPLE_NODES and
the corresponding accessors presume that in those cases it is always 0.
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Hoan Tran <hoan@os.amperecomputing.com> [arm64]
Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64]
Cc: Baoquan He <bhe@redhat.com>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200412194859.12663-4-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm: rework free_area_init*() funcitons".
After the discussion [1] about removal of CONFIG_NODES_SPAN_OTHER_NODES
and CONFIG_HAVE_MEMBLOCK_NODE_MAP options, I took it a bit further and
updated the node/zone initialization.
Since all architectures have memblock, it is possible to use only the
newer version of free_area_init_node() that calculates the zone and node
boundaries based on memblock node mapping and architectural limits on
possible zone PFNs.
The architectures that still determined zone and hole sizes can be
switched to the generic code and the old code that took those zone and
hole sizes can be simply removed.
And, since it all started from the removal of
CONFIG_NODES_SPAN_OTHER_NODES, the memmap_init() is now updated to iterate
over memblocks and so it does not need to perform early_pfn_to_nid() query
for every PFN.
[1] https://lore.kernel.org/lkml/1585420282-25630-1-git-send-email-Hoan@os.amperecomputing.com
This patch (of 21):
There are several places in the code that directly dereference
memblock_region.nid despite this field being defined only when
CONFIG_HAVE_MEMBLOCK_NODE_MAP=y.
Replace these with calls to memblock_get_region_nid() to improve code
robustness and to avoid possible breakage when
CONFIG_HAVE_MEMBLOCK_NODE_MAP will be removed.
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Hoan Tran <hoan@os.amperecomputing.com> [arm64]
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200412194859.12663-1-rppt@kernel.org
Link: http://lkml.kernel.org/r/20200412194859.12663-2-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull networking updates from David Miller:
1) Allow setting bluetooth L2CAP modes via socket option, from Luiz
Augusto von Dentz.
2) Add GSO partial support to igc, from Sasha Neftin.
3) Several cleanups and improvements to r8169 from Heiner Kallweit.
4) Add IF_OPER_TESTING link state and use it when ethtool triggers a
device self-test. From Andrew Lunn.
5) Start moving away from custom driver versions, use the globally
defined kernel version instead, from Leon Romanovsky.
6) Support GRO vis gro_cells in DSA layer, from Alexander Lobakin.
7) Allow hard IRQ deferral during NAPI, from Eric Dumazet.
8) Add sriov and vf support to hinic, from Luo bin.
9) Support Media Redundancy Protocol (MRP) in the bridging code, from
Horatiu Vultur.
10) Support netmap in the nft_nat code, from Pablo Neira Ayuso.
11) Allow UDPv6 encapsulation of ESP in the ipsec code, from Sabrina
Dubroca. Also add ipv6 support for espintcp.
12) Lots of ReST conversions of the networking documentation, from Mauro
Carvalho Chehab.
13) Support configuration of ethtool rxnfc flows in bcmgenet driver,
from Doug Berger.
14) Allow to dump cgroup id and filter by it in inet_diag code, from
Dmitry Yakunin.
15) Add infrastructure to export netlink attribute policies to
userspace, from Johannes Berg.
16) Several optimizations to sch_fq scheduler, from Eric Dumazet.
17) Fallback to the default qdisc if qdisc init fails because otherwise
a packet scheduler init failure will make a device inoperative. From
Jesper Dangaard Brouer.
18) Several RISCV bpf jit optimizations, from Luke Nelson.
19) Correct the return type of the ->ndo_start_xmit() method in several
drivers, it's netdev_tx_t but many drivers were using
'int'. From Yunjian Wang.
20) Add an ethtool interface for PHY master/slave config, from Oleksij
Rempel.
21) Add BPF iterators, from Yonghang Song.
22) Add cable test infrastructure, including ethool interfaces, from
Andrew Lunn. Marvell PHY driver is the first to support this
facility.
23) Remove zero-length arrays all over, from Gustavo A. R. Silva.
24) Calculate and maintain an explicit frame size in XDP, from Jesper
Dangaard Brouer.
25) Add CAP_BPF, from Alexei Starovoitov.
26) Support terse dumps in the packet scheduler, from Vlad Buslov.
27) Support XDP_TX bulking in dpaa2 driver, from Ioana Ciornei.
28) Add devm_register_netdev(), from Bartosz Golaszewski.
29) Minimize qdisc resets, from Cong Wang.
30) Get rid of kernel_getsockopt and kernel_setsockopt in order to
eliminate set_fs/get_fs calls. From Christoph Hellwig.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2517 commits)
selftests: net: ip_defrag: ignore EPERM
net_failover: fixed rollback in net_failover_open()
Revert "tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv"
Revert "tipc: Fix potential tipc_node refcnt leak in tipc_rcv"
vmxnet3: allow rx flow hash ops only when rss is enabled
hinic: add set_channels ethtool_ops support
selftests/bpf: Add a default $(CXX) value
tools/bpf: Don't use $(COMPILE.c)
bpf, selftests: Use bpf_probe_read_kernel
s390/bpf: Use bcr 0,%0 as tail call nop filler
s390/bpf: Maintain 8-byte stack alignment
selftests/bpf: Fix verifier test
selftests/bpf: Fix sample_cnt shared between two threads
bpf, selftests: Adapt cls_redirect to call csum_level helper
bpf: Add csum_level helper for fixing up csum levels
bpf: Fix up bpf_skb_adjust_room helper's skb csum setting
sfc: add missing annotation for efx_ef10_try_update_nic_stats_vf()
crypto/chtls: IPv6 support for inline TLS
Crypto/chcr: Fixes a coccinile check error
Crypto/chcr: Fixes compilations warnings
...
- Move the arch-specific code into arch/arm64/kvm
- Start the post-32bit cleanup
- Cherry-pick a few non-invasive pre-NV patches
x86:
- Rework of TLB flushing
- Rework of event injection, especially with respect to nested virtualization
- Nested AMD event injection facelift, building on the rework of generic code
and fixing a lot of corner cases
- Nested AMD live migration support
- Optimization for TSC deadline MSR writes and IPIs
- Various cleanups
- Asynchronous page fault cleanups (from tglx, common topic branch with tip tree)
- Interrupt-based delivery of asynchronous "page ready" events (host side)
- Hyper-V MSRs and hypercalls for guest debugging
- VMX preemption timer fixes
s390:
- Cleanups
Generic:
- switch vCPU thread wakeup from swait to rcuwait
The other architectures, and the guest side of the asynchronous page fault
work, will come next week.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAl7VJcYUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroPf6QgAq4wU5wdd1lTGz/i3DIhNVJNJgJlp
ozLzRdMaJbdbn5RpAK6PEBd9+pt3+UlojpFB3gpJh2Nazv2OzV4yLQgXXXyyMEx1
5Hg7b4UCJYDrbkCiegNRv7f/4FWDkQ9dx++RZITIbxeskBBCEI+I7GnmZhGWzuC4
7kj4ytuKAySF2OEJu0VQF6u0CvrNYfYbQIRKBXjtOwuRK4Q6L63FGMJpYo159MBQ
asg3B1jB5TcuGZ9zrjL5LkuzaP4qZZHIRs+4kZsH9I6MODHGUxKonrkablfKxyKy
CFK+iaHCuEXXty5K0VmWM3nrTfvpEjVjbMc7e1QGBQ5oXsDM0pqn84syRg==
=v7Wn
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"ARM:
- Move the arch-specific code into arch/arm64/kvm
- Start the post-32bit cleanup
- Cherry-pick a few non-invasive pre-NV patches
x86:
- Rework of TLB flushing
- Rework of event injection, especially with respect to nested
virtualization
- Nested AMD event injection facelift, building on the rework of
generic code and fixing a lot of corner cases
- Nested AMD live migration support
- Optimization for TSC deadline MSR writes and IPIs
- Various cleanups
- Asynchronous page fault cleanups (from tglx, common topic branch
with tip tree)
- Interrupt-based delivery of asynchronous "page ready" events (host
side)
- Hyper-V MSRs and hypercalls for guest debugging
- VMX preemption timer fixes
s390:
- Cleanups
Generic:
- switch vCPU thread wakeup from swait to rcuwait
The other architectures, and the guest side of the asynchronous page
fault work, will come next week"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (256 commits)
KVM: selftests: fix rdtsc() for vmx_tsc_adjust_test
KVM: check userspace_addr for all memslots
KVM: selftests: update hyperv_cpuid with SynDBG tests
x86/kvm/hyper-v: Add support for synthetic debugger via hypercalls
x86/kvm/hyper-v: enable hypercalls regardless of hypercall page
x86/kvm/hyper-v: Add support for synthetic debugger interface
x86/hyper-v: Add synthetic debugger definitions
KVM: selftests: VMX preemption timer migration test
KVM: nVMX: Fix VMX preemption timer migration
x86/kvm/hyper-v: Explicitly align hcall param for kvm_hyperv_exit
KVM: x86/pmu: Support full width counting
KVM: x86/pmu: Tweak kvm_pmu_get_msr to pass 'struct msr_data' in
KVM: x86: announce KVM_FEATURE_ASYNC_PF_INT
KVM: x86: acknowledgment mechanism for async pf page ready notifications
KVM: x86: interrupt based APF 'page ready' event delivery
KVM: introduce kvm_read_guest_offset_cached()
KVM: rename kvm_arch_can_inject_async_page_present() to kvm_arch_can_dequeue_async_page_present()
KVM: x86: extend struct kvm_vcpu_pv_apf_data with token info
Revert "KVM: async_pf: Fix #DF due to inject "Page not Present" and "Page Ready" exceptions simultaneously"
KVM: VMX: Replace zero-length array with flexible-array
...
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAl7WhbkTHHdlaS5saXVA
a2VybmVsLm9yZwAKCRB2FHBfkEGgXlUnB/0R8dBVSeRfNmyJaadBWKFc/LffwKLD
CQ8PVv22ffkCaEYV2tpnhS6NmkERLNdson4Uo02tVUsjOJ4CrWHTn7aKqYWZyA+O
qv/PiD9TBXJVYMVP2kkyaJlK5KoqeAWBr2kM16tT0cxQmlhE7g0Xo2wU9vhRbU+4
i4F0jffe4lWps65TK392CsPr6UEv1HSel191Py5zLzYqChT+L8WfahmBt3chhsV5
TIUJYQvBwxecFRla7yo+4sUn37ZfcTqD1hCWSr0zs4psW0ge7d80kuaNZS+EqxND
fGm3Bp1BlUuDKsJ/D+AaHLCR47PUZ9t9iMDjZS/ovYglLFwi+h3tAV+W
=LwVR
-----END PGP SIGNATURE-----
Merge tag 'hyperv-next-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyper-v updates from Wei Liu:
- a series from Andrea to support channel reassignment
- a series from Vitaly to clean up Vmbus message handling
- a series from Michael to clean up and augment hyperv-tlfs.h
- patches from Andy to clean up GUID usage in Hyper-V code
- a few other misc patches
* tag 'hyperv-next-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: (29 commits)
Drivers: hv: vmbus: Resolve more races involving init_vp_index()
Drivers: hv: vmbus: Resolve race between init_vp_index() and CPU hotplug
vmbus: Replace zero-length array with flexible-array
Driver: hv: vmbus: drop a no long applicable comment
hyper-v: Switch to use UUID types directly
hyper-v: Replace open-coded variant of %*phN specifier
hyper-v: Supply GUID pointer to printf() like functions
hyper-v: Use UUID API for exporting the GUID (part 2)
asm-generic/hyperv: Add definitions for Get/SetVpRegister hypercalls
x86/hyperv: Split hyperv-tlfs.h into arch dependent and independent files
x86/hyperv: Remove HV_PROCESSOR_POWER_STATE #defines
KVM: x86: hyperv: Remove duplicate definitions of Reference TSC Page
drivers: hv: remove redundant assignment to pointer primary_channel
scsi: storvsc: Re-init stor_chns when a channel interrupt is re-assigned
Drivers: hv: vmbus: Introduce the CHANNELMSG_MODIFYCHANNEL message type
Drivers: hv: vmbus: Synchronize init_vp_index() vs. CPU hotplug
Drivers: hv: vmbus: Remove the unused HV_LOCALIZED channel affinity logic
PCI: hv: Prepare hv_compose_msi_msg() for the VMBus-channel-interrupt-to-vCPU reassignment functionality
Drivers: hv: vmbus: Use a spin lock for synchronizing channel scheduling vs. channel removal
hv_utils: Always execute the fcopy and vss callbacks in a tasklet
...
By far the biggest change in this cycle are the changes that allow much
earlier debug of systems that are hooked up via UART by taking advantage
of the earlycon framework to implement the kgdb I/O hooks before handing
over to the regular polling I/O drivers once they are available. When
discussing Doug's work we also found and fixed an broken
raw_smp_processor_id() sequence in in_dbg_master().
Also included are a collection of much smaller fixes and tweaks: a
couple of tweaks to ged rid of doc gen or coccicheck warnings, future
proof some internal calculations that made implicit power-of-2
assumptions and eliminate some rather weird handling of magic
environment variables in kdb.
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEELzVBU1D3lWq6cKzwfOMlXTn3iKEFAl7WfPsACgkQfOMlXTn3
iKGhvBAAmalPhPvJ74djkSfSuz+fNVgjer5wKGQNhz4lSd+0W3lCkY8T2fkUIpL5
jR3Q0gzJSA2WMSA7RrIwegDt0kCiQI0rtRKDkQxo33HBVSLlh2p5oXg7P5lQ4uOi
QZyPI176V1KncFZjPKK2HzhTjoPNlx8GqVys6PBQETvTvxKR3f9qoq5qOKl/f9kQ
Q4Dzb/npl6/XGJnQfdnkRcrXXtlK08yRxfXQyBEv0X6U9PUe1xmEZb1i9WBrrOYv
u6N94fy2z6vqRgnbv4F6FTiQEHR1VFW2nPGpJ6GFv3KGFpT4QSWuyqTjm1Biee2y
Gjn5ACAhW6tdPL+tCK3MRNGih7MaKoR01SnXz5D4T9V1zFTOhW7vyw+t3zoLfR7R
fJoymQWKyfWbtj0Do8POiF31V+hvGVuqhzG/lTpnynSRJL38x4il6sFmtuRxMW+8
vyxaetrPX+omf+fq1ueYTJS5Y5bl1Zp3avajD3VPXq2Vq2m4zl++AOlzTOJDF5A+
P9RbwfWJh5Tm3VdCCWv849IDCK3R15DjoNLsuJkNRzqAYrJMVjA/QWyIAT14KR3z
Nx3ix/QVKFkNnP5g1N38i2AvWRWZ/QuAmAFRgsmgnYPapeeX4EPtgdmqnloV9AAx
CgO7KgUJF4LSIKTfoeWNJ4mpgSVR8zxkOR9w6DX0EQHDbfwlx8o=
=uLAB
-----END PGP SIGNATURE-----
Merge tag 'kgdb-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux
Pull kgdb updates from Daniel Thompson:
"By far the biggest change in this cycle are the changes that allow
much earlier debug of systems that are hooked up via UART by taking
advantage of the earlycon framework to implement the kgdb I/O hooks
before handing over to the regular polling I/O drivers once they are
available. When discussing Doug's work we also found and fixed an
broken raw_smp_processor_id() sequence in in_dbg_master().
Also included are a collection of much smaller fixes and tweaks: a
couple of tweaks to ged rid of doc gen or coccicheck warnings, future
proof some internal calculations that made implicit power-of-2
assumptions and eliminate some rather weird handling of magic
environment variables in kdb"
* tag 'kgdb-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux:
kdb: Remove the misfeature 'KDBFLAGS'
kdb: Cleanup math with KDB_CMD_HISTORY_COUNT
serial: amba-pl011: Support kgdboc_earlycon
serial: 8250_early: Support kgdboc_earlycon
serial: qcom_geni_serial: Support kgdboc_earlycon
serial: kgdboc: Allow earlycon initialization to be deferred
Documentation: kgdboc: Document new kgdboc_earlycon parameter
kgdb: Don't call the deinit under spinlock
kgdboc: Disable all the early code when kgdboc is a module
kgdboc: Add kgdboc_earlycon to support early kgdb using boot consoles
kgdboc: Remove useless #ifdef CONFIG_KGDB_SERIAL_CONSOLE in kgdboc
kgdb: Prevent infinite recursive entries to the debugger
kgdb: Delay "kgdbwait" to dbg_late_init() by default
kgdboc: Use a platform device to handle tty drivers showing up late
Revert "kgdboc: disable the console lock when in kgdb"
kgdb: Disable WARN_CONSOLE_UNLOCKED for all kgdb
kgdb: Return true in kgdb_nmi_poll_knock()
kgdb: Drop malformed kernel doc comment
kgdb: Fix spurious true from in_dbg_master()
Once upon a time the predecessor of that thing (TEST_VERIFY_AREA)
used to be. However, that had been gone for years now (and
the patch that introduced TEST_ACCESS_OK has not touched any
ifdefs - they got gradually removed later). Just bury it...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Latest edition (039) of "Intel Architecture Instruction Set Extensions
and Future Features Programming Reference" includes three new CPU model
numbers. Linux already has the two Ice Lake server ones. Add the new
model number for Sapphire Rapids.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200603173352.15506-1-tony.luck@intel.com
- Add TPAUSE based delay which allows the CPU to enter an optimized power
state while waiting for the delay to pass. The delay is based on TSC
cycles.
- Add tsc_early_khz command line parameter to workaround the problem that
overclocked CPUs can report the wrong frequency via CPUID.16h which
causes the refined calibration to fail because the delta to the initial
frequency value is too big. With the parameter users can provide an
halfways accurate initial value.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl7XvMITHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoQ59EACWOU2E+S/b+AqKoZRAJWbTASmu2jEU
4AukhjO3A0y+G3EqnCtvQbUbKkthScSmrDJs2Dt8CTO6q3Fqv/f5JgoubgSx9Hbj
pF1hvueOvRBpinzGEJbDbv+HbkoCYr10DZ5dZ8uz120pSnlfSNNpgZ6hJkOFaUHu
nwVEJpkg2x3ZsiJrgyOfdorwbxO5dCNY9YVL3jyVXUi5QfP3lYrr3/Nz6daIRtRn
Q9tj48N4Bk4ASgmg4rSdXd6OKeZ3Oz1nerol5vFvBeaOc8PVcKSu5sSqMIHHUV2M
RJq8T4nW5Y4pkYjpdYP7Pr/3HYbSNW6eU+MycfnJOzYYTIQfFWkG2wHDNuOg/v+A
GC/grS6wNBj/+tZlvWTwLPf44h7V+sowzYPHBWounT/5drFZ+xsm8+Je4s2NtNih
rbG/4oOQ2jn05PNBCCOyLuP33efQ3ub2UHPCoUxckMiX2eqI+iWpdllZLSiSADZY
jlbXgTQ/Fa3nGKVYVDi1GYbx1rBr/HbsbgvGV4D802s7inmev0azrbgc/CECrnvO
rEa501Y1xzxZ7Zet0QvLK/7aKP532pCmgZiBSmcnS73FBbnssvNJiHlAeq4NHtN2
TsaGYLy0iPSj7siXEaeysUKRjUTNNrgRvtWfo35GDjWahgXhixIvVVpwxnWws5cj
aNR5FwxnI03V2A==
=199V
-----END PGP SIGNATURE-----
Merge tag 'x86-timers-2020-06-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 timer updates from Thomas Gleixner:
"X86 timer specific updates:
- Add TPAUSE based delay which allows the CPU to enter an optimized
power state while waiting for the delay to pass. The delay is based
on TSC cycles.
- Add tsc_early_khz command line parameter to workaround the problem
that overclocked CPUs can report the wrong frequency via CPUID.16h
which causes the refined calibration to fail because the delta to
the initial frequency value is too big. With the parameter users
can provide an halfways accurate initial value"
* tag 'x86-timers-2020-06-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/tsc: Add tsc_early_khz command line parameter
x86/delay: Introduce TPAUSE delay
x86/delay: Refactor delay_mwaitx() for TPAUSE support
x86/delay: Preparatory code cleanup
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl7VPc4QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpgQkEACnQlzWOfNQMz1AzgUAv/S8IYDJCLrkbjLZ
JK4pJv8Hjhss/7sS+fd8kyKe9VtaZz2IjmrXcC66RMMwtpx4iHnkRffoNAgEdGOl
/M5TCZGhs+F/mp3Lc0WdR5DFHkM6yy2Tkk9wCFLreB4bW67janAWnd7nbU4INqJj
+WqIgpzNMc/kfUhpBYTeQLORhL4e2TG9ADTi/zeUITlpnEsA65LOgXKEpeIFYnSX
KTl4GIZ9tjazG3Y1Eva7DYHDIErNNAtX67KBqf+WBgMV98eB0O6xIPN1WlmhDTqj
FGMLkb8msH1HHntvxDAuc4/ortnUy8vPI4o6zKP89HJJNjIM5p5eHEuVF5JnBw42
Rtu9Om6JqWx51nhAhJNBj9bUStYbhEl0vVQCwbkfPbDJhzTy3RR8z709q9+ZwOrL
xbp4aJBzqrzscjBEiSQbNCf2PyuOAdU0r1x81UN81ZN41d5qUcumcinjw4Y7vru8
z5zMlo1Iy/AWQYyu7jgHmnpI7ZyA/1Qclo5dV7aa72bLFaJa35e7QxgfQOFBA5dY
UZl6QPJRlnB80uGRzD5jCh2O2sQ3XZqYnpaKsUAka1GgbceCp9IC4A5mfZvpACsh
Xk8VXjlhvY/iPJsKLqrh4Oedg4Dj5M3PLL9C3MDfYeIP2qgXpbnk87UV1TPNSpY0
QcTxsXXXIw==
=H+/Z
-----END PGP SIGNATURE-----
Merge tag 'for-5.8/drivers-2020-06-01' of git://git.kernel.dk/linux-block
Pull block driver updates from Jens Axboe:
"On top of the core changes, here are the block driver changes for this
merge window:
- NVMe changes:
- NVMe over Fibre Channel protocol updates, which also reach
over to drivers/scsi/lpfc (James Smart)
- namespace revalidation support on the target (Anthony
Iliopoulos)
- gcc zero length array fix (Arnd Bergmann)
- nvmet cleanups (Chaitanya Kulkarni)
- misc cleanups and fixes (me, Keith Busch, Sagi Grimberg)
- use a SRQ per completion vector (Max Gurtovoy)
- fix handling of runtime changes to the queue count (Weiping
Zhang)
- t10 protection information support for nvme-rdma and
nvmet-rdma (Israel Rukshin and Max Gurtovoy)
- target side AEN improvements (Chaitanya Kulkarni)
- various fixes and minor improvements all over, icluding the
nvme part of the lpfc driver"
- Floppy code cleanup series (Willy, Denis)
- Floppy contention fix (Jiri)
- Loop CONFIGURE support (Martijn)
- bcache fixes/improvements (Coly, Joe, Colin)
- q->queuedata cleanups (Christoph)
- Get rid of ioctl_by_bdev (Christoph, Stefan)
- md/raid5 allocation fixes (Coly)
- zero length array fixes (Gustavo)
- swim3 task state fix (Xu)"
* tag 'for-5.8/drivers-2020-06-01' of git://git.kernel.dk/linux-block: (166 commits)
bcache: configure the asynchronous registertion to be experimental
bcache: asynchronous devices registration
bcache: fix refcount underflow in bcache_device_free()
bcache: Convert pr_<level> uses to a more typical style
bcache: remove redundant variables i and n
lpfc: Fix return value in __lpfc_nvme_ls_abort
lpfc: fix axchg pointer reference after free and double frees
lpfc: Fix pointer checks and comments in LS receive refactoring
nvme: set dma alignment to qword
nvmet: cleanups the loop in nvmet_async_events_process
nvmet: fix memory leak when removing namespaces and controllers concurrently
nvmet-rdma: add metadata/T10-PI support
nvmet: add metadata support for block devices
nvmet: add metadata/T10-PI support
nvme: add Metadata Capabilities enumerations
nvmet: rename nvmet_check_data_len to nvmet_check_transfer_len
nvmet: rename nvmet_rw_len to nvmet_rw_data_len
nvmet: add metadata characteristics for a namespace
nvme-rdma: add metadata/T10-PI support
nvme-rdma: introduce nvme_rdma_sgl structure
...
* Add a support of the media keys on the ASUS laptop UX325JA/UX425JA
* ASUS WMI driver can now handle 2-in-1 models T100TA, T100CHI, T100HA, T200TA
* Big refactoring of Intel SCU driver with Elkhart Lake support has been added
* Slim Bootloarder firmware update signaling WMI driver has been added
* Thinkpad ACPI driver can handle dual fan configuration on new P and X models
* Touchscreen DMI driver has been extended to support
- MP-man MPWIN895CL tablet
- ONDA V891 v5 tablet
- techBite Arc 11.6
- Trekstor Twin 10.1
- Trekstor Yourbook C11B
- Vinga J116
* Virtual Button driver got a few fixes to detect mode of 2-in-1 tablet models
* Intel Speed Select tools update
* Plenty of small cleanups here and there
The following is an automated git shortlog grouped by driver:
acerhdf:
- replace space by * in modalias
New drivers:
- Add Elkhart Lake SCU/PMC support
- Add Slim Bootloader firmware update signaling driver
asus-laptop:
- Drop duplicate check for led_classdev_unregister()
asus-nb-wmi:
- Revert "Do not load on Asus T100TA and T200TA"
- Do not load on Asus T100TA and T200TA
asus-wmi:
- Ignore WMI events with code 0x79
- Add support for SW_TABLET_MODE
- Move asus_wmi_input_init and _exit lower in the file
- Drop duplicate check for led_classdev_unregister()
- Reserve more space for struct bias_args
- remove redundant initialization of variable status
dcdbas:
- Check SMBIOS for protected buffer address
dell-laptop:
- don't register micmute LED if there is no token
dell-wmi:
- Ignore keyboard attached / detached events
device property:
- export set_secondary_fwnode() to modules
eeepc-laptop:
- Drop duplicate check for led_classdev_unregister()
hp-wmi:
- Introduce HPWMI_POWER_FW_OR_HW as convenient shortcut
- Convert simple_strtoul() to kstrtou32()
- Refactor postcode_store() to follow standard patterns
intel_cht_int33fe:
- Fix spelling issues
- Switch to use acpi_dev_hid_uid_match()
- Convert to use set_secondary_fwnode()
- Convert software node array to group
intel-hid:
- Add a quirk to support HP Spectre X2 (2015)
intel_mid_powerbtn:
- Convert to use new SCU IPC API
intel_pmc_core:
- avoid unused-function warnings
- Change Jasper Lake S0ix debug reg map back to ICL
intel_pmc_ipc:
- Convert to MFD
- Move PCI IDs to intel_scu_pcidrv.c
- Drop intel_pmc_ipc_command()
- Start using SCU IPC
intel_scu_ipc:
- Add managed function to register SCU IPC
- Introduce new SCU IPC API
- Move legacy SCU IPC API to a separate header
- Log more information if SCU IPC command fails
- Split out SCU IPC functionality from the SCU driver
intel_scu_ipcutil:
- Convert to use new SCU IPC API
intel-speed-select:
- Fix speed-select-base-freq-properties output on CLX-N
intel_telemetry:
- Add telemetry_get_pltdata()
- Convert to use new SCU IPC API
intel-vbtn:
- Only blacklist SW_TABLET_MODE on the 9 / "Laptop" chasis-type
- Detect switch position before registering the input-device
- Move detect_tablet_mode() to higher in the file
- Fix probe failure on devices with only switches
- Also handle tablet-mode switch on "Detachable" and "Portable" chassis-types
- Do not advertise switches to userspace if they are not there
- Split keymap into buttons and switches parts
- Use acpi_evaluate_integer()
ISST:
- Increase timeout
lg-laptop:
- Drop duplicate check for led_classdev_unregister()
MAINTAINERS:
- Add me as maintainer of Intel SCU drivers
- Update entry for Intel Broxton PMC driver
Merges of immutable branches:
- Merge branch 'for-next'
- Merge branch 'ib-mfd-x86-usb-watchdog-v5.7'
- Merge branch 'ib-pdx86-properties'
mfd:
- intel_soc_pmic_mrfld: Convert to use new SCU IPC API
- intel_soc_pmic_bxtwc: Convert to use new SCU IPC API
- intel_soc_pmic: Add SCU IPC member to struct intel_soc_pmic
samsung-laptop:
- Drop duplicate check for led_classdev_unregister()
software node:
- Allow register and unregister software node groups
sony-laptop:
- Make resuming thermal profile safer
- SNC calls should handle BUFFER types
thinkpad_acpi:
- Replace custom approach by kstrtoint()
- Use strndup_user() in dispatch_proc_write()
- Replace next_cmd(&buf) with strsep(&buf, ",")
- Drop duplicate check for led_classdev_unregister()
- Remove always false 'value < 0' statement
- Add support for dual fan control
tools/power/x86/intel-speed-select:
- Fix invalid core mask
- Increase CPU count
- Fix json perf-profile output output
- Update version
- Enable clos for turbo-freq enable
- Fix CLX-N package information output
- Check support status before enable
- Change debug to error
toshiba_acpi:
- Drop duplicate check for led_classdev_unregister()
touchscreen_dmi:
- Update Trekstor Twin 10.1 entry
- Add info for the Trekstor Yourbook C11B
- Drop comma in terminator line
- add Vinga J116 touchscreen
- Add info for the ONDA V891 v5 tablet
- Add touchscreen info for techBite Arc 11.6.
- Add info for the MP-man MPWIN895CL tablet
usb:
- typec: mux: Convert the Intel PMC Mux driver to use new SCU IPC API
watchdog:
- iTCO: fix link error
- intel-mid_wdt: Convert to use new SCU IPC API
wmi:
- Describe function parameters
- Fix indentation in some cases
- Replace UUID redefinitions by their originals
x86/platform/intel-mid:
- Add empty stubs for intel_scu_devices_[create|destroy]()
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEqaflIX74DDDzMJJtb7wzTHR8rCgFAl7WCcoACgkQb7wzTHR8
rCi+Pg//dDpMXTxCcXivHZPJHwuAxbwPeJRV9uDKKBSnKqfxyYu37oQf8AQiLTsL
PZOAIiwlrXw0Jd+EH79zN2DyCujBg16B6mf4dx3fMK95OWhPoslofyKRwl8kOBP5
QRZVpuwo6ayKwXV3cyFwWjXyWYJFL7+J3x+jjBmufBsoDJTn9edOCUa3oeHG0BYB
4A91pVKwtfNqqdL/pwd+A9mEZrFJnVilyPRoxTipbpPJqvWQi9dYgb3wHKt/1NM3
xPNd1GQHCI0Of4NGChszY0XdN4SyxFuyLmn1mogYq82r084QA4pLROb0+VFD2npd
DQ4jxJqOwQDtC3gm789OeN6bZ0qnkO9HBwEmzVH7rwiajZxGW7U5rCgNYBahlTgr
gY4kXIBXyOCO2/bItmrSvWDNBvVxD/THCfL4Q/cn6bNTy4TLTHAl2psQcsXIBT6/
Z5SdmHMhxc80eDAOTtSJj0ODeDGvAgbV20n+X260FFAsefDBuXkYMHEaRBf9n2LJ
8k9tauXZ6JdIc4K8/K+BaVl761Okl6PJPMTL7JsFqueHpyzZS7WclCYH5QQ1iN56
10QzddSGp+4HfFFCG2cVkjXG2AnUgT3kQgEOHyLIxp6yKY1PghFXHTEmrLuheYum
jK93qSva5tvvZzy9UejXXsIkDyg76zaIla3rmEEYAmgzPDawR9I=
=pprB
-----END PGP SIGNATURE-----
Merge tag 'platform-drivers-x86-v5.8-1' of git://git.infradead.org/linux-platform-drivers-x86
Pull x86 platform driver updates from Andy Shevchenko:
- Add a support of the media keys on the ASUS laptop UX325JA/UX425JA
- ASUS WMI driver can now handle 2-in-1 models T100TA, T100CHI, T100HA,
T200TA
- Big refactoring of Intel SCU driver with Elkhart Lake support has
been added
- Slim Bootloarder firmware update signaling WMI driver has been added
- Thinkpad ACPI driver can handle dual fan configuration on new P and X
models
- Touchscreen DMI driver has been extended to support
- MP-man MPWIN895CL tablet
- ONDA V891 v5 tablet
- techBite Arc 11.6
- Trekstor Twin 10.1
- Trekstor Yourbook C11B
- Vinga J116
- Virtual Button driver got a few fixes to detect mode of 2-in-1 tablet
models
- Intel Speed Select tools update
- Plenty of small cleanups here and there
* tag 'platform-drivers-x86-v5.8-1' of git://git.infradead.org/linux-platform-drivers-x86: (89 commits)
platform/x86: dcdbas: Check SMBIOS for protected buffer address
platform/x86: asus_wmi: Reserve more space for struct bias_args
platform/x86: intel-vbtn: Only blacklist SW_TABLET_MODE on the 9 / "Laptop" chasis-type
platform/x86: intel-hid: Add a quirk to support HP Spectre X2 (2015)
platform/x86: touchscreen_dmi: Update Trekstor Twin 10.1 entry
platform/x86: touchscreen_dmi: Add info for the Trekstor Yourbook C11B
platform/x86: hp-wmi: Introduce HPWMI_POWER_FW_OR_HW as convenient shortcut
platform/x86: hp-wmi: Convert simple_strtoul() to kstrtou32()
platform/x86: hp-wmi: Refactor postcode_store() to follow standard patterns
platform/x86: acerhdf: replace space by * in modalias
platform/x86: ISST: Increase timeout
tools/power/x86/intel-speed-select: Fix invalid core mask
tools/power/x86/intel-speed-select: Increase CPU count
tools/power/x86/intel-speed-select: Fix json perf-profile output output
platform/x86: dell-wmi: Ignore keyboard attached / detached events
platform/x86: dell-laptop: don't register micmute LED if there is no token
platform/x86: thinkpad_acpi: Replace custom approach by kstrtoint()
platform/x86: thinkpad_acpi: Use strndup_user() in dispatch_proc_write()
platform/x86: thinkpad_acpi: Replace next_cmd(&buf) with strsep(&buf, ",")
platform/x86: intel-vbtn: Detect switch position before registering the input-device
...
Merge updates from Andrew Morton:
"A few little subsystems and a start of a lot of MM patches.
Subsystems affected by this patch series: squashfs, ocfs2, parisc,
vfs. With mm subsystems: slab-generic, slub, debug, pagecache, gup,
swap, memcg, pagemap, memory-failure, vmalloc, kasan"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (128 commits)
kasan: move kasan_report() into report.c
mm/mm_init.c: report kasan-tag information stored in page->flags
ubsan: entirely disable alignment checks under UBSAN_TRAP
kasan: fix clang compilation warning due to stack protector
x86/mm: remove vmalloc faulting
mm: remove vmalloc_sync_(un)mappings()
x86/mm/32: implement arch_sync_kernel_mappings()
x86/mm/64: implement arch_sync_kernel_mappings()
mm/ioremap: track which page-table levels were modified
mm/vmalloc: track which page-table levels were modified
mm: add functions to track page directory modifications
s390: use __vmalloc_node in stack_alloc
powerpc: use __vmalloc_node in alloc_vm_stack
arm64: use __vmalloc_node in arch_alloc_vmap_stack
mm: remove vmalloc_user_node_flags
mm: switch the test_vmalloc module to use __vmalloc_node
mm: remove __vmalloc_node_flags_caller
mm: remove both instances of __vmalloc_node_flags
mm: remove the prot argument to __vmalloc_node
mm: remove the pgprot argument to __vmalloc
...
Remove fault handling on vmalloc areas, as the vmalloc code now takes
care of synchronizing changes to all page-tables in the system.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20200515140023.25469-8-joro@8bytes.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These functions are not needed anymore because the vmalloc and ioremap
mappings are now synchronized when they are created or torn down.
Remove all callers and function definitions.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20200515140023.25469-7-joro@8bytes.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Implement the function to sync changes in vmalloc and ioremap ranges to
all page-tables.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20200515140023.25469-6-joro@8bytes.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Implement the function to sync changes in vmalloc and ioremap ranges to
all page-tables.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20200515140023.25469-5-joro@8bytes.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "decruft the vmalloc API", v2.
Peter noticed that with some dumb luck you can toast the kernel address
space with exported vmalloc symbols.
I used this as an opportunity to decruft the vmalloc.c API and make it
much more systematic. This also removes any chance to create vmalloc
mappings outside the designated areas or using executable permissions
from modules. Besides that it removes more than 300 lines of code.
This patch (of 29):
Use the designated helper for allocating executable kernel memory, and
remove the now unused PAGE_KERNEL_RX define.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Acked-by: Wei Liu <wei.liu@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Airlie <airlied@linux.ie>
Cc: Gao Xiang <xiang@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Link: http://lkml.kernel.org/r/20200414131348.444715-1-hch@lst.de
Link: http://lkml.kernel.org/r/20200414131348.444715-2-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The page table entry is passed in the 'val' argument to note_page(),
however this was previously an "unsigned long" which is fine on 64-bit
platforms. But for 32 bit x86 it is not always big enough to contain a
page table entry which may be 64 bits.
Change the type to u64 to ensure that it is always big enough.
[akpm@linux-foundation.org: fix riscv]
Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Steven Price <steven.price@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200521152308.33096-3-steven.price@arm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Fix W+X debug feature on x86"
Jan alerted me[1] that the W+X detection debug feature was broken in x86
by my change[2] to switch x86 to use the generic ptdump infrastructure.
Fundamentally the approach of trying to move the calculation of
effective permissions into note_page() was broken because note_page() is
only called for 'leaf' entries and the effective permissions are passed
down via the internal nodes of the page tree. The solution I've taken
here is to create a new (optional) callback which is called for all
nodes of the page tree and therefore can calculate the effective
permissions.
Secondly on some configurations (32 bit with PAE) "unsigned long" is not
large enough to store the table entries. The fix here is simple - let's
just use a u64.
[1] https://lore.kernel.org/lkml/d573dc7e-e742-84de-473d-f971142fa319@suse.com/
[2] 2ae27137b2 ("x86: mm: convert dump_pagetables to use walk_page_range")
This patch (of 2):
By switching the x86 page table dump code to use the generic code the
effective permissions are no longer calculated correctly because the
note_page() function is only called for *leaf* entries. To calculate
the actual effective permissions it is necessary to observe the full
hierarchy of the page tree.
Introduce a new callback for ptdump which is called for every entry and
can therefore update the prot_levels array correctly. note_page() can
then simply access the appropriate element in the array.
[steven.price@arm.com: make the assignment conditional on val != 0]
Link: http://lkml.kernel.org/r/430c8ab4-e7cd-6933-dde6-087fac6db872@arm.com
Fixes: 2ae27137b2 ("x86: mm: convert dump_pagetables to use walk_page_range")
Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Steven Price <steven.price@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200521152308.33096-1-steven.price@arm.com
Link: http://lkml.kernel.org/r/20200521152308.33096-2-steven.price@arm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull vfs updates from Al Viro:
"Assorted patches from Miklos.
An interesting part here is /proc/mounts stuff..."
The "/proc/mounts stuff" is using a cursor for keeeping the location
data while traversing the mount listing.
Also probably worth noting is the addition of faccessat2(), which takes
an additional set of flags to specify how the lookup is done
(AT_EACCESS, AT_SYMLINK_NOFOLLOW, AT_EMPTY_PATH).
* 'from-miklos' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
vfs: add faccessat2 syscall
vfs: don't parse "silent" option
vfs: don't parse "posixacl" option
vfs: don't parse forbidden flags
statx: add mount_root
statx: add mount ID
statx: don't clear STATX_ATIME on SB_RDONLY
uapi: deprecate STATX_ALL
utimensat: AT_EMPTY_PATH support
vfs: split out access_override_creds()
proc/mounts: add cursor
aio: fix async fsync creds
vfs: allow unprivileged whiteout creation
Pull uaccess/coredump updates from Al Viro:
"set_fs() removal in coredump-related area - mostly Christoph's
stuff..."
* 'work.set_fs-exec' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
binfmt_elf_fdpic: remove the set_fs(KERNEL_DS) in elf_fdpic_core_dump
binfmt_elf: remove the set_fs(KERNEL_DS) in elf_core_dump
binfmt_elf: remove the set_fs in fill_siginfo_note
signal: refactor copy_siginfo_to_user32
powerpc/spufs: simplify spufs core dumping
powerpc/spufs: stop using access_ok
powerpc/spufs: fix copy_to_user while atomic
Pull uaccess/csum updates from Al Viro:
"Regularize the sitation with uaccess checksum primitives:
- fold csum_partial_... into csum_and_copy_..._user()
- on x86 collapse several access_ok()/stac()/clac() into
user_access_begin()/user_access_end()"
* 'uaccess.csum' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
default csum_and_copy_to_user(): don't bother with access_ok()
take the dummy csum_and_copy_from_user() into net/checksum.h
arm: switch to csum_and_copy_from_user()
sh32: convert to csum_and_copy_from_user()
m68k: convert to csum_and_copy_from_user()
xtensa: switch to providing csum_and_copy_from_user()
sparc: switch to providing csum_and_copy_from_user()
parisc: turn csum_partial_copy_from_user() into csum_and_copy_from_user()
alpha: turn csum_partial_copy_from_user() into csum_and_copy_from_user()
ia64: turn csum_partial_copy_from_user() into csum_and_copy_from_user()
ia64: csum_partial_copy_nocheck(): don't abuse csum_partial_copy_from_user()
x86: switch 32bit csum_and_copy_to_user() to user_access_{begin,end}()
x86: switch both 32bit and 64bit to providing csum_and_copy_from_user()
x86_64: csum_..._copy_..._user(): switch to unsafe_..._user()
get rid of csum_partial_copy_to_user()
set from Mauro toward the completion of the RST conversion. I *really*
hope we are getting close to the end of this. Meanwhile, those patches
reach pretty far afield to update document references around the tree;
there should be no actual code changes there. There will be, alas, more of
the usual trivial merge conflicts.
Beyond that we have more translations, improvements to the sphinx
scripting, a number of additions to the sysctl documentation, and lots of
fixes.
-----BEGIN PGP SIGNATURE-----
iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAl7VId8PHGNvcmJldEBs
d24ubmV0AAoJEBdDWhNsDH5Yq/gH/iaDgirQZV6UZ2v9sfwQNYolNpf2sKAuOZjd
bPFB7WJoMQbKwQEvYrAUL2+5zPOcLYuIfzyOfo1BV1py+EyKbACcKjI4AedxfJF7
+NchmOBhlEqmEhzx2U08HRc4/8J223WG17fJRVsV3p+opJySexSFeQucfOciX5NR
RUCxweWWyg/FgyqjkyMMTtsePqZPmcT5dWTlVXISlbWzcv5NFhuJXnSrw8Sfzcmm
SJMzqItv3O+CabnKQ8kMLV2PozXTMfjeWH47ZUK0Y8/8PP9+cvqwFzZ0UDQJ1Xaz
oyW/TqmunaXhfMsMFeFGSwtfgwRHvXdxkQdtwNHvo1dV4dzTvDw=
=fDC/
-----END PGP SIGNATURE-----
Merge tag 'docs-5.8' of git://git.lwn.net/linux
Pull documentation updates from Jonathan Corbet:
"A fair amount of stuff this time around, dominated by yet another
massive set from Mauro toward the completion of the RST conversion. I
*really* hope we are getting close to the end of this. Meanwhile,
those patches reach pretty far afield to update document references
around the tree; there should be no actual code changes there. There
will be, alas, more of the usual trivial merge conflicts.
Beyond that we have more translations, improvements to the sphinx
scripting, a number of additions to the sysctl documentation, and lots
of fixes"
* tag 'docs-5.8' of git://git.lwn.net/linux: (130 commits)
Documentation: fixes to the maintainer-entry-profile template
zswap: docs/vm: Fix typo accept_threshold_percent in zswap.rst
tracing: Fix events.rst section numbering
docs: acpi: fix old http link and improve document format
docs: filesystems: add info about efivars content
Documentation: LSM: Correct the basic LSM description
mailmap: change email for Ricardo Ribalda
docs: sysctl/kernel: document unaligned controls
Documentation: admin-guide: update bug-hunting.rst
docs: sysctl/kernel: document ngroups_max
nvdimm: fixes to maintainter-entry-profile
Documentation/features: Correct RISC-V kprobes support entry
Documentation/features: Refresh the arch support status files
Revert "docs: sysctl/kernel: document ngroups_max"
docs: move locking-specific documents to locking/
docs: move digsig docs to the security book
docs: move the kref doc into the core-api book
docs: add IRQ documentation at the core-api book
docs: debugging-via-ohci1394.txt: add it to the core-api book
docs: fix references for ipmi.rst file
...
- Branch Target Identification (BTI)
* Support for ARMv8.5-BTI in both user- and kernel-space. This
allows branch targets to limit the types of branch from which
they can be called and additionally prevents branching to
arbitrary code, although kernel support requires a very recent
toolchain.
* Function annotation via SYM_FUNC_START() so that assembly
functions are wrapped with the relevant "landing pad"
instructions.
* BPF and vDSO updates to use the new instructions.
* Addition of a new HWCAP and exposure of BTI capability to
userspace via ID register emulation, along with ELF loader
support for the BTI feature in .note.gnu.property.
* Non-critical fixes to CFI unwind annotations in the sigreturn
trampoline.
- Shadow Call Stack (SCS)
* Support for Clang's Shadow Call Stack feature, which reserves
platform register x18 to point at a separate stack for each
task that holds only return addresses. This protects function
return control flow from buffer overruns on the main stack.
* Save/restore of x18 across problematic boundaries (user-mode,
hypervisor, EFI, suspend, etc).
* Core support for SCS, should other architectures want to use it
too.
* SCS overflow checking on context-switch as part of the existing
stack limit check if CONFIG_SCHED_STACK_END_CHECK=y.
- CPU feature detection
* Removed numerous "SANITY CHECK" errors when running on a system
with mismatched AArch32 support at EL1. This is primarily a
concern for KVM, which disabled support for 32-bit guests on
such a system.
* Addition of new ID registers and fields as the architecture has
been extended.
- Perf and PMU drivers
* Minor fixes and cleanups to system PMU drivers.
- Hardware errata
* Unify KVM workarounds for VHE and nVHE configurations.
* Sort vendor errata entries in Kconfig.
- Secure Monitor Call Calling Convention (SMCCC)
* Update to the latest specification from Arm (v1.2).
* Allow PSCI code to query the SMCCC version.
- Software Delegated Exception Interface (SDEI)
* Unexport a bunch of unused symbols.
* Minor fixes to handling of firmware data.
- Pointer authentication
* Add support for dumping the kernel PAC mask in vmcoreinfo so
that the stack can be unwound by tools such as kdump.
* Simplification of key initialisation during CPU bringup.
- BPF backend
* Improve immediate generation for logical and add/sub
instructions.
- vDSO
- Minor fixes to the linker flags for consistency with other
architectures and support for LLVM's unwinder.
- Clean up logic to initialise and map the vDSO into userspace.
- ACPI
- Work around for an ambiguity in the IORT specification relating
to the "num_ids" field.
- Support _DMA method for all named components rather than only
PCIe root complexes.
- Minor other IORT-related fixes.
- Miscellaneous
* Initialise debug traps early for KGDB and fix KDB cacheflushing
deadlock.
* Minor tweaks to early boot state (documentation update, set
TEXT_OFFSET to 0x0, increase alignment of PE/COFF sections).
* Refactoring and cleanup
-----BEGIN PGP SIGNATURE-----
iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAl7U9csQHHdpbGxAa2Vy
bmVsLm9yZwAKCRC3rHDchMFjNLBHCACs/YU4SM7Om5f+7QnxIKao5DBr2CnGGvdC
yTfDghFDTLQVv3MufLlfno3yBe5G8sQpcZfcc+hewfcGoMzVZXu8s7LzH6VSn9T9
jmT3KjDMrg0RjSHzyumJp2McyelTk0a4FiKArSIIKsJSXUyb1uPSgm7SvKVDwEwU
JGDzL9IGilmq59GiXfDzGhTZgmC37QdwRoRxDuqtqWQe5CHoRXYexg87HwBKOQxx
HgU9L7ehri4MRZfpyjaDrr6quJo3TVnAAKXNBh3mZAskVS9ZrfKpEH0kYWYuqybv
znKyHRecl/rrGePV8RTMtrwnSdU26zMXE/omsVVauDfG9hqzqm+Q
=w3qi
-----END PGP SIGNATURE-----
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:
"A sizeable pile of arm64 updates for 5.8.
Summary below, but the big two features are support for Branch Target
Identification and Clang's Shadow Call stack. The latter is currently
arm64-only, but the high-level parts are all in core code so it could
easily be adopted by other architectures pending toolchain support
Branch Target Identification (BTI):
- Support for ARMv8.5-BTI in both user- and kernel-space. This allows
branch targets to limit the types of branch from which they can be
called and additionally prevents branching to arbitrary code,
although kernel support requires a very recent toolchain.
- Function annotation via SYM_FUNC_START() so that assembly functions
are wrapped with the relevant "landing pad" instructions.
- BPF and vDSO updates to use the new instructions.
- Addition of a new HWCAP and exposure of BTI capability to userspace
via ID register emulation, along with ELF loader support for the
BTI feature in .note.gnu.property.
- Non-critical fixes to CFI unwind annotations in the sigreturn
trampoline.
Shadow Call Stack (SCS):
- Support for Clang's Shadow Call Stack feature, which reserves
platform register x18 to point at a separate stack for each task
that holds only return addresses. This protects function return
control flow from buffer overruns on the main stack.
- Save/restore of x18 across problematic boundaries (user-mode,
hypervisor, EFI, suspend, etc).
- Core support for SCS, should other architectures want to use it
too.
- SCS overflow checking on context-switch as part of the existing
stack limit check if CONFIG_SCHED_STACK_END_CHECK=y.
CPU feature detection:
- Removed numerous "SANITY CHECK" errors when running on a system
with mismatched AArch32 support at EL1. This is primarily a concern
for KVM, which disabled support for 32-bit guests on such a system.
- Addition of new ID registers and fields as the architecture has
been extended.
Perf and PMU drivers:
- Minor fixes and cleanups to system PMU drivers.
Hardware errata:
- Unify KVM workarounds for VHE and nVHE configurations.
- Sort vendor errata entries in Kconfig.
Secure Monitor Call Calling Convention (SMCCC):
- Update to the latest specification from Arm (v1.2).
- Allow PSCI code to query the SMCCC version.
Software Delegated Exception Interface (SDEI):
- Unexport a bunch of unused symbols.
- Minor fixes to handling of firmware data.
Pointer authentication:
- Add support for dumping the kernel PAC mask in vmcoreinfo so that
the stack can be unwound by tools such as kdump.
- Simplification of key initialisation during CPU bringup.
BPF backend:
- Improve immediate generation for logical and add/sub instructions.
vDSO:
- Minor fixes to the linker flags for consistency with other
architectures and support for LLVM's unwinder.
- Clean up logic to initialise and map the vDSO into userspace.
ACPI:
- Work around for an ambiguity in the IORT specification relating to
the "num_ids" field.
- Support _DMA method for all named components rather than only PCIe
root complexes.
- Minor other IORT-related fixes.
Miscellaneous:
- Initialise debug traps early for KGDB and fix KDB cacheflushing
deadlock.
- Minor tweaks to early boot state (documentation update, set
TEXT_OFFSET to 0x0, increase alignment of PE/COFF sections).
- Refactoring and cleanup"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (148 commits)
KVM: arm64: Move __load_guest_stage2 to kvm_mmu.h
KVM: arm64: Check advertised Stage-2 page size capability
arm64/cpufeature: Add get_arm64_ftr_reg_nowarn()
ACPI/IORT: Remove the unused __get_pci_rid()
arm64/cpuinfo: Add ID_MMFR4_EL1 into the cpuinfo_arm64 context
arm64/cpufeature: Add remaining feature bits in ID_AA64PFR1 register
arm64/cpufeature: Add remaining feature bits in ID_AA64PFR0 register
arm64/cpufeature: Add remaining feature bits in ID_AA64ISAR0 register
arm64/cpufeature: Add remaining feature bits in ID_MMFR4 register
arm64/cpufeature: Add remaining feature bits in ID_PFR0 register
arm64/cpufeature: Introduce ID_MMFR5 CPU register
arm64/cpufeature: Introduce ID_DFR1 CPU register
arm64/cpufeature: Introduce ID_PFR2 CPU register
arm64/cpufeature: Make doublelock a signed feature in ID_AA64DFR0
arm64/cpufeature: Drop TraceFilt feature exposure from ID_DFR0 register
arm64/cpufeature: Add explicit ftr_id_isar0[] for ID_ISAR0 register
arm64: mm: Add asid_gen_match() helper
firmware: smccc: Fix missing prototype warning for arm_smccc_version_init
arm64: vdso: Fix CFI directives in sigreturn trampoline
arm64: vdso: Don't prefix sigreturn trampoline with a BTI C instruction
...
functionality intended.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl7VNX4RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1g43A/+K+TCmm8+G0DSL5JSHiI93J9yu9ac3yEU
4V9eOxcrQVEPqZUEgGNl8yucMXsTj+trT1J0ZKygoVYzpzFSsJzeyQ97CfNa25x4
AIKrVewkSBtLS4Fof1jfSgapWlY54OldMWfLNXInMPxekD0gCRhIp2hmidxwZouX
fyMsZGw9YjEPNfzHDjfADymRLOVJHG3rpd8hjrbNLblMR+xaleLHezFwn7+6PgXl
FaENy3MVubziTOWr5AT39xG3zKide1boeDI/eszD1pFu4DeBc5/7u8tYglhqGj/i
qCoojXUJxxEK/NRFO0zSXKG9vb1ZLKERRFmPbD4xbfgPPKMQRFGf2JcSfF6HuK/o
reay1MWMIapD2E3TSoJAcLaKIk/Z8nEzVXhff3bmU5Zskbhprgqz/8LblyfNdJZ3
SlnnQxpfnc+Up36EU6yk42Dy2x9IW7Ew04rWVuWzF7VixbVlKfK8MpjNSAhyduuO
6rs0YnIW2PIt7cjskrT5HEAvUVFzd2EaY327+L9fb56Mrb3fzg1T2ihVnzAs9r2s
GoYuPL9uFnHZS19MclRq8In7dFviypeL9IX9FcBCaGuqGlWdSIahLW8OyT9tOqIw
Wn7bpSHz8GM9OZIBs3u6PDE7qwPQkRTFoJzt8H5PtcUQbZSwOOOV4CjvCG4xQ14c
j+xSwhXUpxg=
=Qrqk
-----END PGP SIGNATURE-----
Merge tag 'x86-vdso-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 vdso updates from Ingo Molnar:
"Clean up various aspects of the vDSO code, no change in functionality
intended"
* tag 'x86-vdso-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/vdso/Makefile: Add vobjs32
x86/vdso/vdso2c: Convert iterators to unsigned
x86/vdso/vdso2c: Correct error messages on file open
it removes unnecessary functions and cleans up the rest.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl7VNO4RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1i61RAAogLRVi4ga4vmTk5SqUqtR4pupbHJv5IM
IjkQN0HZ3+Oi6kRxwuOQ9xOOzQWm8GntkZeyN5FA73H7x+bYdU12MIKKTEDcW3xp
Mg9FtzfeL0V4YmNkmlnIXycyYA3nBdSxnI/OL/58J9CLT15qXYkWjyvkbI2aJ3qL
U8xM5cTTvhoARjd43o0eAfekTg0XdUAsgvO0vOM5+I1HrQP8SR3ZIFaMSR+MfAQx
Nbz/UVUSDJ8BNzmS/CfFLFm0F2dkphlLC0r6eAOFZAYSIax0bRVklxV9qdScEQMK
bkVKXGanCzVTBVM1HXDycLJaILlqcS18tK+VqNIAR5x2BXmaSG8jqwCW4NM0tcaN
c5zemNsqnAH/VzxeFjE2BcDQnA1nkgj75Vm9O81HMQfyqR16M5pBzRXY/qBqslya
vX5wLoD962BiVtbELqW6v+Ot29xMYlCLLlTbLHaWQraJS3TjuAvL0/sOFdWgs63F
N7a+BLvikfYoKCS8IxW87BFBysy9nhv/4UwdaX5RpIQ1wgx/EJLDaowrM+L6Dzmw
bhQ3AgRZGNZCBDm3uGU/LigTTxN93h5KqKnuUKv3H+tNvEKxPKEAqPyvL8fO8G0U
BTJiM/XRIzQrkrmwCKqON1iKRjKB2fKklxiq4REIoSqKfeiQ3SVBIHENFnH96Ekf
G9qptwFEZYI=
=L8Vy
-----END PGP SIGNATURE-----
Merge tag 'x86-platform-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 platform updates from Ingo Molnar:
"This tree cleans up various aspects of the UV platform support code,
it removes unnecessary functions and cleans up the rest"
* tag 'x86-platform-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/apic/uv: Remove code for unused distributed GRU mode
x86/platform/uv: Remove the unused _uv_cpu_blade_processor_id() macro
x86/platform/uv: Unexport uv_apicid_hibits
x86/platform/uv: Remove _uv_hub_info_check()
x86/platform/uv: Simplify uv_send_IPI_one()
x86/platform/uv: Mark uv_min_hub_revision_id static
x86/platform/uv: Mark is_uv_hubless() static
x86/platform/uv: Remove the UV*_HUB_IS_SUPPORTED macros
x86/platform/uv: Unexport symbols only used by x2apic_uv_x.c
x86/platform/uv: Unexport sn_coherency_id
x86/platform/uv: Remove the uv_partition_coherence_id() macro
x86/platform/uv: Mark uv_bios_call() and uv_bios_call_irqsave() static
which is a feature that allows kernel-only data to be automatically
saved/restored by the FPU context switching code.
CPU features that can be supported this way are Intel PT, 'PASID' and
CET features.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl7VMZgRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1jmAQ/7BJpyAHUjFJdChtkvUmLcBgI2qnxP7rc8
Eh/tSo4PKh484Uqb4WY6XAHIAPBzEt3rHJG3fdaavzlUl98YJCdD9tstfwMPcCQ4
L4c2Ru+h+mPQCMOZUctOphPjDzGWPzR4IhceH6gqhoS4vg9EqgN4o158x4jW6KFN
Jlocp9CMfIaGSmaMlRrIUZ4Dj3mgboqqHsuCaibtaKAMK6LqZQDViTEal4mNbESX
KQPOFpKrhoq6Jtzzer7fLPY2qb6kkLrL03X5IUGFP5UxigSejnfrI9SZpAuPP9S0
kdN04Jo0T2aBIAikBTVhDWdLMJk19qeu7YXBrFEVbyhZHl1HdDqOhMdWPOp1GH9W
CtGUalbIvz/5FbXuUImiiNh/bw2FxYjHsrDguW96IvMVFteucrFg9QyL+taYb1cV
WqWdpIC0VoMuQxQI5FBWu4Bb/cLNV9VCxWAZjZQ806kwmyDxldsw5mucMGmH3+bO
LD6bwRShSMRzI9bzcJSG+Z3y7Fe8b5IGNjCjzgPb88ezffBEFHzIEKdCL6QTNlRF
6UgSGbRs41SqXwNw5tdQQNwPpDO73p+KVRGoEzyMJvojLKRGTcOHHUDriGZ30MNX
3oHvLf5+dNrLC/frbOqUmQ7doBQOplR5VxlZVwwqkdpPw13Jf5zn4ewzriTOmKCq
mEHMQmbkyi4=
=M+BC
-----END PGP SIGNATURE-----
Merge tag 'x86-fpu-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 FPU updates from Ingo Molnar:
"Most of the changes here related to 'XSAVES supervisor state' support,
which is a feature that allows kernel-only data to be automatically
saved/restored by the FPU context switching code.
CPU features that can be supported this way are Intel PT, 'PASID' and
CET features"
* tag 'x86-fpu-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/fpu/xstate: Restore supervisor states for signal return
x86/fpu/xstate: Preserve supervisor states for the slow path in __fpu__restore_sig()
x86/fpu: Introduce copy_supervisor_to_kernel()
x86/fpu/xstate: Update copy_kernel_to_xregs_err() for supervisor states
x86/fpu/xstate: Update sanitize_restored_xstate() for supervisor xstates
x86/fpu/xstate: Define new functions for clearing fpregs and xstates
x86/fpu/xstate: Introduce XSAVES supervisor states
x86/fpu/xstate: Separate user and supervisor xfeatures mask
x86/fpu/xstate: Define new macros for supervisor and user xstates
x86/fpu/xstate: Rename validate_xstate_header() to validate_user_xstate_header()
- Extend the x86 family/model macros with a steppings dimension,
because x86 life isn't complex enough and Intel uses steppings to
differentiate between different CPUs. :-/
- Convert the TSC deadline timer quirks to the steppings macros.
- Clean up asm mnemonics.
- Fix the handling of an AMD erratum, or in other words, fix a kernel erratum.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl7VL2wRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hxYQ//dic//qQ+4GOL1wP1Qj8EiGOzaWdynoia
oDi7Si1I9vy58iCCRgkQKmxEwfnHcM5+NC6/091S4BD2IE6o+iD1YhPsGZK8DT4Y
FmeD8pgtx5LMJFMBe6KRyek1s0JblP6v0Q0BwUk7YtV6k0oSP+f/2n5BGj2+P7YH
3Iw438M5JhIrzVp3PnCgJoZkSm9iRnZqbBtR8nd2SO+vx8M75cX27LL6fdaCypRj
wH9w6+J2NhAZStmEv54LKOdO5RAPJjvatbTZFMEFdceAGFEbHPJIees7paoC+DTP
3BuhzF/9ghDNKly6Zz3PtyNNDP1vglZ1W9dJkCfTXUWlZKbQV94Yk+JbP5mndxqn
+f3eD/dInofHiCeAh1Sfj3BCGdOSjgFMBB57CKkCy4LehXwJ9C2eBcbxd4XMfEkd
h0EywZrp1L10AxDHtq5x82xf1fwfTDyvlYmJrBshXfiitaySn+mPVJMuj3wvqJSP
WKbJS4HfkekIaf9WoUA+Ay6FJdY7nNirViRrQEZVmDPTV0EDfcaNM5p6Ttkja3Ph
VoVa8Ms8FRqTfh6xCfckYR+vI44U+AFNLM6YFyetGYc0yVXNzg3vLy2DbqLRolWy
t1upDdNf1TMJg4BaMrBzZgDg/uI2BM3jeOj69U0cboO2JhJjxjl3qPeiYDKD50MK
Z1Nho933894=
=QKjn
-----END PGP SIGNATURE-----
Merge tag 'x86-cpu-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cpu updates from Ingo Molnar:
"Misc updates:
- Extend the x86 family/model macros with a steppings dimension,
because x86 life isn't complex enough and Intel uses steppings to
differentiate between different CPUs. :-/
- Convert the TSC deadline timer quirks to the steppings macros.
- Clean up asm mnemonics.
- Fix the handling of an AMD erratum, or in other words, fix a kernel
erratum"
* tag 'x86-cpu-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu: Use RDRAND and RDSEED mnemonics in archrandom.h
x86/cpu: Use INVPCID mnemonic in invpcid.h
x86/cpu/amd: Make erratum #1054 a legacy erratum
x86/apic: Convert the TSC deadline timer matching to steppings macro
x86/cpu: Add a X86_MATCH_INTEL_FAM6_MODEL_STEPPINGS() macro
x86/cpu: Add a steppings field to struct x86_cpu_id
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl7VLcQRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iFnhAArGBqco3C2RPQugv7UDDbKEaMvxOGrc5B
kwnyOS/k/yeIkfhT9u11oBuLcaj/Zgw8YCjFyRfaNsorRqnytLyZzZ6PvdCCE3YU
X3DVYgulcdAQnM4bS2e3Kt9ciJvFxB27XNm0AfuyLMUxMqCD+iIO4gJ6TuQNBYy3
dfUMfB1R9OUDW13GCrASe+p1Dw76uaqVngdFWJhnC8Rm49E6gFXq7CLQp5Cka81I
KZeJ8I6ug9p3gqhOIXdi+S6g5CM5jf86Wkk7dOHwHFH7CceFb3FIz7z0n1je4Wgd
L5rYX7+PwfNeZ73GIuvEBN+agJH2K0H/KmnlWNWeZHzc+J12MeruSdSMBIkBOEpn
iSbYAOmDpQLzBjTdZjC8bDqTZf472WrTh4VwN9NxHLucjdC+IqGoTAvnyyEOmZ5o
R7sv7Q++316CVwRhYVXbzwZcqtiinCDE1EkP5nKTo9z3z0kMF5+ce/k7wn5sgZIk
zJq3LXtaToiDoDRAPGxcvFPts9MdC0EI1aKTIjaK/n6i2h/SpJfrTKgANWaldYTe
XJIqlSB43saqf5YAQ3/sY+wnpCRBmmCU+sfKja4C8bH7RuggI3mZS19uhFs0Qctq
Yx5bIXVSBAIqjJtgzQ0WAAZ5LrCpNNyAzb35ZYefQlGyJlx1URKXVBmxa6S99biU
KiYX7Dk5uhQ=
=0ZQd
-----END PGP SIGNATURE-----
Merge tag 'x86-cleanups-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Ingo Molnar:
"Misc cleanups, with an emphasis on removing obsolete/dead code"
* tag 'x86-cleanups-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/spinlock: Remove obsolete ticket spinlock macros and types
x86/mm: Drop deprecated DISCONTIGMEM support for 32-bit
x86/apb_timer: Drop unused declaration and macro
x86/apb_timer: Drop unused TSC calibration
x86/io_apic: Remove unused function mp_init_irq_at_boot()
x86/mm: Stop printing BRK addresses
x86/audit: Fix a -Wmissing-prototypes warning for ia32_classify_syscall()
x86/nmi: Remove edac.h include leftover
mm: Remove MPX leftovers
x86/mm/mmap: Fix -Wmissing-prototypes warnings
x86/early_printk: Remove unused includes
crash_dump: Remove no longer used saved_max_pfn
x86/smpboot: Remove the last ICPU() macro
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl7VK9cRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1ibgw//acOg/6o7HzHS19nEDfRf2grtipPq0lZN
laIBlGNQdyQHoTMbvF4X8hE1VuALdcr+kVCXirvHnTVsE62fqR8KzdTeEPHHSamy
VWZkaOGq+jZiJnM4EZ1j6y0E6Cf9SWU2Zho4Ov/j88s3aYhkYG6EU+8dZMpI2pLU
EqZAqzuZ8lJYDchv+Xbd/dN3p8DoCzbcZ5nJN+mDaHiVruLB3fk3cqBjAhAbvYFO
X2Fk4yNccvHWjGbBNbgoddTRt/ZHC+PhiIGvE+KzcDLZipjUj4M7WxznLGdILFT/
Vpys3Uewa64bQk/GURuxh7A/IjzqohCKq0pLugU3B1FW6nASCUuySbN8KroIiGo8
Vnesc6G4G+KtxJGq18/umSaDoX9RmNM7iyeGt2G3yyV5MFPz83XZmtCVHizY6ayk
PPDB1lPXks3NpdKBgH/SYDfm7GBI3CwH7ttr3+DSl8nfadfIjQtu5hnhdBLeGWj4
AVhWSTyaLfABkRoU+DEg9YbzvcywjNOp0sblIxhxFiPKECymhNdBmljQmW6EMTRg
j1El5pdYp0D+MNyBTewgD033yMm5pLsHZX+aiyG5ULizevemjWrnprzFYFnSYBZY
ivfRnsK7zzWh+cejJJiZKPPR4RDu+VNneCd2PWjqX6VwPd03QjmOI8zw7WeLSbZl
kzzhOThwvdo=
=idS6
-----END PGP SIGNATURE-----
Merge tag 'x86-build-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 build updates from Ingo Molnar:
"Misc dependency fixes, plus a documentation update about memory
protection keys support"
* tag 'x86-build-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/Kconfig: Update config and kernel doc for MPK feature on AMD
x86/boot: Discard .discard.unreachable for arch/x86/boot/compressed/vmlinux
x86/boot/build: Add phony targets in arch/x86/boot/Makefile to PHONY
x86/boot/build: Make 'make bzlilo' not depend on vmlinux or $(obj)/bzImage
x86/boot/build: Add cpustr.h to targets and remove clean-files
- Add the initrdmem= boot option to specify an initrd embedded in RAM (flash most likely)
- Sanitize the CS value earlier during boot, which also fixes SEV-ES.
- Various fixes and smaller cleanups.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl7VKk8RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1i25w/6A8okusHJMXyXMYddRHNiL57x3DcTRsTO
09Wz7e0YrL53HqQEyaqtSam/0VqgSaHDQb/gRb2Ci0G+XzZ3BFYvICVWTW6NcvnA
VSUoHC8Mr83Aq3UfAEcJZZ0bHNuoKymO256v2tZPGCSGgZoxQdoe4/6W1uMxxjLr
NFpeyAm93zTe1+MmA/ZcFxH+xOZPYVPhl7+KgO3muMH/hGoS3Dt+RCuB9VHTgMvf
4mN6IxN3cVHDogt7usdtWjgrYnhY0SjiWo858+MDWsrW5oXifsXLJ5jJr1Ea1nGx
qqVyaCqAVNobOkpsBLHg1DiD/rr9A4sfS/etmAjWsPO6kAx9Mq9+B2DG5fTU/gB+
zd76M3Jl3wyjdy6hPMyiZGlFFM9l3efyp/iYPhFWgPqVlkkOvbO+9FWVDbFtErQw
WpEG2d8KHN4+ph8D04ExeKJKCKaYnAaHKk13fZnjjeQhatyGGAYn6hx+rT/x+onM
2CeRG/+KcnlzKgXqYX6/YT++XlaCKgMntO/FdLT99/4CD92rqQdhwJ6JNH1U8nXO
LWjrV5ZH6R3n5Hr5+J/Kcd9/kIfAqWG3t/eiTEPEjJIUWXEdhBoQWErSce4on5a7
6eBfkKEQxIYAdC1iO2uoKEtEpMDvFWoIIVjdlVTFiJ8Np9uvv7lPByr/0TJ+N5b7
fgOrzglWuxo=
=U/uh
-----END PGP SIGNATURE-----
Merge tag 'x86-boot-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 boot updates from Ingo Molnar:
"Misc updates:
- Add the initrdmem= boot option to specify an initrd embedded in RAM
(flash most likely)
- Sanitize the CS value earlier during boot, which also fixes SEV-ES
- Various fixes and smaller cleanups"
* tag 'x86-boot-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/boot: Correct relocation destination on old linkers
x86/boot/compressed/64: Switch to __KERNEL_CS after GDT is loaded
x86/boot: Fix -Wint-to-pointer-cast build warning
x86/boot: Add kstrtoul() from lib/
x86/tboot: Mark tboot static
x86/setup: Add an initrdmem= option to specify initrd physical address
- preliminary changes for RISC-V
- Add support for setting the resolution on the EFI framebuffer
- Simplify kernel image loading for arm64
- Move .bss into .data via the linker script instead of relying on symbol
annotations.
- Get rid of __pure getters to access global variables
- Clean up the config table matching arrays
- Rename pr_efi/pr_efi_err to efi_info/efi_err, and use them consistently
- Simplify and unify initrd loading
- Parse the builtin command line on x86 (if provided)
- Implement printk() support, including support for wide character strings
- Simplify GDT handling in early mixed mode thunking code
- Some other minor fixes and cleanups
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl7VANERHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iLnhAApADFVx2r/PmBTaLkxTILnyC0zg03kWne
Fs6K/npgK68M/Qz4OlXqVhirCHTVMDO4T/4hfckSe0HtzLiGtPeKwea7+ATpdeff
iH0k1xCOO9YoUAGKLpOwNPIzR3F2EEJy0vENF/v6KFODuBNJE8Xuq6GAFs9IKoxz
zUxFJw/QlKssr4GcxpgW5ODb2rwiP4znyDj/x6/oy81H+RPk+TuDrF/kmHmTQJVN
MLZsx8oXQUmgZabDd8xOzQh41KWy8CpF3XbrOA+zGiT8oiRjwTJ49Mo1p53Wm0Ba
xSxxXrPvJAT8OrZaeDCdppn0p0OLAl4fuTkAgURZKwAHJLgiERY0N0EFGoDYriVZ
qQuBTVuVa3Njg7IKdMyXjyKqX7xmEP+6Mck9j4bg8Q/ss4T4UzpkOxe7txCc/Bqw
lf3rzx8IvXxh5ep5H+rqcWfjfygdZrv6hJ0xVkwt1C43pVHWMXlrE+IJqvvQC2q4
KyEkNdGdFFeWiGVo829kuOIVXNFapcSe6G+Q9aCZSLa7LCi+bSNZmi8HzPhiQEr0
t/Z04DUDYgJ3mUfZuKJ7i1FHbuYo7t1iDwx9txeIM2u1S8E68UmgwLHRg2xLjtyu
Rkib57mTTi6pO8cSJQVNaSh5F3h870nhrK62kfCgn8c8B5v/C9q56Q0B3hlUIvQ4
R0T8188LdYk=
=XG4E
-----END PGP SIGNATURE-----
Merge tag 'efi-core-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull EFI updates from Ingo Molnar:
"The EFI changes for this cycle are:
- preliminary changes for RISC-V
- Add support for setting the resolution on the EFI framebuffer
- Simplify kernel image loading for arm64
- Move .bss into .data via the linker script instead of relying on
symbol annotations.
- Get rid of __pure getters to access global variables
- Clean up the config table matching arrays
- Rename pr_efi/pr_efi_err to efi_info/efi_err, and use them
consistently
- Simplify and unify initrd loading
- Parse the builtin command line on x86 (if provided)
- Implement printk() support, including support for wide character
strings
- Simplify GDT handling in early mixed mode thunking code
- Some other minor fixes and cleanups"
* tag 'efi-core-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (79 commits)
efi/x86: Don't blow away existing initrd
efi/x86: Drop the special GDT for the EFI thunk
efi/libstub: Add missing prototype for PE/COFF entry point
efi/efivars: Add missing kobject_put() in sysfs entry creation error path
efi/libstub: Use pool allocation for the command line
efi/libstub: Don't parse overlong command lines
efi/libstub: Use snprintf with %ls to convert the command line
efi/libstub: Get the exact UTF-8 length
efi/libstub: Use %ls for filename
efi/libstub: Add UTF-8 decoding to efi_puts
efi/printf: Add support for wchar_t (UTF-16)
efi/gop: Add an option to list out the available GOP modes
efi/libstub: Add definitions for console input and events
efi/libstub: Implement printk-style logging
efi/printf: Turn vsprintf into vsnprintf
efi/printf: Abort on invalid format
efi/printf: Refactor code to consolidate padding and output
efi/printf: Handle null string input
efi/printf: Factor out integer argument retrieval
efi/printf: Factor out width/precision parsing
...
- Add AMD Fam17h RAPL support
- Introduce CAP_PERFMON to kernel and user space
- Add Zhaoxin CPU support
- Misc fixes and cleanups
Tooling changes:
perf record:
- Introduce --switch-output-event to use arbitrary events to be setup
and read from a side band thread and, when they take place a signal
be sent to the main 'perf record' thread, reusing the --switch-output
code to take perf.data snapshots from the --overwrite ring buffer, e.g.:
# perf record --overwrite -e sched:* \
--switch-output-event syscalls:*connect* \
workload
will take perf.data.YYYYMMDDHHMMSS snapshots up to around the
connect syscalls.
- Add --num-synthesize-threads option to control degree of parallelism of the
synthesize_mmap() code which is scanning /proc/PID/task/PID/maps and can be
time consuming. This mimics pre-existing behaviour in 'perf top'.
perf bench:
- Add a multi-threaded synthesize benchmark.
- Add kallsyms parsing benchmark.
Intel PT support:
- Stitch LBR records from multiple samples to get deeper backtraces,
there are caveats, see the csets for details.
- Allow using Intel PT to synthesize callchains for regular events.
- Add support for synthesizing branch stacks for regular events (cycles,
instructions, etc) from Intel PT data.
Misc changes:
- Updated perf vendor events for power9 and Coresight.
- Add flamegraph.py script via 'perf flamegraph'
- Misc other changes, fixes and cleanups - see the Git log for details.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl7VJAcRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hAYw/8DFtzGkMaaWkrDSj62LXtWQiqr1l01ZFt
9GzV4aN4/go+K4BQtsQN8cUjOkRHFnOryLuD9LfSBfqsdjuiyTynV/cJkeUGQBck
TT/GgWf3XKJzTUBRQRk367Gbqs9UKwBP8CdFhOXcNzGEQpjhbwwIDPmem94U4L1N
XLsysgC45ejWL1kMTZKmk6hDIidlFeDg9j70WDPX1nNfCeisk25rxwTpdgvjsjcj
3RzPRt2EGS+IkuF4QSCT5leYSGaCpVDHCQrVpHj57UoADfWAyC71uopTLG4OgYSx
PVd9gvloMeeqWmroirIxM67rMd/TBTfVekNolhnQDjqp60Huxm+gGUYmhsyjNqdx
Pb8HRZCBAudei9Ue4jNMfhCRK2Ug1oL5wNvN1xcSteAqrwMlwBMGHWns6l12x0ks
BxYhyLvfREvnKijXc1o8D5paRgqohJgfnHlrUZeacyaw5hQCbiVRpwg0T1mWAF53
u9hfWLY0Oy+Qs2C7EInNsWSYXRw8oPQNTFVx2I968GZqsEn4DC6Pt3ovWrDKIDnz
ugoZJQkJ3/O8stYSMiyENehdWlo575NkapCTDwhLWnYztrw4skqqHE8ighU/e8ug
o/Kx7ANWN9OjjjQpq2GVUeT0jCaFO+OMiGMNEkKoniYgYjogt3Gw5PeedBMtY07p
OcWTiQZamjU=
=i27M
-----END PGP SIGNATURE-----
Merge tag 'perf-core-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar:
"Kernel side changes:
- Add AMD Fam17h RAPL support
- Introduce CAP_PERFMON to kernel and user space
- Add Zhaoxin CPU support
- Misc fixes and cleanups
Tooling changes:
- perf record:
Introduce '--switch-output-event' to use arbitrary events to be
setup and read from a side band thread and, when they take place a
signal be sent to the main 'perf record' thread, reusing the core
for '--switch-output' to take perf.data snapshots from the ring
buffer used for '--overwrite', e.g.:
# perf record --overwrite -e sched:* \
--switch-output-event syscalls:*connect* \
workload
will take perf.data.YYYYMMDDHHMMSS snapshots up to around the
connect syscalls.
Add '--num-synthesize-threads' option to control degree of
parallelism of the synthesize_mmap() code which is scanning
/proc/PID/task/PID/maps and can be time consuming. This mimics
pre-existing behaviour in 'perf top'.
- perf bench:
Add a multi-threaded synthesize benchmark and kallsyms parsing
benchmark.
- Intel PT support:
Stitch LBR records from multiple samples to get deeper backtraces,
there are caveats, see the csets for details.
Allow using Intel PT to synthesize callchains for regular events.
Add support for synthesizing branch stacks for regular events
(cycles, instructions, etc) from Intel PT data.
Misc changes:
- Updated perf vendor events for power9 and Coresight.
- Add flamegraph.py script via 'perf flamegraph'
- Misc other changes, fixes and cleanups - see the Git log for details
Also, since over the last couple of years perf tooling has matured and
decoupled from the kernel perf changes to a large degree, going
forward Arnaldo is going to send perf tooling changes via direct pull
requests"
* tag 'perf-core-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (163 commits)
perf/x86/rapl: Add AMD Fam17h RAPL support
perf/x86/rapl: Make perf_probe_msr() more robust and flexible
perf/x86/rapl: Flip logic on default events visibility
perf/x86/rapl: Refactor to share the RAPL code between Intel and AMD CPUs
perf/x86/rapl: Move RAPL support to common x86 code
perf/core: Replace zero-length array with flexible-array
perf/x86: Replace zero-length array with flexible-array
perf/x86/intel: Add more available bits for OFFCORE_RESPONSE of Intel Tremont
perf/x86/rapl: Add Ice Lake RAPL support
perf flamegraph: Use /bin/bash for report and record scripts
perf cs-etm: Move definition of 'traceid_list' global variable from header file
libsymbols kallsyms: Move hex2u64 out of header
libsymbols kallsyms: Parse using io api
perf bench: Add kallsyms parsing
perf: cs-etm: Update to build with latest opencsd version.
perf symbol: Fix kernel symbol address display
perf inject: Rename perf_evsel__*() operating on 'struct evsel *' to evsel__*()
perf annotate: Rename perf_evsel__*() operating on 'struct evsel *' to evsel__*()
perf trace: Rename perf_evsel__*() operating on 'struct evsel *' to evsel__*()
perf script: Rename perf_evsel__*() operating on 'struct evsel *' to evsel__*()
...
- Speed up objtool significantly, especially when there are large number of sections
- Improve objtool's understanding of special instructions such as IRET,
to reduce the number of annotations required
- Implement 'noinstr' validation
- Do baby steps for non-x86 objtool use
- Simplify/fix retpoline decoding
- Add vmlinux validation
- Improve documentation
- Fix various bugs and apply smaller cleanups
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl7VHvcRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1gEfBAAhvPWljUmfQsetYq4q9BdbuC4xPSQN9ra
e+2zu1MQaohkjAdFM1boNVhCCGKFUvlTEEw3GJR141Us6Y/ZRS8VIo70tmVSku6I
OwuR5i8SgEKwurr1SwLxrI05rovYWRLSaDIRTHn2CViPEjgriyFGRV8QKam3AYmI
dx47la3ELwuQR68nIdIMzDRt49oZVy+ZKW8Pjgjklzrd5KMYsPy7HPtraHUMeDg+
GdoC7RresIt5AFiDiIJzKTT/jROI7KuHFnM6blluKHoKenWhYBFCz3sd6IvCdQWX
JGy+KKY6H+YDMSpgc4FRP56M3GI0hX14oCd7L72epSLfOuzPr9Tmf6wfyQ8f50Je
LGLD47tyltIcQR9H85YdR8UQspkjSW6xcql4ByCPTEqp0UzSGTsVntvsHzwsgz6A
Csh3s+DVdv0rk5ZjMCu8STA2oErpehJm7fmugt2oLx+nsCNCBUI25lilw5JGeq5c
+cO0IRxRxHPeRvMWvItTjbixVAHOHYlB00ilDbvsm+GnTJgu/5cMqpXdLvfXI2Rr
nl360bSS3t3J4w5rX0mXw4x24vjQmVrA69jU+oo8RSHje2X8Y4Q7sFHNjmN0YAI3
Re8aP6HSLQjioJxGz9aISlrxmPOXe0CMp8JE586SREVgmS/olXtidMgi7l12uZ2B
cRdtNYcn31U=
=dbCU
-----END PGP SIGNATURE-----
Merge tag 'objtool-core-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool updates from Ingo Molnar:
"There are a lot of objtool changes in this cycle, all across the map:
- Speed up objtool significantly, especially when there are large
number of sections
- Improve objtool's understanding of special instructions such as
IRET, to reduce the number of annotations required
- Implement 'noinstr' validation
- Do baby steps for non-x86 objtool use
- Simplify/fix retpoline decoding
- Add vmlinux validation
- Improve documentation
- Fix various bugs and apply smaller cleanups"
* tag 'objtool-core-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits)
objtool: Enable compilation of objtool for all architectures
objtool: Move struct objtool_file into arch-independent header
objtool: Exit successfully when requesting help
objtool: Add check_kcov_mode() to the uaccess safelist
samples/ftrace: Fix asm function ELF annotations
objtool: optimize add_dead_ends for split sections
objtool: use gelf_getsymshndx to handle >64k sections
objtool: Allow no-op CFI ops in alternatives
x86/retpoline: Fix retpoline unwind
x86: Change {JMP,CALL}_NOSPEC argument
x86: Simplify retpoline declaration
x86/speculation: Change FILL_RETURN_BUFFER to work with objtool
objtool: Add support for intra-function calls
objtool: Move the IRET hack into the arch decoder
objtool: Remove INSN_STACK
objtool: Make handle_insn_ops() unconditional
objtool: Rework allocating stack_ops on decode
objtool: UNWIND_HINT_RET_OFFSET should not check registers
objtool: is_fentry_call() crashes if call has no destination
x86,smap: Fix smap_{save,restore}() alternatives
...
- RCU-tasks update, including addition of RCU Tasks Trace for
BPF use and TASKS_RUDE_RCU
- kfree_rcu() updates.
- Remove scheduler locking restriction
- RCU CPU stall warning updates.
- Torture-test updates.
- Miscellaneous fixes and other updates.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl7U/r0RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hSNxAAirKhPGBoLI9DW1qde4OFhZg+BlIpS+LD
IE/0eGB8hGwhb1793RGbzIJfSnRQpSOPxWbWc6DJZ4Zpi5/ZbVkiPKsuXpM1xGxs
kuBCTOhWy1/p3iCZ1JH/JCrCAdWGZkIzEoaV7ipnHtV/+UrRbCWH5PB7R0fYvcbI
q5bUcWJyEp/bYMxQn8DhAih6SLPHx+F9qaGAqqloLSHstTYG2HkBhBGKnqcd/Jex
twkLK53poCkeP/c08V1dyagU2IRWj2jGB1NjYh/Ocm+Sn/vru15CVGspjVjqO5FF
oq07lad357ddMsZmKoM2F5DhXbOh95A+EqF9VDvIzCvfGMUgqYI1oxWF4eycsGhg
/aYJgYuN23YeEe2DkDzJB67GvBOwl4WgdoFaxKRzOiCSfrhkM8KqM4G9Fz1JIepG
abRJCF85iGcLslU9DkrShQiDsd/CRPzu/jz6ybK0I2II2pICo6QRf76T7TdOvKnK
yXwC6OdL7/dwOht20uT6XfnDXMCWI4MutiUrb8/C1DbaihwEaI2denr3YYL+IwrB
B38CdP6sfKZ5UFxKh0xb+sOzWrw0KA+ThSAXeJhz3tKdxdyB6nkaw3J9lFg8oi20
XGeAujjtjMZG5cxt2H+wO9kZY0RRau/nTqNtmmRrCobd5yJjHHPHH8trEd0twZ9A
X5Wjh11lv3E=
=Yisx
-----END PGP SIGNATURE-----
Merge tag 'core-rcu-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
"The RCU updates for this cycle were:
- RCU-tasks update, including addition of RCU Tasks Trace for BPF use
and TASKS_RUDE_RCU
- kfree_rcu() updates.
- Remove scheduler locking restriction
- RCU CPU stall warning updates.
- Torture-test updates.
- Miscellaneous fixes and other updates"
* tag 'core-rcu-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (103 commits)
rcu: Allow for smp_call_function() running callbacks from idle
rcu: Provide rcu_irq_exit_check_preempt()
rcu: Abstract out rcu_irq_enter_check_tick() from rcu_nmi_enter()
rcu: Provide __rcu_is_watching()
rcu: Provide rcu_irq_exit_preempt()
rcu: Make RCU IRQ enter/exit functions rely on in_nmi()
rcu/tree: Mark the idle relevant functions noinstr
x86: Replace ist_enter() with nmi_enter()
x86/mce: Send #MC singal from task work
x86/entry: Get rid of ist_begin/end_non_atomic()
sched,rcu,tracing: Avoid tracing before in_nmi() is correct
sh/ftrace: Move arch_ftrace_nmi_{enter,exit} into nmi exception
lockdep: Always inline lockdep_{off,on}()
hardirq/nmi: Allow nested nmi_enter()
arm64: Prepare arch_nmi_enter() for recursion
printk: Disallow instrumenting print_nmi_enter()
printk: Prepare for nested printk_nmi_enter()
rcutorture: Convert ULONG_CMP_LT() to time_before()
torture: Add a --kasan argument
torture: Save a few lines by using config_override_param initially
...
their width from CPUID. As a prerequsite, streamline and unify the CPUID
detection of the respective resource control attributes. By Reinette
Chatre.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl7VNQYACgkQEsHwGGHe
VUqFgw//Qr6Ot21wtZAgVSCOPJ7rWH4aR6Bbq9YN0/kUGk2PbAjfFK3aMekHd4u6
+cTc1eroOiWc5mXurbJ6gfmDgTYHj09DII/qXIi9wxWHPyAZ3cBltGXG9A86lIHZ
hyA6l3Z+RzhnIKoATbXzaEy1nEoGMgCsezuGmN2d1KM2Fl8dmYHT+88iMDSazNb3
D51+HHvUF03+h41A11mw7Omknycpk3wOmNX0A+t55EExW8xrYjjenY+suWvhDoGj
+uqDVhYjxvVziI0lsAfcDfN3R3MuPMBXJXtwf9qaZODGs4ttTg/lyHRcTqGBVwrf
xGuDw0qRLvDKy9UC09H5F7ArScx9X8bOu9NFjNA2AZyLLdhVpTa5AuX1T7VroRkD
rcDx++EEjVEX36wBoGbwrfZTcaL3er8sPVZiXG1dDc5/GqWfP+abx4G1JPG9xRYH
V4oghEsBAJXRfGt07PTNSTqDWu/dQWmMUIVfGWX97l+5ED+HlL24MVoThkkH6f2M
7uAj8IRsbgzn207nZD8DNorARcQVsK2VvzgZzbqa0VsArt57Xk8DSZpfsqw8B7OJ
OwuA/0S6nFQX9g6C0eLR68WPwv6YrLjKMEO7KJukNbaN/pcVGKXaJ/dSTwpsNQbC
JZeAuc7DSYJ3Iv+xh64DvpL3hvb46J1UMRCWoVFw3ZQcnGV6t0E=
=JeAm
-----END PGP SIGNATURE-----
Merge tag 'x86_cache_updates_for_5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cache resource control updates from Borislav Petkov:
"Add support for wider Memory Bandwidth Monitoring counters by querying
their width from CPUID.
As a prerequsite for that, streamline and unify the CPUID detection of
the respective resource control attributes.
By Reinette Chatre"
* tag 'x86_cache_updates_for_5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/resctrl: Support wider MBM counters
x86/resctrl: Support CPUID enumeration of MBM counter width
x86/resctrl: Maintain MBM counter width per resource
x86/resctrl: Query LLC monitoring properties once during boot
x86/resctrl: Remove unnecessary RMID checks
x86/cpu: Move resctrl CPUID code to resctrl/
x86/resctrl: Rename asm/resctrl_sched.h to asm/resctrl.h
value from stop_machine(), from Mihai Carabas.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl7UxoEACgkQEsHwGGHe
VUrQdxAArdz2s/EtyhFeacKGp6nFG8wHyMVFbwJ/6ZzKcaX6zVc4PI/5O1837ls7
rFMZt5r21kxfLB3/wMAOZGI+ZP7i6IXzcwBI5/BmS+YK+t3PqWeT+iTNo8hr9tI/
d8Xly4sE/CIrPZduZPnNVsrRdzqKDs/KMnnPTxZWVNDWMVOKHJZtJ2Ty8eHZsgwl
b4yBL1JiZHELSb9SrMhZfortogB2eSUaFABWYJMhGJ8XHQ6AZ+A3EB+he/9Zu3Wu
Giz4LvnhCGJyhTLaDHRUhMfLHo1knl6LNS6QNqVSP82TKRlX3AVeDnHST968BeTr
ronLTvOVkkZcpvk5ukeSqcBFhxiio9R1rUbkfZlYPt63m/6uWCiMzzOGXB+JTtYc
5of95CXehYj41XlQVkQtJJmoysYdt7JJw0w5+Cr3Uuov/RKOEiCdrgemOxOmIcM+
YJ8m+lTn95+8PXFjg/kvweZA7rXr1HcPhfmd9tCMha2k6b1MbdaMT3xb+m1vGXD/
BRojkuqf7OK19T/Owcum6A/oBmjuNjPZPL5HapQ9ZbMz6AZ3InRmaU+8EvQLIer7
iimQYWzTTdlZsresJh2+itPMf1EVyHVRnzFlx/N1BMhAxpR2aYXwGA5WKxk10p7U
80iejJntiNwXJmCHXXiQ55Dyii0vZykJv2FbGjLF4xUUGB/zxW8=
=g4rq
-----END PGP SIGNATURE-----
Merge tag 'x86_microcode_for_5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 microcode update from Borislav Petkov:
"A single fix for late microcode loading to handle the correct return
value from stop_machine(), from Mihai Carabas"
* tag 'x86_microcode_for_5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/microcode: Fix return value for microcode late loading
Pull crypto updates from Herbert Xu:
"API:
- Introduce crypto_shash_tfm_digest() and use it wherever possible.
- Fix use-after-free and race in crypto_spawn_alg.
- Add support for parallel and batch requests to crypto_engine.
Algorithms:
- Update jitter RNG for SP800-90B compliance.
- Always use jitter RNG as seed in drbg.
Drivers:
- Add Arm CryptoCell driver cctrng.
- Add support for SEV-ES to the PSP driver in ccp"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (114 commits)
crypto: hisilicon - fix driver compatibility issue with different versions of devices
crypto: engine - do not requeue in case of fatal error
crypto: cavium/nitrox - Fix a typo in a comment
crypto: hisilicon/qm - change debugfs file name from qm_regs to regs
crypto: hisilicon/qm - add DebugFS for xQC and xQE dump
crypto: hisilicon/zip - add debugfs for Hisilicon ZIP
crypto: hisilicon/hpre - add debugfs for Hisilicon HPRE
crypto: hisilicon/sec2 - add debugfs for Hisilicon SEC
crypto: hisilicon/qm - add debugfs to the QM state machine
crypto: hisilicon/qm - add debugfs for QM
crypto: stm32/crc32 - protect from concurrent accesses
crypto: stm32/crc32 - don't sleep in runtime pm
crypto: stm32/crc32 - fix multi-instance
crypto: stm32/crc32 - fix run-time self test issue.
crypto: stm32/crc32 - fix ext4 chksum BUG_ON()
crypto: hisilicon/zip - Use temporary sqe when doing work
crypto: hisilicon - add device error report through abnormal irq
crypto: hisilicon - remove codes of directly report device errors through MSI
crypto: hisilicon - QM memory management optimization
crypto: hisilicon - unify initial value assignment into QM
...
There is another mode for the synthetic debugger which uses hypercalls
to send/recv network data instead of the MSR interface.
This interface is much slower and less recommended since you might get
a lot of VMExits while KDVM polling for new packets to recv, rather
than simply checking the pending page to see if there is data avialble
and then request.
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Jon Doron <arilou@gmail.com>
Message-Id: <20200529134543.1127440-6-arilou@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Microsoft's kdvm.dll dbgtransport module does not respect the hypercall
page and simply identifies the CPU being used (AMD/Intel) and according
to it simply makes hypercalls with the relevant instruction
(vmmcall/vmcall respectively).
The relevant function in kdvm is KdHvConnectHypervisor which first checks
if the hypercall page has been enabled via HV_X64_MSR_HYPERCALL_ENABLE,
and in case it was not it simply sets the HV_X64_MSR_GUEST_OS_ID to
0x1000101010001 which means:
build_number = 0x0001
service_version = 0x01
minor_version = 0x01
major_version = 0x01
os_id = 0x00 (Undefined)
vendor_id = 1 (Microsoft)
os_type = 0 (A value of 0 indicates a proprietary, closed source OS)
and starts issuing the hypercall without setting the hypercall page.
To resolve this issue simply enable hypercalls also if the guest_os_id
is not 0.
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Jon Doron <arilou@gmail.com>
Message-Id: <20200529134543.1127440-5-arilou@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add support for Hyper-V synthetic debugger (syndbg) interface.
The syndbg interface is using MSRs to emulate a way to send/recv packets
data.
The debug transport dll (kdvm/kdnet) will identify if Hyper-V is enabled
and if it supports the synthetic debugger interface it will attempt to
use it, instead of trying to initialize a network adapter.
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Jon Doron <arilou@gmail.com>
Message-Id: <20200529134543.1127440-4-arilou@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Hyper-V synthetic debugger has two modes, one that uses MSRs and
the other that use Hypercalls.
Add all the required definitions to both types of synthetic debugger
interface.
Some of the required new CPUIDs and MSRs are not documented in the TLFS
so they are in hyperv.h instead.
The reason they are not documented is because they are subjected to be
removed in future versions of Windows.
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Jon Doron <arilou@gmail.com>
Message-Id: <20200529134543.1127440-3-arilou@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When a nested VM with a VMX-preemption timer is migrated, verify that the
nested VM and its parent VM observe the VMX-preemption timer exit close to
the original expiration deadline.
Signed-off-by: Makarand Sonare <makarandsonare@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <20200526215107.205814-3-makarandsonare@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add new field to hold preemption timer expiration deadline
appended to struct kvm_vmx_nested_state_hdr. This is to prevent
the first VM-Enter after migration from incorrectly restarting the timer
with the full timer value instead of partially decayed timer value.
KVM_SET_NESTED_STATE restarts timer using migrated state regardless
of whether L1 sets VM_EXIT_SAVE_VMX_PREEMPTION_TIMER.
Fixes: cf8b84f48a ("kvm: nVMX: Prepare for checkpointing L2 state")
Signed-off-by: Peter Shier <pshier@google.com>
Signed-off-by: Makarand Sonare <makarandsonare@google.com>
Message-Id: <20200526215107.205814-2-makarandsonare@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Intel CPUs have a new alternative MSR range (starting from MSR_IA32_PMC0)
for GP counters that allows writing the full counter width. Enable this
range from a new capability bit (IA32_PERF_CAPABILITIES.FW_WRITE[bit 13]).
The guest would query CPUID to get the counter width, and sign extends
the counter values as needed. The traditional MSRs always limit to 32bit,
even though the counter internally is larger (48 or 57 bits).
When the new capability is set, use the alternative range which do not
have these restrictions. This lowers the overhead of perf stat slightly
because it has to do less interrupts to accumulate the counter value.
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Message-Id: <20200529074347.124619-3-like.xu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Change kvm_pmu_get_msr() to get the msr_data struct, as the host_initiated
field from the struct could be used by get_msr. This also makes this API
consistent with kvm_pmu_set_msr. No functional changes.
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Message-Id: <20200529074347.124619-2-like.xu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Introduce new capability to indicate that KVM supports interrupt based
delivery of 'page ready' APF events. This includes support for both
MSR_KVM_ASYNC_PF_INT and MSR_KVM_ASYNC_PF_ACK.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20200525144125.143875-8-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If two page ready notifications happen back to back the second one is not
delivered and the only mechanism we currently have is
kvm_check_async_pf_completion() check in vcpu_run() loop. The check will
only be performed with the next vmexit when it happens and in some cases
it may take a while. With interrupt based page ready notification delivery
the situation is even worse: unlike exceptions, interrupts are not handled
immediately so we must check if the slot is empty. This is slow and
unnecessary. Introduce dedicated MSR_KVM_ASYNC_PF_ACK MSR to communicate
the fact that the slot is free and host should check its notification
queue. Mandate using it for interrupt based 'page ready' APF event
delivery.
As kvm_check_async_pf_completion() is going away from vcpu_run() we need
a way to communicate the fact that vcpu->async_pf.done queue has
transitioned from empty to non-empty state. Introduce
kvm_arch_async_page_present_queued() and KVM_REQ_APF_READY to do the job.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20200525144125.143875-7-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Concerns were expressed around APF delivery via synthetic #PF exception as
in some cases such delivery may collide with real page fault. For 'page
ready' notifications we can easily switch to using an interrupt instead.
Introduce new MSR_KVM_ASYNC_PF_INT mechanism and deprecate the legacy one.
One notable difference between the two mechanisms is that interrupt may not
get handled immediately so whenever we would like to deliver next event
(regardless of its type) we must be sure the guest had read and cleared
previous event in the slot.
While on it, get rid on 'type 1/type 2' names for APF events in the
documentation as they are causing confusion. Use 'page not present'
and 'page ready' everywhere instead.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20200525144125.143875-6-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
An innocent reader of the following x86 KVM code:
bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu)
{
if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED))
return true;
...
may get very confused: if APF mechanism is not enabled, why do we report
that we 'can inject async page present'? In reality, upon injection
kvm_arch_async_page_present() will check the same condition again and,
in case APF is disabled, will just drop the item. This is fine as the
guest which deliberately disabled APF doesn't expect to get any APF
notifications.
Rename kvm_arch_can_inject_async_page_present() to
kvm_arch_can_dequeue_async_page_present() to make it clear what we are
checking: if the item can be dequeued (meaning either injected or just
dropped).
On s390 kvm_arch_can_inject_async_page_present() always returns 'true' so
the rename doesn't matter much.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20200525144125.143875-4-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently, APF mechanism relies on the #PF abuse where the token is being
passed through CR2. If we switch to using interrupts to deliver page-ready
notifications we need a different way to pass the data. Extent the existing
'struct kvm_vcpu_pv_apf_data' with token information for page-ready
notifications.
While on it, rename 'reason' to 'flags'. This doesn't change the semantics
as we only have reasons '1' and '2' and these can be treated as bit flags
but KVM_PV_REASON_PAGE_READY is going away with interrupt based delivery
making 'reason' name misleading.
The newly introduced apf_put_user_ready() temporary puts both flags and
token information, this will be changed to put token only when we switch
to interrupt based notifications.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20200525144125.143875-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit 9a6e7c3981 (""KVM: async_pf: Fix #DF due to inject "Page not
Present" and "Page Ready" exceptions simultaneously") added a protection
against 'page ready' notification coming before 'page not present' is
delivered. This situation seems to be impossible since commit 2a266f2355
("KVM MMU: check pending exception before injecting APF) which added
'vcpu->arch.exception.pending' check to kvm_can_do_async_pf.
On x86, kvm_arch_async_page_present() has only one call site:
kvm_check_async_pf_completion() loop and we only enter the loop when
kvm_arch_can_inject_async_page_present(vcpu) which when async pf msr
is enabled, translates into kvm_can_do_async_pf().
There is also one problem with the cancellation mechanism. We don't seem
to check that the 'page not present' notification we're canceling matches
the 'page ready' notification so in theory, we may erroneously drop two
valid events.
Revert the commit.
Reviewed-by: Gavin Shan <gshan@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20200525144125.143875-2-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:
struct foo {
int stuff;
struct boo array[];
};
By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.
Also, notice that, dynamic memory allocations won't be affected by
this change:
"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
sizeof(flexible-array-member) triggers a warning because flexible array
members have incomplete type[1]. There are some instances of code in
which the sizeof operator is being incorrectly/erroneously applied to
zero-length arrays and the result is zero. Such instances may be hiding
some bugs. So, this work (flexible-array member conversions) will also
help to get completely rid of those sorts of issues.
This issue was found with the help of Coccinelle.
[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Message-Id: <20200507185618.GA14831@embeddedor>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Similar to VMX, the state that is captured through the currently available
IOCTLs is a mix of L1 and L2 state, dependent on whether the L2 guest was
running at the moment when the process was interrupted to save its state.
In particular, the SVM-specific state for nested virtualization includes
the L1 saved state (including the interrupt flag), the cached L2 controls,
and the GIF.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This allows fetching the registers from the hsave area when setting
up the NPT shadow MMU, and is needed for KVM_SET_NESTED_STATE (which
runs long after the CR0, CR4 and EFER values in vcpu have been switched
to hold L2 guest state).
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
According to the AMD manual, the effect of turning off EFER.SVME while a
guest is running is undefined. We make it leave guest mode immediately,
similar to the effect of clearing the VMX bit in MSR_IA32_FEAT_CTL.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The authoritative state does not come from the VMCB once in guest mode,
but KVM_SET_NESTED_STATE can still perform checks on L1's provided SVM
controls because we get them from userspace.
Therefore, split out a function to do them.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The L1 flags can be found in the save area of svm->nested.hsave, fish
it from there so that there is one fewer thing to migrate.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now that the int_ctl field is stored in svm->nested.ctl.int_ctl, we can
use it instead of vcpu->arch.hflags to check whether L2 is running
in V_INTR_MASKING mode.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This bit was added to nested VMX right when nested_run_pending was
introduced, but it is not yet there in nSVM. Since we can have pending
events that L0 injected directly into L2 on vmentry, we have to transfer
them into L1's queue.
For this to work, one important change is required: svm_complete_interrupts
(which clears the "injected" fields from the previous VMRUN, and updates them
from svm->vmcb's EXITINTINFO) must be placed before we inject the vmexit.
This is not too scary though; VMX even does it in vmx_vcpu_run.
While at it, the nested_vmexit_inject tracepoint is moved towards the
end of nested_svm_vmexit. This ensures that the synthesized EXITINTINFO
is visible in the trace.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There is only one GIF flag for the whole processor, so make sure it is not clobbered
when switching to L2 (in which case we also have to include the V_GIF_ENABLE_MASK,
lest we confuse enable_gif/disable_gif/gif_set). When going back, L1 could in
theory have entered L2 without issuing a CLGI so make sure the svm_set_gif is
done last, after svm->vmcb->control.int_ctl has been copied back from hsave.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Extract the code that is needed to implement CLGI and STGI,
so that we can run it from VMRUN and vmexit (and in the future,
KVM_SET_NESTED_STATE). Skip the request for KVM_REQ_EVENT unless needed,
subsuming the evaluate_pending_interrupts optimization that is found
in enter_svm_guest_mode.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_vcpu_apicv_active must be false when nested virtualization is enabled,
so there is no need to check it in clgi_interception.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The control state changes on every L2->L0 vmexit, and we will have to
serialize it in the nested state. So keep it up to date in svm->nested.ctl
and just copy them back to the nested VMCB in nested_svm_vmexit.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Restore the INT_CTL value from the guest's VMCB once we've stopped using
it, so that virtual interrupts can be injected as requested by L1.
V_TPR is up-to-date however, and it can change if the guest writes to CR8,
so keep it.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In preparation for nested SVM save/restore, store all data that matters
from the VMCB control area into svm->nested. It will then become part
of the nested SVM state that is saved by KVM_SET_NESTED_STATE and
restored by KVM_GET_NESTED_STATE, just like the cached vmcs12 for nVMX.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Allow placing the VMCB structs on the stack or in other structs without
wasting too much space. Add BUILD_BUG_ON as a quick safeguard against typos.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use l1_tsc_offset to compute svm->vcpu.arch.tsc_offset and
svm->vmcb->control.tsc_offset, instead of relying on hsave.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Split out filling svm->vmcb.save and svm->vmcb.control before VMRUN.
Only the latter will be useful when restoring nested SVM state.
This patch introduces no semantic change, so the MMU setup is still
done in nested_prepare_vmcb_save. The next patch will clean up things.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When restoring SVM nested state, the control state cache in svm->nested
will have to be filled, but the save state will not have to be moved
into svm->vmcb. Therefore, pull the code that handles the control area
into a separate function.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Unmapping the nested VMCB in enter_svm_guest_mode is a bit of a wart,
since the map argument is not used elsewhere in the function. There are
just two callers, and those are also the place where kvm_vcpu_map is
called, so it is cleaner to unmap there.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
xdp_umem.c had overlapping changes between the 64-bit math fix
for the calculation of npgs and the removal of the zerocopy
memory type which got rid of the chunk_size_nohdr member.
The mlx5 Kconfig conflict is a case where we just take the
net-next copy of the Kconfig entry dependency as it takes on
the ESWITCH dependency by one level of indirection which is
what the 'net' conflicting change is trying to ensure.
Signed-off-by: David S. Miller <davem@davemloft.net>
- Prevent a memory leak in ioperm which was caused by the stupid
assumption that the exit cleanup is always called for current, which is
not the case when fork fails after taking a reference on the ioperm
bitmap.
- Fix an arithmething overflow in the DMA code on 32bit systems
- Fill gaps in the xstate copy with defaults instead of leaving them
uninitialized
- Revert: o"Make __X32_SYSCALL_BIT be unsigned long" as it turned out
that existing user space fails to build.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl7Tt8YTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYobtTEACukhGsuivgiTwltWuHcATqrcNbgHSu
nnhuQrjJ8KJiF5O60nDztPAVzxD+Ww2tzuDnD1BLFDI9cEA5oPhzXf7kUuJvrYUK
INY+OALPPpw2iWjmygIsEyw3Pzmnm6peRA4h5UZSZdFxdROGGwBeGYNxowuVWFiH
X7Fa1J4QxTI7e2X3psDVz94bOnVTPRPAR2bNpX8K8Qs+Wn1FFO92LFU04EvJTCHe
JdN73VAS+0o0qPlPMewiuyfxaHexc8eJySMdOiysPnGRy+vagyyMPOV2Kg0DD6bp
caDxCXNjIxXlRExV6F75s8hnl42DwXzLSzY/G7L/HVJ5r3voqcREYtXHgfenl7Jg
8o6tEi+qFduPJ6SuRjfjPBDBF4wJvcjgmCwJaPJbMkrg8p5jH9Xg35egmEMo9cF8
JQa2RzWJTR9XUjuPAuHJZR6f9jnle01PCznmw7Mavoed82udW1Lo32+QnvWsx6Qq
4uuV38FqK3lsVCfFjyZir9OB9DGeuT/NETs3WJuGW5QUnC1mqfvIYipL3BkxNMKP
IBB7n5X2iCJ545JkydepXF2I+b/i8XhNcIwYMVoSbZzBKccwCZ7zxHFNj6YAWG+M
TN77x/+lw5zbnxhL3YzK+fgPNLio/By4Zcpmq6uppaf9Ip67SJGVq22Ef3S0w8vG
X1inh1zqLX9hsQ==
=DmSb
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2020-05-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"A pile of x86 fixes:
- Prevent a memory leak in ioperm which was caused by the stupid
assumption that the exit cleanup is always called for current,
which is not the case when fork fails after taking a reference on
the ioperm bitmap.
- Fix an arithmething overflow in the DMA code on 32bit systems
- Fill gaps in the xstate copy with defaults instead of leaving them
uninitialized
- Revert: "Make __X32_SYSCALL_BIT be unsigned long" as it turned out
that existing user space fails to build"
* tag 'x86-urgent-2020-05-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/ioperm: Prevent a memory leak when fork fails
x86/dma: Fix max PFN arithmetic overflow on 32 bit systems
copy_xstate_to_kernel(): don't leave parts of destination uninitialized
x86/syscalls: Revert "x86/syscalls: Make __X32_SYSCALL_BIT be unsigned long"
now that can be done conveniently - all non-trivial cases have
_HAVE_ARCH_COPY_AND_CSUM_FROM_USER defined, so the fallback in
net/checksum.h is used only for dummy (copy_from_user, then
csum_partial) implementation. Allowing us to get rid of all
dummy instances, both of csum_and_copy_from_user() and
csum_partial_copy_from_user().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... rather than messing with the wrapper. As a side effect,
32bit variant gets access_ok() into it and can be switched to
user_access_begin()/user_access_end()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We already have stac/clac pair around the calls of csum_partial_copy_generic().
Stretch that area back, so that it covers the preceding loop (and convert
the loop body from __{get,put}_user() to unsafe_{get,put}_user()).
That brings the beginning of the areas to the earlier access_ok(),
which allows to convert them into user_access_{begin,end}() ones.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In the copy_process() routine called by _do_fork(), failure to allocate
a PID (or further along in the function) will trigger an invocation to
exit_thread(). This is done to clean up from an earlier call to
copy_thread_tls(). Naturally, the child task is passed into exit_thread(),
however during the process, io_bitmap_exit() nullifies the parent's
io_bitmap rather than the child's.
As copy_thread_tls() has been called ahead of the failure, the reference
count on the calling thread's io_bitmap is incremented as we would expect.
However, io_bitmap_exit() doesn't accept any arguments, and thus assumes
it should trash the current thread's io_bitmap reference rather than the
child's. This is pretty sneaky in practice, because in all instances but
this one, exit_thread() is called with respect to the current task and
everything works out.
A determined attacker can issue an appropriate ioctl (i.e. KDENABIO) to
get a bitmap allocated, and force a clone3() syscall to fail by passing
in a zeroed clone_args structure. The kernel handles the erroneous struct
and the buggy code path is followed, and even though the parent's reference
to the io_bitmap is trashed, the child still holds a reference and thus
the structure will never be freed.
Fix this by tweaking io_bitmap_exit() and its subroutines to accept a
task_struct argument which to operate on.
Fixes: ea5f1cd7ab ("x86/ioperm: Remove bitmap if all permissions dropped")
Signed-off-by: Jay Lang <jaytlang@mit.edu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable#@vger.kernel.org
Link: https://lkml.kernel.org/r/20200524162742.253727-1-jaytlang@mit.edu
Even though the x86 ticket spinlock code has been removed with
cfd8983f03 ("x86, locking/spinlocks: Remove ticket (spin)lock implementation")
a while ago, there are still some ticket spinlock specific macros and
types left in the asm/spinlock_types.h header file that are no longer
used. Remove those as well to avoid confusion.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200526122014.25241-1-longman@redhat.com
Icelake microserver CPU supports split lock detection while it doesn't
have the split lock enumeration bit in IA32_CORE_CAPABILITIES. Tigerlake
CPUs do enumerate the MSR.
[ bp: Merge the two model-adding patches into one. ]
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/1588290395-2677-1-git-send-email-fenghua.yu@intel.com
The intermediate result of the old term (4UL * 1024 * 1024 * 1024) is
4 294 967 296 or 0x100000000 which is no problem on 64 bit systems.
The patch does not change the later overall result of 0x100000 for
MAX_DMA32_PFN (after it has been shifted by PAGE_SHIFT). The new
calculation yields the same result, but does not require 64 bit
arithmetic.
On 32 bit systems the old calculation suffers from an arithmetic
overflow in that intermediate term in braces: 4UL aka unsigned long int
is 4 byte wide and an arithmetic overflow happens (the 0x100000000 does
not fit in 4 bytes), the in braces result is truncated to zero, the
following right shift does not alter that, so MAX_DMA32_PFN evaluates to
0 on 32 bit systems.
That wrong value is a problem in a comparision against MAX_DMA32_PFN in
the init code for swiotlb in pci_swiotlb_detect_4gb() to decide if
swiotlb should be active. That comparison yields the opposite result,
when compiling on 32 bit systems.
This was not possible before
1b7e03ef75 ("x86, NUMA: Enable emulation on 32bit too")
when that MAX_DMA32_PFN was first made visible to x86_32 (and which
landed in v3.0).
In practice this wasn't a problem, unless CONFIG_SWIOTLB is active on
x86-32.
However if one has set CONFIG_IOMMU_INTEL, since
c5a5dc4cbb ("iommu/vt-d: Don't switch off swiotlb if bounce page is used")
there's a dependency on CONFIG_SWIOTLB, which was not necessarily
active before. That landed in v5.4, where we noticed it in the fli4l
Linux distribution. We have CONFIG_IOMMU_INTEL active on both 32 and 64
bit kernel configs there (I could not find out why, so let's just say
historical reasons).
The effect is at boot time 64 MiB (default size) were allocated for
bounce buffers now, which is a noticeable amount of memory on small
systems like pcengines ALIX 2D3 with 256 MiB memory, which are still
frequently used as home routers.
We noticed this effect when migrating from kernel v4.19 (LTS) to v5.4
(LTS) in fli4l and got that kernel messages for example:
Linux version 5.4.22 (buildroot@buildroot) (gcc version 7.3.0 (Buildroot 2018.02.8)) #1 SMP Mon Nov 26 23:40:00 CET 2018
…
Memory: 183484K/261756K available (4594K kernel code, 393K rwdata, 1660K rodata, 536K init, 456K bss , 78272K reserved, 0K cma-reserved, 0K highmem)
…
PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
software IO TLB: mapped [mem 0x0bb78000-0x0fb78000] (64MB)
The initial analysis and the suggested fix was done by user 'sourcejedi'
at stackoverflow and explicitly marked as GPLv2 for inclusion in the
Linux kernel:
https://unix.stackexchange.com/a/520525/50007
The new calculation, which does not suffer from that overflow, is the
same as for arch/mips now as suggested by Robin Murphy.
The fix was tested by fli4l users on round about two dozen different
systems, including both 32 and 64 bit archs, bare metal and virtualized
machines.
[ bp: Massage commit message. ]
Fixes: 1b7e03ef75 ("x86, NUMA: Enable emulation on 32bit too")
Reported-by: Alan Jenkins <alan.christopher.jenkins@gmail.com>
Suggested-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Alexander Dahl <post@lespocky.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: stable@vger.kernel.org
Link: https://unix.stackexchange.com/q/520065/50007
Link: https://web.nettworks.org/bugs/browse/FFL-2560
Link: https://lkml.kernel.org/r/20200526175749.20742-1-post@lespocky.de
The DISCONTIGMEM support was marked as deprecated in v5.2 and since there
were no complaints about it for almost 5 releases it can be completely
removed.
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20200223094322.15206-1-rppt@kernel.org
AMD's next generation of EPYC processors support the MPK (Memory
Protection Keys) feature. Update the dependency and documentation.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/159068199556.26992.17733929401377275140.stgit@naples-babu.amd.com
vmx_load_mmu_pgd is delaying the write of GUEST_CR3 to prepare_vmcs02 as
an optimization, but this is only correct before the nested vmentry.
If userspace is modifying CR3 with KVM_SET_SREGS after the VM has
already been put in guest mode, the value of CR3 will not be updated.
Remove the optimization, which almost never triggers anyway.
Fixes: 04f11ef458 ("KVM: nVMX: Always write vmcs02.GUEST_CR3 during nested VM-Enter")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
svm_load_mmu_pgd is delaying the write of GUEST_CR3 to prepare_vmcs02 as
an optimization, but this is only correct before the nested vmentry.
If userspace is modifying CR3 with KVM_SET_SREGS after the VM has
already been put in guest mode, the value of CR3 will not be updated.
Remove the optimization, which almost never triggers anyway.
This was was added in commit 689f3bf216 ("KVM: x86: unify callbacks
to load paging root", 2020-03-16) just to keep the two vendor-specific
modules closer, but we'll fix VMX too.
Fixes: 689f3bf216 ("KVM: x86: unify callbacks to load paging root")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The usual drill at this point, except there is no code to remove because this
case was not handled at all.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
All events now inject vmexits before vmentry rather than after vmexit. Therefore,
exit_required is not set anymore and we can remove it.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This allows exceptions injected by the emulator to be properly delivered
as vmexits. The code also becomes simpler, because we can just let all
L0-intercepted exceptions go through the usual path. In particular, our
emulation of the VMX #DB exit qualification is very much simplified,
because the vmexit injection path can use kvm_deliver_exception_payload
to update DR6.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In case an interrupt arrives after nested.check_events but before the
call to kvm_cpu_has_injectable_intr, we could end up enabling the interrupt
window even if the interrupt is actually going to be a vmexit. This is
useless rather than harmful, but it really complicates reasoning about
SVM's handling of the VINTR intercept. We'd like to never bother with
the VINTR intercept if V_INTR_MASKING=1 && INTERCEPT_INTR=1, because in
that case there is no interrupt window and we can just exit the nested
guest whenever we want.
This patch moves the opening of the interrupt window inside
inject_pending_event. This consolidates the check for pending
interrupt/NMI/SMI in one place, and makes KVM's usage of immediate
exits more consistent, extending it beyond just nested virtualization.
There are two functional changes here. They only affect corner cases,
but overall they simplify the inject_pending_event.
- re-injection of still-pending events will also use req_immediate_exit
instead of using interrupt-window intercepts. This should have no impact
on performance on Intel since it simply replaces an interrupt-window
or NMI-window exit for a preemption-timer exit. On AMD, which has no
equivalent of the preemption time, it may incur some overhead but an
actual effect on performance should only be visible in pathological cases.
- kvm_arch_interrupt_allowed and kvm_vcpu_has_events will return true
if an interrupt, NMI or SMI is blocked by nested_run_pending. This
makes sense because entering the VM will allow it to make progress
and deliver the event.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch enables AMD Fam17h RAPL support for the Package level metric.
The support is as per AMD Fam17h Model31h (Zen2) and model 00-ffh (Zen1) PPR.
The same output is available via the energy-pkg pseudo event:
$ perf stat -a -I 1000 --per-socket -e power/energy-pkg/
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200527224659.206129-6-eranian@google.com
This patch modifies perf_probe_msr() by allowing passing of
struct perf_msr array where some entries are not populated, i.e.,
they have either an msr address of 0 or no attribute_group pointer.
This helps with certain call paths, e.g., RAPL.
In case the grp is NULL, the default sysfs visibility rule
applies which is to make the group visible. Without the patch,
you would get a kernel crash with a NULL group.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200527224659.206129-5-eranian@google.com
This patch modifies the default visibility of the attribute_group
for each RAPL event. By default if the grp.is_visible field is NULL,
sysfs considers that it must display the attribute group.
If the field is not NULL (callback function), then the return value
of the callback determines the visibility (0 = not visible). The RAPL
attribute groups had the field set to NULL, meaning that unless they
failed the probing from perf_msr_probe(), they would be visible. We want
to avoid having to specify attribute groups that are not supported by the HW
in the rapl_msrs[] array, they don't have an MSR address to begin with.
Therefore, we intialize the visible field of all RAPL attribute groups
to a callback that returns 0. If the RAPL msr goes through probing
and succeeds the is_visible field will be set back to NULL (visible).
If the probing fails the field is set to a callback that return 0 (not visible).
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200527224659.206129-4-eranian@google.com
This patch modifies the rapl_model struct to include architecture specific
knowledge in this previously Intel specific structure, and in particular
it adds the MSR for POWER_UNIT and the rapl_msrs array.
No functional changes.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200527224659.206129-3-eranian@google.com
To prepare for support of both Intel and AMD RAPL.
As per the AMD PPR, Fam17h support Package RAPL counters to monitor power usage.
The RAPL counter operates as with Intel RAPL, and as such it is beneficial
to share the code.
No change in functionality.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200527224659.206129-2-eranian@google.com
All callers of xen_register_pirq() pass -1 (no override) for the
gsi_override parameter. Remove it and related code.
Link: https://lore.kernel.org/r/20200428153640.76476-1-wei.liu@kernel.org
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
copy the corresponding pieces of init_fpstate into the gaps instead.
Cc: stable@kernel.org
Tested-by: Alexander Potapenko <glider@google.com>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Instead of calling kvm_event_needs_reinjection, track its
future return value in a variable. This will be useful in
the next patch.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
L2 guest hang is observed after 'exit_required' was dropped and nSVM
switched to check_nested_events() completely. The hang is a busy loop when
e.g. KVM is emulating an instruction (e.g. L2 is accessing MMIO space and
we drop to userspace). After nested_svm_vmexit() and when L1 is doing VMRUN
nested guest's RIP is not advanced so KVM goes into emulating the same
instruction which caused nested_svm_vmexit() and the loop continues.
nested_svm_vmexit() is not new, however, with check_nested_events() we're
now calling it later than before. In case by that time KVM has modified
register state we may pick stale values from VMCB when trying to save
nested guest state to nested VMCB.
nVMX code handles this case correctly: sync_vmcs02_to_vmcs12() called from
nested_vmx_vmexit() does e.g 'vmcs12->guest_rip = kvm_rip_read(vcpu)' and
this ensures KVM-made modifications are preserved. Do the same for nSVM.
Generally, nested_vmx_vmexit()/nested_svm_vmexit() need to pick up all
nested guest state modifications done by KVM after vmexit. It would be
great to find a way to express this in a way which would not require to
manually track these changes, e.g. nested_{vmcb,vmcs}_get_field().
Co-debugged-with: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20200527090102.220647-1-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Initialize vcpu->arch.tdp_level during vCPU creation to avoid consuming
garbage if userspace calls KVM_RUN without first calling KVM_SET_CPUID.
Fixes: e93fd3b3e8 ("KVM: x86/mmu: Capture TDP level when updating CPUID")
Reported-by: syzbot+904752567107eefb728c@syzkaller.appspotmail.com
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200527085400.23759-1-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Restoring the ASID from the hsave area on VMEXIT is wrong, because its
value depends on the handling of TLB flushes. Just skipping the field in
copy_vmcb_control_area will do.
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Async page faults have to be trapped in the host (L1 in this case),
since the APF reason was passed from L0 to L1 and stored in the L1 APF
data page. This was completely reversed: the page faults were passed
to the guest, a L2 hypervisor.
Cc: stable@vger.kernel.org
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
pic_in_kernel(), ioapic_in_kernel() and irqchip_kernel() have the
same implementation.
Signed-off-by: Peng Hao <richard.peng@oppo.com>
Message-Id: <HKAPR02MB4291D5926EA10B8BFE9EA0D3E0B70@HKAPR02MB4291.apcprd02.prod.outlook.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There is a bad indentation in next&queue branch. The patch looks like
fixes nothing though it fixes the indentation.
Before fixing:
if (!handle_fastpath_set_x2apic_icr_irqoff(vcpu, data)) {
kvm_skip_emulated_instruction(vcpu);
ret = EXIT_FASTPATH_EXIT_HANDLED;
}
break;
case MSR_IA32_TSCDEADLINE:
After fixing:
if (!handle_fastpath_set_x2apic_icr_irqoff(vcpu, data)) {
kvm_skip_emulated_instruction(vcpu);
ret = EXIT_FASTPATH_EXIT_HANDLED;
}
break;
case MSR_IA32_TSCDEADLINE:
Signed-off-by: Haiwei Li <lihaiwei@tencent.com>
Message-Id: <2f78457e-f3a7-3bc9-e237-3132ee87f71e@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The second "/* fall through */" in rmode_exception() makes code harder to
read. Replace it with "return" to indicate they are different cases, only
the #DB and #BP check vcpu->guest_debug, while others don't care. And this
also improves the readability.
Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Message-Id: <1582080348-20827-1-git-send-email-linmiaohe@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Take a u32 for the index in has_emulated_msr() to match hardware, which
treats MSR indices as unsigned 32-bit values. Functionally, taking a
signed int doesn't cause problems with the current code base, but could
theoretically cause problems with 32-bit KVM, e.g. if the index were
checked via a less-than statement, which would evaluate incorrectly for
MSR indices with bit 31 set.
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200218234012.7110-3-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove unnecessary brackets from a case statement that unintentionally
encapsulates unrelated case statements in the same switch statement.
While technically legal and functionally correct syntax, the brackets
are visually confusing and potentially dangerous, e.g. the last of the
encapsulated case statements has an undocumented fall-through that isn't
flagged by compilers due the encapsulation.
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200218234012.7110-2-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The migration functionality was left incomplete in commit 5ef8acbdd6
("KVM: nVMX: Emulate MTF when performing instruction emulation", 2020-02-23),
fix it.
Fixes: 5ef8acbdd6 ("KVM: nVMX: Emulate MTF when performing instruction emulation")
Cc: stable@vger.kernel.org
Reviewed-by: Oliver Upton <oupton@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We can simply look at bits 52-53 to identify MMIO entries in KVM's page
tables. Therefore, there is no need to pass a mask to kvm_mmu_set_mmio_spte_mask.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This msr is only available when the host supports WAITPKG feature.
This breaks a nested guest, if the L1 hypervisor is set to ignore
unknown msrs, because the only other safety check that the
kernel does is that it attempts to read the msr and
rejects it if it gets an exception.
Cc: stable@vger.kernel.org
Fixes: 6e3ba4abce ("KVM: vmx: Emulate MSR IA32_UMWAIT_CONTROL")
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20200523161455.3940-3-mlevitsk@redhat.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Even though we might not allow the guest to use WAITPKG's new
instructions, we should tell KVM that the feature is supported by the
host CPU.
Note that vmx_waitpkg_supported checks that WAITPKG _can_ be set in
secondary execution controls as specified by VMX capability MSR, rather
that we actually enable it for a guest.
Cc: stable@vger.kernel.org
Fixes: e69e72faa3 ("KVM: x86: Add support for user wait instructions")
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20200523161455.3940-2-mlevitsk@redhat.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Set the mmio_value to '0' instead of simply clearing the present bit to
squash a benign warning in kvm_mmu_set_mmio_spte_mask() that complains
about the mmio_value overlapping the lower GFN mask on systems with 52
bits of PA space.
Opportunistically clean up the code and comments.
Cc: stable@vger.kernel.org
Fixes: d43e2675e9 ("KVM: x86: only do L1TF workaround on affected processors")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200527084909.23492-1-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There is a generic, kernel wide configuration symbol for enabling the
IOMMU specific bits: CONFIG_IOMMU_API. Implementations (including
INTEL_IOMMU and AMD_IOMMU driver) select it so use it here as well.
This makes the conditional archdata.iommu field consistent with other
platforms and also fixes any compile test builds of other IOMMU drivers,
when INTEL_IOMMU or AMD_IOMMU are not selected).
For the case when INTEL_IOMMU/AMD_IOMMU and COMPILE_TEST are not
selected, this should create functionally equivalent code/choice. With
COMPILE_TEST this field could appear if other IOMMU drivers are chosen
but neither INTEL_IOMMU nor AMD_IOMMU are not.
Reported-by: kbuild test robot <lkp@intel.com>
Fixes: e93a1695d7 ("iommu: Enable compile testing for some of drivers")
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Acked-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20200518120855.27822-2-krzk@kernel.org
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Drop an extern declaration that has never been used and a no longer
needed macro.
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200513100944.9171-2-johan@kernel.org
Drop the APB-timer TSC calibration, which hasn't been used since the
removal of Moorestown support by commit
1a8359e411 ("x86/mid: Remove Intel Moorestown").
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200513100944.9171-1-johan@kernel.org
The commit
c84cb3735f ("x86/apic: Move TSC deadline timer debug printk")
removed the message which said that the deadline timer was enabled.
It added a pr_debug() message which is issued when deadline timer
validation succeeds.
Well, issued only when CONFIG_DYNAMIC_DEBUG is enabled - otherwise
pr_debug() calls get optimized away if DEBUG is not defined in the
compilation unit.
Therefore, make the above message pr_info() so that it is visible in
dmesg.
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200525104218.27018-1-bp@alien8.de
- Rename pr_efi/pr_efi_err to efi_info/efi_err, and use them consistently
- Simplify and unify initrd loading
- Parse the builtin command line on x86 (if provided)
- Implement printk() support, including support for wide character strings
- Some fixes for issues introduced by the first batch of v5.8 changes
- Fix a missing prototypes warning
- Simplify GDT handling in early mixed mode thunking code
- Some other minor fixes and cleanups
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEnNKg2mrY9zMBdeK7wjcgfpV0+n0FAl7Lb8UACgkQwjcgfpV0
+n3/aAgAkEqqR/BoyzFiyYHujq6bXjESKYr8LrIjNWfnofB6nZqp1yXwFdL0qbj/
PTZ1qIQAnOMmj11lvy1X894h2ZLqE6XEkqv7Xd2oxkh3fF6amlQUWfMpXUuGLo1k
C4QGSfA0OOiM0OOi0Aqk1fL7sTmH23/j63dTR+fH8JMuYgjdls/yWNs0miqf8W2H
ftj8fAKgHIJzFvdTC0vn1DZ6dEKczGLPEcVZ2ns2IJOJ69DsStKPLcD0mlW+EgV2
EyfRSCQv55RYZRhdUOb+yVLRfU0M0IMDrrCDErHxZHXnQy00tmKXiEL20yuegv3u
MUtRRw8ocn2/RskjgZkxtMjAAlty9A==
=AwCh
-----END PGP SIGNATURE-----
Merge tag 'efi-changes-for-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi into efi/core
More EFI changes for v5.8:
- Rename pr_efi/pr_efi_err to efi_info/efi_err, and use them consistently
- Simplify and unify initrd loading
- Parse the builtin command line on x86 (if provided)
- Implement printk() support, including support for wide character strings
- Some fixes for issues introduced by the first batch of v5.8 changes
- Fix a missing prototypes warning
- Simplify GDT handling in early mixed mode thunking code
- Some other minor fixes and cleanups
Conflicts:
drivers/firmware/efi/libstub/efistub.h
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The MSCC bug fix in 'net' had to be slightly adjusted because the
register accesses are done slightly differently in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
- Don't return a garbage screen info when EFI framebuffer is not available
- Make the early EFI console work proper with wider fonts instead of drawing
garbage
- Prevent a memory buffer leak in allocate_e820()
- Print the firmware error record proper so it can be decoded by users
- Fix a symbol clash in the host tool build which only happens with newer
compilers.
- Add a missing check for the event log version of TPM which caused boot
fails on several Dell systems due to an attempt to decode SHA-1 format
with the crypto agile algorithm
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl7KiA4THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoS+WD/93Pd1AyO2wX8EBp7hKMFIof2fUlFGd
yErHZibCZzTbvxsN+g1WNJBI7/9OjCcU6rG383ky4hZzZZ/pjtLhOSO08Q9mGsIg
8lTQWozBWTjz73uHi4XItF9UkXy3QG7CwWFa3wgpD/VxTEgcEblczDJ+YU9TNX/L
cA2TnSmjey+Kh6BbBMp1mKIpFN+hryAv70d2qaJJJkEolNDAkDpRfefaVdlbm/2E
8gqNXaETn7nhaJpyHEqjipOAKYsTHbASfB3NLXjh+R88em5v0LGg/EZ9UTgFAq2x
kQZr/O3wgOZ5ahzhtCcTx8VyTVE7AFxJqi8MkaTEYzQeFwK5xBJgf/tpJ/2CH7aj
S0/dDZ4U3hyG29DI6BKDAQOxX5H315Q/7FBLLVHXGXO4VEz18qowZN4WkemPB27m
5jt07TCHp+tf3TAaunjmISrUrh8ZD4SB6gnMHKXM2x7t80hxoqm7gI6Cf4tdGB3S
DK6uYMydG5ecdmtrgZzWWDVX42D6vfKkVdYuAZlrBaZ6gFGs4WM2vmDDozmx5MRk
znFr5hjjVoXyBwTs0UavBeSCOlB0/ifXICzg0Ba5/wG1Li9DUX3KwG7mlWVJnyfo
r/CryLmeIEZ7JPl60+gXT3Nnd6dTgiA4EcR53HhPEbSoJ+58ITcuxPm4lCRdesJK
QLlF4Yye/nn14Q==
=BiIm
-----END PGP SIGNATURE-----
Merge tag 'efi-urgent-2020-05-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull EFI fixes from Thomas Gleixner:
"A set of EFI fixes:
- Don't return a garbage screen info when EFI framebuffer is not
available
- Make the early EFI console work properly with wider fonts instead
of drawing garbage
- Prevent a memory buffer leak in allocate_e820()
- Print the firmware error record properly so it can be decoded by
users
- Fix a symbol clash in the host tool build which only happens with
newer compilers.
- Add a missing check for the event log version of TPM which caused
boot failures on several Dell systems due to an attempt to decode
SHA-1 format with the crypto agile algorithm"
* tag 'efi-urgent-2020-05-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tpm: check event log version before reading final events
efi: Pull up arch-specific prototype efi_systab_show_arch()
x86/boot: Mark global variables as static
efi: cper: Add support for printing Firmware Error Record Reference
efi/libstub/x86: Avoid EFI map buffer alloc in allocate_e820()
efi/earlycon: Fix early printk for wider fonts
efi/libstub: Avoid returning uninitialized data from setup_graphics()
- Unbreak stack dumps for inactive tasks by interpreting the special
first frame left by __switch_to_asm() correctly. The recent change not
to skip the first frame so ORC and frame unwinder behave in the same
way caused all entries to be unreliable, i.e. prepended with '?'.
- Use cpumask_available() instead of an implicit NULL check of a
cpumask_var_t in mmio trace to prevent a Clang build warning
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl7KjZITHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYofGED/9q61QWzA7WULpps2UA1JDa8JGvxwIl
Z/juNskVAXWZRlBOACPD7mZNz3tkfHnl62igHIwUNlEddSANQ22c/7Yt74w+lCcD
KtPVVx/zdnO2nt5HVekRT6D9pKfD8cfSF4X2k2/HF6u8hQGoqWGv2BVBuarNurWE
3CIFtLbNvBhjI4WdzK7Y0IfcINSkcyABQn1+9Id8mwH8XOStl1aaIMY7hxlpj9e4
mXoQtkbRXnTbv6Asw6Obb1F/7AtCdaDrNqfBCA0Juv4fJzPQOMgZSWEX6OtZny6E
8vsDsSCYOY4wcGYH5CBJd3n48UOsrWbT+7yNLeAnE7ZaZzc0pdi0g2NVWUGuhYa3
EbPzvj+kPgcVsfpfasts9KRAR57GNysKD8MLZGqaST9MAB7EKLbPWEDnhNnsAnU5
3KNFEbfB16CyJmztlE2YCT6nNJ3rzaOtcDiRmJduf0Ib9PEEkPaaX85DfO0Yabnn
QilGsYbkdux+UTQUtZg6+HPsikcKiN46hOLrSXXu1O+iMDxhL/mq/79hNrO9hffI
idV+js2nxv9tC30MMczMdPuUX4nOHs26IMZObdV88gDMV9n9TGkW+XinoJBi/+er
3xuDQw6aRqpolMmUVFhBLV0gYTB2+J0zc3eawa5c6U6B9avc4j4KxkNVIfrLiRkK
3brABHq+di44MA==
=COcb
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2020-05-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"Two fixes for x86:
- Unbreak stack dumps for inactive tasks by interpreting the special
first frame left by __switch_to_asm() correctly.
The recent change not to skip the first frame so ORC and frame
unwinder behave in the same way caused all entries to be
unreliable, i.e. prepended with '?'.
- Use cpumask_available() instead of an implicit NULL check of a
cpumask_var_t in mmio trace to prevent a Clang build warning"
* tag 'x86-urgent-2020-05-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/unwind/orc: Fix unwind_get_return_address_ptr() for inactive tasks
x86/mmiotrace: Use cpumask_available() for cpumask_var_t variables
Instead of using efi_gdt64 to switch back to 64-bit mode and then
switching to the real boot-time GDT, just switch to the boot-time GDT
directly. The two GDT's are identical other than efi_gdt64 not including
the 32-bit code segment.
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Link: https://lore.kernel.org/r/20200523221513.1642948-1-nivedita@alum.mit.edu
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
This is easily reproducible via CC=clang + CONFIG_STAGING=y +
CONFIG_VT6656=m.
It turns out that if your config tickles __builtin_constant_p via
differences in choices to inline or not, these statements produce
invalid assembly:
$ cat foo.c
long a(long b, long c) {
asm("orb %1, %0" : "+q"(c): "r"(b));
return c;
}
$ gcc foo.c
foo.c: Assembler messages:
foo.c:2: Error: `%rax' not allowed with `orb'
Use the `%b` "x86 Operand Modifier" to instead force register allocation
to select a lower-8-bit GPR operand.
The "q" constraint only has meaning on -m32 otherwise is treated as
"r". Not all GPRs have low-8-bit aliases for -m32.
Fixes: 1651e70066 ("x86: Fix bitops.h warning with a moved cast")
Reported-by: kernelci.org bot <bot@kernelci.org>
Suggested-by: Andy Shevchenko <andriy.shevchenko@intel.com>
Suggested-by: Brian Gerst <brgerst@gmail.com>
Suggested-by: H. Peter Anvin <hpa@zytor.com>
Suggested-by: Ilie Halip <ilie.halip@gmail.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Tested-by: Nathan Chancellor <natechancellor@gmail.com> [build, clang-11]
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-By: Brian Gerst <brgerst@gmail.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Marco Elver <elver@google.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Daniel Axtens <dja@axtens.net>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20200508183230.229464-1-ndesaulniers@google.com
Link: https://github.com/ClangBuiltLinux/linux/issues/961
Link: https://lore.kernel.org/lkml/20200504193524.GA221287@google.com/
Link: https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#x86Operandmodifiers
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Distributed GRU mode appeared in only one generation of UV hardware,
and no version of the BIOS has shipped with this feature enabled, and
we have no plans to ever change that. The gru.s3.mode check has
always been and will continue to be false. So remove this dead code.
Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Dimitri Sivanich <sivanich@hpe.com>
Link: https://lkml.kernel.org/r/20200513221123.GJ3240@raspberrypi
- fix EFI framebuffer earlycon for wide fonts
- avoid filling screen_info with garbage if the EFI framebuffer is not
available
- fix a potential host tool build error due to a symbol clash on x86
- work around a EFI firmware bug regarding the binary format of the TPM
final events table
- fix a missing memory free by reworking the E820 table sizing routine to
not do the allocation in the first place
- add CPER parsing for firmware errors
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEnNKg2mrY9zMBdeK7wjcgfpV0+n0FAl7H3HIACgkQwjcgfpV0
+n1pEAgAjJfwDJmBcYhJzjX8WLnXPJiUmUH9d9tF1t3TlhF6c1G8auXU+Fyia4uI
ejRNw/N4+SXzM9yL+Z19PKBpQsPzQXgm2r9WTPVN5jTelUUI+jFZCH+pKC+TKRp1
/Tx/XIMifCw18gNXsjj6WJEeAyLoh4tb+6bwn7DlPO5cPrxX49LvPuQNMXybk2yi
KimdNKUry1wYpo/WpHqEdFq5//CLAWNkrL9UXlkANvQ6BJNIMI0kRIUC0MVsTMnE
BoCkBO93PdvqxOcnV3WTRvSFetb7qA59Jay62jLc26Myqc4t4pgVWojVm6RHLfZg
17btYACxICgF2mNTZYlKemEEqKPpzQ==
=mY5f
-----END PGP SIGNATURE-----
Merge tag 'efi-fixes-for-v5.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi into efi/urgent
Pull EFI fixes from Ard Biesheuvel:
"- fix EFI framebuffer earlycon for wide fonts
- avoid filling screen_info with garbage if the EFI framebuffer is not
available
- fix a potential host tool build error due to a symbol clash on x86
- work around a EFI firmware bug regarding the binary format of the TPM
final events table
- fix a missing memory free by reworking the E820 table sizing routine to
not do the allocation in the first place
- add CPER parsing for firmware errors"
Normally, show_trace_log_lvl() scans the stack, looking for text
addresses to print. In parallel, it unwinds the stack with
unwind_next_frame(). If the stack address matches the pointer returned
by unwind_get_return_address_ptr() for the current frame, the text
address is printed normally without a question mark. Otherwise it's
considered a breadcrumb (potentially from a previous call path) and it's
printed with a question mark to indicate that the address is unreliable
and typically can be ignored.
Since the following commit:
f1d9a2abff ("x86/unwind/orc: Don't skip the first frame for inactive tasks")
... for inactive tasks, show_trace_log_lvl() prints *only* unreliable
addresses (prepended with '?').
That happens because, for the first frame of an inactive task,
unwind_get_return_address_ptr() returns the wrong return address
pointer: one word *below* the task stack pointer. show_trace_log_lvl()
starts scanning at the stack pointer itself, so it never finds the first
'reliable' address, causing only guesses to being printed.
The first frame of an inactive task isn't a normal stack frame. It's
actually just an instance of 'struct inactive_task_frame' which is left
behind by __switch_to_asm(). Now that this inactive frame is actually
exposed to callers, fix unwind_get_return_address_ptr() to interpret it
properly.
Fixes: f1d9a2abff ("x86/unwind/orc: Don't skip the first frame for inactive tasks")
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200522135435.vbxs7umku5pyrdbk@treble
Add PCI IDs for AMD Renoir (4000-series Ryzen CPUs). This is necessary
to enable support for temperature sensors via the k10temp module.
Signed-off-by: Alexander Monakov <amonakov@ispras.ru>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Yazen Ghannam <yazen.ghannam@amd.com>
Acked-by: Guenter Roeck <linux@roeck-us.net>
Link: https://lkml.kernel.org/r/20200510204842.2603-2-amonakov@ispras.ru
With commit
ce5e3f909f ("efi/printf: Add 64-bit and 8-bit integer support")
arch/x86/boot/compressed/vmlinux may have an undesired .discard.unreachable
section coming from drivers/firmware/efi/libstub/vsprintf.stub.o. That section
gets generated from unreachable() annotations when CONFIG_STACK_VALIDATION is
enabled.
.discard.unreachable contains an R_X86_64_PC32 relocation which will be
warned about by LLD: a non-SHF_ALLOC section (.discard.unreachable) is
not part of the memory image, thus conceptually the distance between a
non-SHF_ALLOC and a SHF_ALLOC is not a constant which can be resolved at
link time:
% ld.lld -m elf_x86_64 -T arch/x86/boot/compressed/vmlinux.lds ... -o arch/x86/boot/compressed/vmlinux
ld.lld: warning: vsprintf.c:(.discard.unreachable+0x0): has non-ABS relocation R_X86_64_PC32 against symbol ''
Reuse the DISCARDS macro which includes .discard.* to drop
.discard.unreachable.
[ bp: Massage and complete the commit message. ]
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Fangrui Song <maskray@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Arvind Sankar <nivedita@alum.mit.edu>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Link: https://lkml.kernel.org/r/20200520182010.242489-1-maskray@google.com
Changing base clock frequency directly impacts TSC Hz but not CPUID.16h
value. An overclocked CPU supporting CPUID.16h and with partial CPUID.15h
support will set TSC KHZ according to "best guess" given by CPUID.16h
relying on tsc_refine_calibration_work to give better numbers later.
tsc_refine_calibration_work will refuse to do its work when the outcome is
off the early TSC KHZ value by more than 1% which is certain to happen on
an overclocked system.
Fix this by adding a tsc_early_khz command line parameter that makes the
kernel skip early TSC calibration and use the given value instead.
This allows the user to provide the expected TSC frequency that is closer
to reality than the one reported by the hardware, enabling
tsc_refine_calibration_work to do meaningful error checking.
[ tglx: Made the variable __initdata as it's only used on init and
removed the error checking in the argument parser because
kstrto*() only stores to the variable if the string is valid ]
Signed-off-by: Krzysztof Piecuch <piecuch@protonmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/O2CpIOrqLZHgNRkfjRpz_LGqnc1ix_seNIiOCvHY4RHoulOVRo6kMXKuLOfBVTi0SMMevg6Go1uZ_cL9fLYtYdTRNH78ChaFaZyG3VAyYz8=@protonmail.com
Add the required typedefs etc for using con_in's simple text input
protocol, and for using the boottime event services.
Also add the prototype for the "stall" boot service.
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Link: https://lore.kernel.org/r/20200518190716.751506-19-nivedita@alum.mit.edu
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
In preparation for adding ARM64 support, split hyperv-tlfs.h into
architecture dependent and architecture independent files, similar
to what has been done with mshyperv.h. Move architecture independent
definitions into include/asm-generic/hyperv-tlfs.h. The split will
avoid duplicating significant lines of code in the ARM64 version of
hyperv-tlfs.h. The split has no functional impact.
Some of the common definitions have "X64" in the symbol name. Change
these to remove the "X64" in the architecture independent version of
hyperv-tlfs.h, but add aliases with the "X64" in the x86 version so
that x86 code will continue to compile. A later patch set will
change all the references and allow removal of the aliases.
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/20200422195737.10223-4-mikelley@microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
The HV_PROCESSOR_POWER_STATE_C<n> #defines date back to year 2010,
but they are not in the TLFS v6.0 document and are not used anywhere
in Linux. Remove them.
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/20200422195737.10223-3-mikelley@microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
The Hyper-V Reference TSC Page structure is defined twice. struct
ms_hyperv_tsc_page has padding out to a full 4 Kbyte page size. But
the padding is not needed because the declaration includes a union
with HV_HYP_PAGE_SIZE. KVM uses the second definition, which is
struct _HV_REFERENCE_TSC_PAGE, because it does not have the padding.
Fix the duplication by removing the padding from ms_hyperv_tsc_page.
Fix up the KVM code to use it. Remove the no longer used struct
_HV_REFERENCE_TSC_PAGE.
There is no functional change.
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20200422195737.10223-2-mikelley@microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:
struct foo {
int stuff;
struct boo array[];
};
By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.
Also, notice that, dynamic memory allocations won't be affected by
this change:
"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
sizeof(flexible-array-member) triggers a warning because flexible array
members have incomplete type[1]. There are some instances of code in
which the sizeof operator is being incorrectly/erroneously applied to
zero-length arrays and the result is zero. Such instances may be hiding
some bugs. So, this work (flexible-array member conversions) will also
help to get completely rid of those sorts of issues.
This issue was found with the help of Coccinelle.
[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200511200911.GA13149@embeddedor
The mask in the extra_regs for Intel Tremont need to be extended to
allow more defined bits.
"Outstanding Requests" (bit 63) is only available on MSR_OFFCORE_RSP0;
Fixes: 6daeb8737f ("perf/x86/intel: Add Tremont core PMU support")
Reported-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20200501125442.7030-1-kan.liang@linux.intel.com
Enable RAPL support for Intel Ice Lake X and Ice Lake D.
For RAPL support, it is identical to Sky Lake X.
Reported-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1588857258-38213-1-git-send-email-kan.liang@linux.intel.com
When building with Clang + -Wtautological-compare and
CONFIG_CPUMASK_OFFSTACK unset:
arch/x86/mm/mmio-mod.c:375:6: warning: comparison of array 'downed_cpus'
equal to a null pointer is always false [-Wtautological-pointer-compare]
if (downed_cpus == NULL &&
^~~~~~~~~~~ ~~~~
arch/x86/mm/mmio-mod.c:405:6: warning: comparison of array 'downed_cpus'
equal to a null pointer is always false [-Wtautological-pointer-compare]
if (downed_cpus == NULL || cpumask_weight(downed_cpus) == 0)
^~~~~~~~~~~ ~~~~
2 warnings generated.
Commit
f7e30f01a9 ("cpumask: Add helper cpumask_available()")
added cpumask_available() to fix warnings of this nature. Use that here
so that clang does not warn regardless of CONFIG_CPUMASK_OFFSTACK's
value.
Reported-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://github.com/ClangBuiltLinux/linux/issues/982
Link: https://lkml.kernel.org/r/20200408205323.44490-1-natechancellor@gmail.com
The async page fault injection into kernel space creates more problems than
it solves. The host has absolutely no knowledge about the state of the
guest if the fault happens in CPL0. The only restriction for the host is
interrupt disabled state. If interrupts are enabled in the guest then the
exception can hit arbitrary code. The HALT based wait in non-preemotible
code is a hacky replacement for a proper hypercall.
For the ongoing work to restrict instrumentation and make the RCU idle
interaction well defined the required extra work for supporting async
pagefault in CPL0 is just not justified and creates complexity for a
dubious benefit.
The CPL3 injection is well defined and does not cause any issues as it is
more or less the same as a regular page fault from CPL3.
Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200505134059.369802541@linutronix.de
While working on the entry consolidation I stumbled over the KVM async page
fault handler and kvm_async_pf_task_wait() in particular. It took me a
while to realize that the randomly sprinkled around rcu_irq_enter()/exit()
invocations are just cargo cult programming. Several patches "fixed" RCU
splats by curing the symptoms without noticing that the code is flawed
from a design perspective.
The main problem is that this async injection is not based on a proper
handshake mechanism and only respects the minimal requirement, i.e. the
guest is not in a state where it has interrupts disabled.
Aside of that the actual code is a convoluted one fits it all swiss army
knife. It is invoked from different places with different RCU constraints:
1) Host side:
vcpu_enter_guest()
kvm_x86_ops->handle_exit()
kvm_handle_page_fault()
kvm_async_pf_task_wait()
The invocation happens from fully preemptible context.
2) Guest side:
The async page fault interrupted:
a) user space
b) preemptible kernel code which is not in a RCU read side
critical section
c) non-preemtible kernel code or a RCU read side critical section
or kernel code with CONFIG_PREEMPTION=n which allows not to
differentiate between #2b and #2c.
RCU is watching for:
#1 The vCPU exited and current is definitely not the idle task
#2a The #PF entry code on the guest went through enter_from_user_mode()
which reactivates RCU
#2b There is no preemptible, interrupts enabled code in the kernel
which can run with RCU looking away. (The idle task is always
non preemptible).
I.e. all schedulable states (#1, #2a, #2b) do not need any of this RCU
voodoo at all.
In #2c RCU is eventually not watching, but as that state cannot schedule
anyway there is no point to worry about it so it has to invoke
rcu_irq_enter() before running that code. This can be optimized, but this
will be done as an extra step in course of the entry code consolidation
work.
So the proper solution for this is to:
- Split kvm_async_pf_task_wait() into schedule and halt based waiting
interfaces which share the enqueueing code.
- Add comments (condensed form of this changelog) to spare others the
time waste and pain of reverse engineering all of this with the help of
uncomprehensible changelogs and code history.
- Invoke kvm_async_pf_task_wait_schedule() from kvm_handle_page_fault(),
user mode and schedulable kernel side async page faults (#1, #2a, #2b)
- Invoke kvm_async_pf_task_wait_halt() for the non schedulable kernel
case (#2c).
For this case also remove the rcu_irq_exit()/enter() pair around the
halt as it is just a pointless exercise:
- vCPUs can VMEXIT at any random point and can be scheduled out for
an arbitrary amount of time by the host and this is not any
different except that it voluntary triggers the exit via halt.
- The interrupted context could have RCU watching already. So the
rcu_irq_exit() before the halt is not gaining anything aside of
confusing the reader. Claiming that this might prevent RCU stalls
is just an illusion.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200505134059.262701431@linutronix.de
KVM overloads #PF to indicate two types of not-actually-page-fault
events. Right now, the KVM guest code intercepts them by modifying
the IDT and hooking the #PF vector. This makes the already fragile
fault code even harder to understand, and it also pollutes call
traces with async_page_fault and do_async_page_fault for normal page
faults.
Clean it up by moving the logic into do_page_fault() using a static
branch. This gets rid of the platform trap_init override mechanism
completely.
[ tglx: Fixed up 32bit, removed error code from the async functions and
massaged coding style ]
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200505134059.169270470@linutronix.de
A few exceptions (like #DB and #BP) can happen at any location in the code,
this then means that tracers should treat events from these exceptions as
NMI-like. The interrupted context could be holding locks with interrupts
disabled for instance.
Similarly, #MC is an actual NMI-like exception.
All of them use ist_enter() which only concerns itself with RCU, but does
not do any of the other setup that NMIs need. This means things like:
printk()
raw_spin_lock_irq(&logbuf_lock);
<#DB/#BP/#MC>
printk()
raw_spin_lock_irq(&logbuf_lock);
are entirely possible (well, not really since printk tries hard to
play nice, but the concept stands).
So replace ist_enter() with nmi_enter(). Also observe that any nmi_enter()
caller must be both notrace and NOKPROBE, or in the noinstr text section.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Link: https://lkml.kernel.org/r/20200505134101.525508608@linutronix.de
Convert #MC over to using task_work_add(); it will run the same code
slightly later, on the return to user path of the same exception.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Link: https://lkml.kernel.org/r/20200505134100.957390899@linutronix.de
This is completely overengineered and definitely not an interface which
should be made available to anything else than this particular MCE case.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200505134059.462640294@linutronix.de
For the 32-bit kernel, as described in
6d92bc9d48 ("x86/build: Build compressed x86 kernels as PIE"),
pre-2.26 binutils generates R_386_32 relocations in PIE mode. Since the
startup code does not perform relocation, any reloc entry with R_386_32
will remain as 0 in the executing code.
Commit
974f221c84 ("x86/boot: Move compressed kernel to the end of the
decompression buffer")
added a new symbol _end but did not mark it hidden, which doesn't give
the correct offset on older linkers. This causes the compressed kernel
to be copied beyond the end of the decompression buffer, rather than
flush against it. This region of memory may be reserved or already
allocated for other purposes by the bootloader.
Mark _end as hidden to fix. This changes the relocation from R_386_32 to
R_386_RELATIVE even on the pre-2.26 binutils.
For 64-bit, this is not strictly necessary, as the 64-bit kernel is only
built as PIE if the linker supports -z noreloc-overflow, which implies
binutils-2.27+, but for consistency, mark _end as hidden here too.
The below illustrates the before/after impact of the patch using
binutils-2.25 and gcc-4.6.4 (locally compiled from source) and QEMU.
Disassembly before patch:
48: 8b 86 60 02 00 00 mov 0x260(%esi),%eax
4e: 2d 00 00 00 00 sub $0x0,%eax
4f: R_386_32 _end
Disassembly after patch:
48: 8b 86 60 02 00 00 mov 0x260(%esi),%eax
4e: 2d 00 f0 76 00 sub $0x76f000,%eax
4f: R_386_RELATIVE *ABS*
Dump from extract_kernel before patch:
early console in extract_kernel
input_data: 0x0207c098 <--- this is at output + init_size
input_len: 0x0074fef1
output: 0x01000000
output_len: 0x00fa63d0
kernel_total_size: 0x0107c000
needed_size: 0x0107c000
Dump from extract_kernel after patch:
early console in extract_kernel
input_data: 0x0190d098 <--- this is at output + init_size - _end
input_len: 0x0074fef1
output: 0x01000000
output_len: 0x00fa63d0
kernel_total_size: 0x0107c000
needed_size: 0x0107c000
Fixes: 974f221c84 ("x86/boot: Move compressed kernel to the end of the decompression buffer")
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200207214926.3564079-1-nivedita@alum.mit.edu
KVM stores the gfn in MMIO SPTEs as a caching optimization. These are split
in two parts, as in "[high 11111 low]", to thwart any attempt to use these bits
in an L1TF attack. This works as long as there are 5 free bits between
MAXPHYADDR and bit 50 (inclusive), leaving bit 51 free so that the MMIO
access triggers a reserved-bit-set page fault.
The bit positions however were computed wrongly for AMD processors that have
encryption support. In this case, x86_phys_bits is reduced (for example
from 48 to 43, to account for the C bit at position 47 and four bits used
internally to store the SEV ASID and other stuff) while x86_cache_bits in
would remain set to 48, and _all_ bits between the reduced MAXPHYADDR
and bit 51 are set. Then low_phys_bits would also cover some of the
bits that are set in the shadow_mmio_value, terribly confusing the gfn
caching mechanism.
To fix this, avoid splitting gfns as long as the processor does not have
the L1TF bug (which includes all AMD processors). When there is no
splitting, low_phys_bits can be set to the reduced MAXPHYADDR removing
the overlap. This fixes "npt=0" operation on EPYC processors.
Thanks to Maxim Levitsky for bisecting this bug.
Cc: stable@vger.kernel.org
Fixes: 52918ed5fc ("KVM: SVM: Override default MMIO mask if memory encryption is enabled")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Current minimum required version of binutils is 2.23,
which supports RDRAND and RDSEED instruction mnemonics.
Replace the byte-wise specification of RDRAND and
RDSEED with these proper mnemonics.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200508105817.207887-1-ubizjak@gmail.com
Using kgdb requires at least some level of architecture-level
initialization. If nothing else, it relies on the architecture to
pass breakpoints / crashes onto kgdb.
On some architectures this all works super early, specifically it
starts working at some point in time before Linux parses
early_params's. On other architectures it doesn't. A survey of a few
platforms:
a) x86: Presumably it all works early since "ekgdboc" is documented to
work here.
b) arm64: Catching crashes works; with a simple patch breakpoints can
also be made to work.
c) arm: Nothing in kgdb works until
paging_init() -> devicemaps_init() -> early_trap_init()
Let's be conservative and, by default, process "kgdbwait" (which tells
the kernel to drop into the debugger ASAP at boot) a bit later at
dbg_late_init() time. If an architecture has tested it and wants to
re-enable super early debugging, they can select the
ARCH_HAS_EARLY_DEBUG KConfig option. We'll do this for x86 to start.
It should be noted that dbg_late_init() is still called quite early in
the system.
Note that this patch doesn't affect when kgdb runs its init. If kgdb
is set to initialize early it will still initialize when parsing
early_param's. This patch _only_ inhibits the initial breakpoint from
"kgdbwait". This means:
* Without any extra patches arm64 platforms will at least catch
crashes after kgdb inits.
* arm platforms will catch crashes (and could handle a hardcoded
kgdb_breakpoint()) any time after early_trap_init() runs, even
before dbg_late_init().
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20200507130644.v4.4.I3113aea1b08d8ce36dc3720209392ae8b815201b@changeid
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl7BzV8eHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGg8EH/A2pXMTxtc96RI4S
sttEsUQqbakFS0Z/2tQPpMGr/qW2e5eHgsTX/a3SiUeZiIXk6f4lMFkMuctzBf7p
X77cNEDwGOEdbtCXTsMcmKSde7sP2zCXsPB8xTWLyE6rnaFRgikwwkeqgkIKhp1h
bvOQV0t9HNGvxGAM0iZeOvQAvFl4vd7nS123/MYbir9cugfQUSJRueQ4BiCiJqVE
6cNA7/vFzDJuFGszzIrJ7HXn/IdQMMWHkvTDjgBw0GZw1mDbGFbfbZwOeTz1ojCt
smUQ4tIFxBa/VA5zx7dOy2P2keHbSVf4VLkZRPcceT7OqVS65ETmFDp+qt5NdWM5
vZ8+7/0=
=CyYH
-----END PGP SIGNATURE-----
Merge tag 'v5.7-rc6' into objtool/core, to pick up fixes and resolve semantic conflict
Resolve structural conflict between:
59566b0b62: ("x86/ftrace: Have ftrace trampolines turn read-only at the end of system boot up")
which introduced a new reference to 'ftrace_epilogue', and:
0298739b79: ("x86,ftrace: Fix ftrace_regs_caller() unwind")
Which renamed it to 'ftrace_caller_end'. Rename the new usage site in the merge commit.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
tells the unwinding code whether a stack trace can be trusted or not is
always set correctly. This was messed up by a couple of changes in the
recent past.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl7BC+gTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoWFBEACR8MiO0VM2XXNsejd7rttgs/eoC4/M
IKM5K1hq4eRCTodwVnkWwLk6p0asAMKhzpWQ3MS5RJBNAYxLbbxnsYSGtd8zIsdV
wk6jbNYeT2MUZq2tYkjn3b9B6+91FFMZq6q+KDOfNPqcKZyP4n5o5QSewznBvQwt
dHvjGgegJDjrrtuhLSQKG/uvSSi2hN9S5ibSMCa004GnH6P+uk/eICpvUXwNCyjV
ygogYTmQQqAEqnlqVNdQxo+DFYbaxKCw12VSoBeOsEySljPdc136hP/j7Tzbf2em
rkqtyXwng1+yG0vozMCAkyP5l3uA+HUculQLdmO8/55eia5Dl/zgsp3SvW7/2ONS
0DRfGo0ghoZgId1oDu6DGPsX80wKKskerJpTN/tHWTXQWeUXCNXrX//lhrFiwd7P
mHiyuk+INw3LQBkTlf7XhAf28w/9/+gCm3prEGnUCmLaJOeZ8HtL0mwDzudgc9Ca
NW/b3tdt4JU3oXKyyqywr4XAYfxlfmyf3DrBMnuHdTgccaB9PAAzugjmDnFJOuzk
jQw/Qfd6w7ZgVcVoaNQjjeogMTryGthCOPe9DzPUgkr+jCDsMwXopCvxbhbWI9e5
L1/U5ilka/VC2ZP7qZUvwsltCgp6RamhDb3yLZbn/2PKf0sFKVoI/j/g1qMnLNZt
TBNjzYuWAC8Hlw==
=4kDr
-----END PGP SIGNATURE-----
Merge tag 'objtool-urgent-2020-05-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 stack unwinding fix from Thomas Gleixner:
"A single bugfix for the ORC unwinder to ensure that the error flag
which tells the unwinding code whether a stack trace can be trusted or
not is always set correctly.
This was messed up by a couple of changes in the recent past"
* tag 'objtool-urgent-2020-05-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/unwind/orc: Fix error handling in __unwind_start()
stack protector enabled.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl7A+q4ACgkQEsHwGGHe
VUpvtA/+NNPKVGSKZPdDlUm64JEPy7XrbzFJ+zigWGQjUPtZsDkAT4U33eQIvV5f
ea7vB2u+e7iRZBExgTI1JfyjTenGpBffhubR/ueawtxeTgvZSopFajHQir/VGPlJ
KQdtqe2wZek3Wux8BsKl8vcbqhgNH/LKgQzoG2y5P1LuA77MpFkMVkAoxKqbTDbt
Nx7j147ffZBJHfmUHz2/nWD9r0Exu+abeSPJeO4T52ImhVkr+Pd1nFS8S+mRCHMj
uJjxL/nB/sZmDDX+EX/zA7Du3ibaVa2po9cuhMTwNIPZIpak8Yyopl64fVm/N7jH
w0DIc1CgEaA1IkG7lwyKSgB/T6Fsg4SQp8gM4V3BkcTgVDuhTH0J/kGrOk2+YFSc
akk3420XBS4Q54BQ547woOImabxgQXDBvqBq+DhJFwP1qSllUXbZX7rlwZ3VQ160
sfmItVM0c4J9bgaXqZuwqHxJdgakaIECkXWZwpksQAzVxaOKpZo7drLq6SDhX9HH
BZdm/5AhIJ5rIGaiMXsZj5cC+H341N5TlaXA+I2b0r/vVOLtbe3it1rbSsvMoZJQ
7WOesyqFSjSObDUpXZ0riLl1X+rdrCAfzHsm5IMwLAoxmv80973johZKNZIgqIoh
CbPdyvaJoNK8FK6gT7bw3HNJ1ILGqk53jpWH1Gr1MlfzSzErOdQ=
=5Xi5
-----END PGP SIGNATURE-----
Merge tag 'x86_urgent_for_v5.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fix from Borislav Petkov:
"A single fix for early boot crashes of kernels built with gcc10 and
stack protector enabled"
* tag 'x86_urgent_for_v5.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Fix early boot crash on gcc-10, third try
bugs, mostly for AMD processors. And a few other x86 fixes.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAl6/0xcUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroOZuwf/bQZw/SP9awLjOOVsRaSWUmwRGD4q
6KVq9+JYsPU4CyJ7P+vdsFF39a0ixoAnKWqRe/vsXdXZrdYCDUuQxh+7X+lmjKAb
dCQBnoqxI0w3yuxrm9Kn6Xs1AGIWibaRlZnXUKbuyn4ecFrh08OfYKGkYsEovhxK
G4ftY4/xyM7Qvm0fq7ZmzxPrkzd74HDZBvB83R6uiyPiX3w4O9qumqkUogcVXIJX
l3mnvSPClDDX4FOr8uhnU93varuR7Bek4Fh+Abj4uNks/F3z9ooJO9Hy9E+V5fhY
g6Oj2IrxDwJ2G6hqyucr1kujukJC1bX2nMZ1O4gNayXsxZEU/JtI0Y26SA==
=EzBt
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"A new testcase for guest debugging (gdbstub) that exposed a bunch of
bugs, mostly for AMD processors. And a few other x86 fixes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86: Fix off-by-one error in kvm_vcpu_ioctl_x86_setup_mce
KVM: x86: Fix pkru save/restore when guest CR4.PKE=0, move it to x86.c
KVM: SVM: Disable AVIC before setting V_IRQ
KVM: Introduce kvm_make_all_cpus_request_except()
KVM: VMX: pass correct DR6 for GD userspace exit
KVM: x86, SVM: isolate vcpu->arch.dr6 from vmcb->save.dr6
KVM: SVM: keep DR6 synchronized with vcpu->arch.dr6
KVM: nSVM: trap #DB and #BP to userspace if guest debugging is on
KVM: selftests: Add KVM_SET_GUEST_DEBUG test
KVM: X86: Fix single-step with KVM_SET_GUEST_DEBUG
KVM: X86: Set RTM for DB_VECTOR too for KVM_EXIT_DEBUG
KVM: x86: fix DR6 delivery for various cases of #DB injection
KVM: X86: Declare KVM_CAP_SET_GUEST_DEBUG properly
The signal return fast path directly restores user states from the user
buffer. Once that succeeds, restore supervisor states (but only when
they are not yet restored).
For the slow path, save supervisor states to preserve them across context
switches, and restore after the user states are restored.
The previous version has the overhead of an XSAVES in both the fast and the
slow paths. It is addressed as the following:
- In the fast path, only do an XRSTORS.
- In the slow path, do a supervisor-state-only XSAVES, and relocate the
buffer contents.
Some thoughts in the implementation:
- In the slow path, can any supervisor state become stale between
save/restore?
Answer: set_thread_flag(TIF_NEED_FPU_LOAD) protects the xstate buffer.
- In the slow path, can any code reference a stale supervisor state
register between save/restore?
Answer: In the current lazy-restore scheme, any reference to xstate
registers needs fpregs_lock()/fpregs_unlock() and __fpregs_load_activate().
- Are there other options?
One other option is eagerly restoring all supervisor states.
Currently, CET user-mode states and ENQCMD's PASID do not need to be
eagerly restored. The upcoming CET kernel-mode states (24 bytes) need
to be eagerly restored. To me, eagerly restoring all supervisor states
adds more overhead then benefit at this point.
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20200512145444.15483-11-yu-cheng.yu@intel.com
The signal return code is responsible for taking an XSAVE buffer
present in user memory and loading it into the hardware registers. This
operation only affects user XSAVE state and never affects supervisor
state.
The fast path through this code simply points XRSTOR directly at the
user buffer. However, since user memory is not guaranteed to be always
mapped, this XRSTOR can fail. If it fails, the signal return code falls
back to a slow path which can tolerate page faults.
That slow path copies the xfeatures one by one out of the user buffer
into the task's fpu state area. However, by being in a context where it
can handle page faults, the code can also schedule.
The lazy-fpu-load code would think it has an up-to-date fpstate and
would fail to save the supervisor state when scheduling the task out.
When scheduling back in, it would likely restore stale supervisor state.
To fix that, preserve supervisor state before the slow path. Modify
copy_user_to_fpregs_zeroing() so that if it fails, fpregs are not zeroed,
and there is no need for fpregs_deactivate() and supervisor states are
preserved.
Move set_thread_flag(TIF_NEED_FPU_LOAD) to the slow path. Without doing
this, the fast path also needs supervisor states to be saved first.
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200512145444.15483-10-yu-cheng.yu@intel.com
The XSAVES instruction takes a mask and saves only the features specified
in that mask. The kernel normally specifies that all features be saved.
XSAVES also unconditionally uses the "compacted format" which means that
all specified features are saved next to each other in memory. If a
feature is removed from the mask, all the features after it will "move
up" into earlier locations in the buffer.
Introduce copy_supervisor_to_kernel(), which saves only supervisor states
and then moves those states into the standard location where they are
normally found.
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200512145444.15483-9-yu-cheng.yu@intel.com
Move the bpf verifier trace check into the new switch statement in
HEAD.
Resolve the overlapping changes in hinic, where bug fixes overlap
the addition of VF support.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Fix sk_psock reference count leak on receive, from Xiyu Yang.
2) CONFIG_HNS should be invisible, from Geert Uytterhoeven.
3) Don't allow locking route MTUs in ipv6, RFCs actually forbid this,
from Maciej Żenczykowski.
4) ipv4 route redirect backoff wasn't actually enforced, from Paolo
Abeni.
5) Fix netprio cgroup v2 leak, from Zefan Li.
6) Fix infinite loop on rmmod in conntrack, from Florian Westphal.
7) Fix tcp SO_RCVLOWAT hangs, from Eric Dumazet.
8) Various bpf probe handling fixes, from Daniel Borkmann.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (68 commits)
selftests: mptcp: pm: rm the right tmp file
dpaa2-eth: properly handle buffer size restrictions
bpf: Restrict bpf_trace_printk()'s %s usage and add %pks, %pus specifier
bpf: Add bpf_probe_read_{user, kernel}_str() to do_refine_retval_range
bpf: Restrict bpf_probe_read{, str}() only to archs where they work
MAINTAINERS: Mark networking drivers as Maintained.
ipmr: Add lockdep expression to ipmr_for_each_table macro
ipmr: Fix RCU list debugging warning
drivers: net: hamradio: Fix suspicious RCU usage warning in bpqether.c
net: phy: broadcom: fix BCM54XX_SHD_SCR3_TRDDAPD value for BCM54810
tcp: fix error recovery in tcp_zerocopy_receive()
MAINTAINERS: Add Jakub to networking drivers.
MAINTAINERS: another add of Karsten Graul for S390 networking
drivers: ipa: fix typos for ipa_smp2p structure doc
pppoe: only process PADT targeted at local interfaces
selftests/bpf: Enforce returning 0 for fentry/fexit programs
bpf: Enforce returning 0 for fentry/fexit progs
net: stmmac: fix num_por initialization
security: Fix the default value of secid_to_secctx hook
libbpf: Fix register naming in PT_REGS s390 macros
...
The Intel C620 Platform Controller Hub has MROM functions that have non-PCI
registers (undocumented in the public spec) where BAR 0 is supposed to be,
which results in messages like this:
pci 0000:00:11.0: [Firmware Bug]: reg 0x30: invalid BAR (can't size)
Mark these MROM functions as having non-compliant BARs so we don't try to
probe any of them. There are no other BARs on these devices.
See the Intel C620 Series Chipset Platform Controller Hub Datasheet,
May 2019, Document Number 336067-007US, sec 2.1, 35.5, 35.6.
[bhelgaas: commit log, add 0xa26d]
Link: https://lore.kernel.org/r/1589513467-17070-1-git-send-email-lixiaochun.2888@163.com
Signed-off-by: Xiaochun Lee <lixc17@lenovo.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: stable@vger.kernel.org
Bank_num is a one-based count of banks, not a zero-based index. It
overflows the allocated space only when strictly greater than
KVM_MAX_MCE_BANKS.
Fixes: a9e38c3e01 ("KVM: x86: Catch potential overrun in MCE setup")
Signed-off-by: Jue Wang <juew@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Message-Id: <20200511225616.19557-1-jmattson@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Two new stats for exposing halt-polling cpu usage:
halt_poll_success_ns
halt_poll_fail_ns
Thus sum of these 2 stats is the total cpu time spent polling. "success"
means the VCPU polled until a virtual interrupt was delivered. "fail"
means the VCPU had to schedule out (either because the maximum poll time
was reached or it needed to yield the CPU).
To avoid touching every arch's kvm_vcpu_stat struct, only update and
export halt-polling cpu usage stats if we're on x86.
Exporting cpu usage as a u64 and in nanoseconds means we will overflow at
~500 years, which seems reasonably large.
Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Jon Cargille <jcargill@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <20200508182240.68440-1-jcargill@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The hrtimer used to emulate the VMX-preemption timer must be pinned to
the same logical processor as the vCPU thread to be interrupted if we
want to have any hope of adhering to the architectural specification
of the VMX-preemption timer. Even with this change, the emulated
VMX-preemption timer VM-exit occasionally arrives too late.
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Reviewed-by: Oliver Upton <oupton@google.com>
Message-Id: <20200508203643.85477-4-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Prepare for migration of this hrtimer, by changing it from relative to
absolute. (I couldn't get migration to work with a relative timer.)
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Reviewed-by: Oliver Upton <oupton@google.com>
Message-Id: <20200508203643.85477-3-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The PINNED bit is ignored by hrtimer_init. It is only considered when
starting the timer.
When the hrtimer isn't pinned to the same logical processor as the
vCPU thread to be interrupted, the emulated VMX-preemption timer
often fails to adhere to the architectural specification.
Fixes: f15a75eedc ("KVM: nVMX: make emulated nested preemption timer pinned")
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Reviewed-by: Oliver Upton <oupton@google.com>
Message-Id: <20200508203643.85477-2-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove a 'struct kvm_x86_ops' param that got left behind when the nested
ops were moved to their own struct.
Fixes: 33b2217245 ("KVM: x86: move nested-related kvm_x86_ops to a separate struct")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200506204653.14683-1-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This has already been handled in the prior call to svm_clear_vintr().
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Message-Id: <1588771076-73790-5-git-send-email-suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Code clean up and remove unnecessary intercept check for
INTERCEPT_VINTR.
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Message-Id: <1588771076-73790-4-git-send-email-suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch implements a fastpath for the preemption timer vmexit. The vmexit
can be handled quickly so it can be performed with interrupts off and going
back directly to the guest.
Testing on SKX Server.
cyclictest in guest(w/o mwait exposed, adaptive advance lapic timer is default -1):
5540.5ns -> 4602ns 17%
kvm-unit-test/vmexit.flat:
w/o avanced timer:
tscdeadline_immed: 3028.5 -> 2494.75 17.6%
tscdeadline: 5765.7 -> 5285 8.3%
w/ adaptive advance timer default -1:
tscdeadline_immed: 3123.75 -> 2583 17.3%
tscdeadline: 4663.75 -> 4537 2.7%
Tested-by: Haiwei Li <lihaiwei@tencent.com>
Cc: Haiwei Li <lihaiwei@tencent.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1588055009-12677-8-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch implements a fast path for emulation of writes to the TSCDEADLINE
MSR. Besides shortcutting various housekeeping tasks in the vCPU loop,
the fast path can also deliver the timer interrupt directly without going
through KVM_REQ_PENDING_TIMER because it runs in vCPU context.
Tested-by: Haiwei Li <lihaiwei@tencent.com>
Cc: Haiwei Li <lihaiwei@tencent.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1588055009-12677-7-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Replace the ad hoc test in vmx_set_hv_timer with a test in the caller,
start_hv_timer. This test is not Intel-specific and would be duplicated
when introducing the fast path for the TSC deadline MSR.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
While optimizing posted-interrupt delivery especially for the timer
fastpath scenario, I measured kvm_x86_ops.deliver_posted_interrupt()
to introduce substantial latency because the processor has to perform
all vmentry tasks, ack the posted interrupt notification vector,
read the posted-interrupt descriptor etc.
This is not only slow, it is also unnecessary when delivering an
interrupt to the current CPU (as is the case for the LAPIC timer) because
PIR->IRR and IRR->RVI synchronization is already performed on vmentry
Therefore skip kvm_vcpu_trigger_posted_interrupt in this case, and
instead do vmx_sync_pir_to_irr() on the EXIT_FASTPATH_REENTER_GUEST
fastpath as well.
Tested-by: Haiwei Li <lihaiwei@tencent.com>
Cc: Haiwei Li <lihaiwei@tencent.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1588055009-12677-6-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Adds a fastpath_t typedef since enum lines are a bit long, and replace
EXIT_FASTPATH_SKIP_EMUL_INS with two new exit_fastpath_completion enum values.
- EXIT_FASTPATH_EXIT_HANDLED kvm will still go through it's full run loop,
but it would skip invoking the exit handler.
- EXIT_FASTPATH_REENTER_GUEST complete fastpath, guest can be re-entered
without invoking the exit handler or going
back to vcpu_run
Tested-by: Haiwei Li <lihaiwei@tencent.com>
Cc: Haiwei Li <lihaiwei@tencent.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1588055009-12677-4-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Introduce kvm_vcpu_exit_request() helper, we need to check some conditions
before enter guest again immediately, we skip invoking the exit handler and
go through full run loop if complete fastpath but there is stuff preventing
we enter guest again immediately.
Tested-by: Haiwei Li <lihaiwei@tencent.com>
Cc: Haiwei Li <lihaiwei@tencent.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1588055009-12677-5-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use __print_flags() to display the names of VMX flags in VM-Exit traces
and strip the flags when printing the basic exit reason, e.g. so that a
failed VM-Entry due to invalid guest state gets recorded as
"INVALID_STATE FAILED_VMENTRY" instead of "0x80000021".
Opportunstically fix misaligned variables in the kvm_exit and
kvm_nested_vmexit_inject tracepoints.
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200508235348.19427-3-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Introduce generic fastpath handler to handle MSR fastpath, VMX-preemption
timer fastpath etc; move it after vmx_complete_interrupts() in order to
catch events delivered to the guest, and abort the fast path in later
patches. While at it, move the kvm_exit tracepoint so that it is printed
for fastpath vmexits as well.
There is no observed performance effect for the IPI fastpath after this patch.
Tested-by: Haiwei Li <lihaiwei@tencent.com>
Cc: Haiwei Li <lihaiwei@tencent.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <1588055009-12677-2-git-send-email-wanpengli@tencent.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>