It's really not necessary to limit E820_X_MAX to 128 in the non-EFI
case. This commit drops E820_X_MAX's dependency on CONFIG_EFI, so that
E820_X_MAX is always at least slightly larger than E820MAX.
The real motivation behind this is actually to prevent some issues in
the Xen kernel, where the XENMEM_machine_memory_map hypercall can
produce an e820 map larger than 128 entries, even on systems where the
original e820 table was quite a bit smaller than that, depending on how
many IOAPICs are installed on the system.
Signed-off-by: Alex Thorlton <athorlton@sgi.com>
Suggested-by: Ingo Molnar <mingo@redhat.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Juergen Gross <jgross@suse.com>
Loading CR3 as part of emulating vmentry is different from regular CR3 loads,
as implemented in kvm_set_cr3, in several ways.
* different rules are followed to check CR3 and it is desirable for the caller
to distinguish between the possible failures
* PDPTRs are not loaded if PAE paging and nested EPT are both enabled
* many MMU operations are not necessary
This patch introduces nested_vmx_load_cr3 suitable for CR3 loads as part of
nested vmentry and vmexit, and makes use of it on the nested vmentry path.
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
The VMX capability MSRs advertise the set of features the KVM virtual
CPU can support. This set of features varies across different host CPUs
and KVM versions. This patch aims to addresses both sources of
differences, allowing VMs to be migrated across CPUs and KVM versions
without guest-visible changes to these MSRs. Note that cross-KVM-
version migration is only supported from this point forward.
When the VMX capability MSRs are restored, they are audited to check
that the set of features advertised are a subset of what KVM and the
CPU support.
Since the VMX capability MSRs are read-only, they do not need to be on
the default MSR save/restore lists. The userspace hypervisor can set
the values of these MSRs or read them from KVM at VCPU creation time,
and restore the same value after every save/restore.
Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
kvm_skip_emulated_instruction calls both
kvm_x86_ops->skip_emulated_instruction and kvm_vcpu_check_singlestep,
skipping the emulated instruction and generating a trap if necessary.
Replacing skip_emulated_instruction calls with
kvm_skip_emulated_instruction is straightforward, except for:
- ICEBP, which is already inside a trap, so avoid triggering another trap.
- Instructions that can trigger exits to userspace, such as the IO insns,
MOVs to CR8, and HALT. If kvm_skip_emulated_instruction does trigger a
KVM_GUESTDBG_SINGLESTEP exit, and the handling code for
IN/OUT/MOV CR8/HALT also triggers an exit to userspace, the latter will
take precedence. The singlestep will be triggered again on the next
instruction, which is the current behavior.
- Task switch instructions which would require additional handling (e.g.
the task switch bit) and are instead left alone.
- Cases where VMLAUNCH/VMRESUME do not proceed to the next instruction,
which do not trigger singlestep traps as mentioned previously.
Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Once skipping the emulated instruction can potentially trigger an exit to
userspace (via KVM_GUESTDBG_SINGLESTEP) kvm_emulate_cpuid will need to
propagate a return value.
Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
I recently encountered wreckage because access_ok() was used where it
should not be, add an explicit WARN when access_ok() is used wrongly.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJYRIGyAAoJEHm+PkMAQRiG2ksH/jwMUT9j6glbwESxbn1YTqTM
QcBT5AMc7D0wNuidQe0hWZMtG4RbC+4ZhxzZl2wPgA2gueJ+rBnyX7bgtA7ka8ka
Fdc3u/Q1v38HPzf8iBnxcdCs40VgsoMLjFYCXrpOxuGDNKYzRd+Q8aI2TeGvzbyi
X8+6oAWifBwo2oA06jfcuUncEWbyDDyK9aQksmfKOpjHdb26yELPEhsPOlds1g7E
jYLnvUVnU2CoFaumta+rZQ0kzLdc4Ntu0wEao6WzJuQKsgoID+tS/6iudi8cUhDp
YowGAVoOfr6rAJB0mwrDVfugpamaT3386XKyocdNsK0/jR60UIJ8x+WzvvSU+lY=
=JTBj
-----END PGP SIGNATURE-----
Backmerge tag 'v4.9-rc8' into drm-next
Linux 4.9-rc8
Daniel requested this so we could apply some follow on fixes cleanly to -next.
intel_rdt_sched_in() must be called with preemption disabled because the
function accesses percpu variables (pqr_state and closid).
If a task moves itself via move_myself() preemption is enabled, which
violates the calling convention and can result in incorrect closid
selection when the task gets preempted or migrated.
Add the required protection and a comment about the calling convention.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Cc: "Ravi V Shankar" <ravi.v.shankar@intel.com>
Cc: "Tony Luck" <tony.luck@intel.com>
Cc: "Marcelo Tosatti" <mtosatti@redhat.com>
Cc: "Sai Prakhya" <sai.praneeth.prakhya@intel.com>
Cc: "Vikas Shivappa" <vikas.shivappa@linux.intel.com>
Cc: "H. Peter Anvin" <h.peter.anvin@intel.com>
Link: http://lkml.kernel.org/r/1480625714-54246-1-git-send-email-fenghua.yu@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Add the missing return statement to the inline stub
tsc_store_and_check_tsc_adjust() and add the other stubs to make a
SMP=y,TSC=n build happy.
While at it, remove the unused variable from the UP variant of
tsc_store_and_check_tsc_adjust().
Fixes: commit ba75fb646931 ("x86/tsc: Sync test only for the first cpu in a package")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Rename CONFIG_SCHED_ITMT for Intel Turbo Boost Max Technology 3.0
to CONFIG_SCHED_MC_PRIO. This makes the configuration extensible
in future to other architectures that wish to similarly establish
CPU core priorities support in the scheduler.
The description in Kconfig is updated to reflect this change with
added details for better clarity. The configuration is explicitly
default-y, to enable the feature on CPUs that have this feature.
It has no effect on non-TBM3 CPUs.
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bp@suse.de
Cc: jolsa@redhat.com
Cc: linux-acpi@vger.kernel.org
Cc: linux-pm@vger.kernel.org
Cc: rjw@rjwysocki.net
Link: http://lkml.kernel.org/r/2b2ee29d93e3f162922d72d0165a1405864fbb23.1480444902.git.tim.c.chen@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If the TSC_ADJUST MSR is available all CPUs in a package are forced to the
same value. So TSCs cannot be out of sync when the first CPU in the package
was in sync.
That allows to skip the sync test for all CPUs except the first starting
CPU in a package.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Link: http://lkml.kernel.org/r/20161119134017.809901363@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
When entering idle, it's a good oportunity to verify that the TSC_ADJUST
MSR has not been tampered with (BIOS hiding SMM cycles). If tampering is
detected, emit a warning and restore it to the previous value.
This is especially important for machines, which mark the TSC reliable
because there is no watchdog clocksource available (SoCs).
This is not sufficient for HPC (NOHZ_FULL) situations where a CPU never
goes idle, but adding a timer to do the check periodically is not an option
either. On a machine, which has this issue, the check triggeres right
during boot, so there is a decent chance that the sysadmin will notice.
Rate limit the check to once per second and warn only once per cpu.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Link: http://lkml.kernel.org/r/20161119134017.732180441@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The TSC_ADJUST MSR shows whether the TSC has been modified. This is helpful
in a two aspects:
1) It allows to detect BIOS wreckage, where SMM code tries to 'hide' the
cycles spent by storing the TSC value at SMM entry and restoring it at
SMM exit. On affected machines the TSCs run slowly out of sync up to the
point where the clocksource watchdog (if available) detects it.
The TSC_ADJUST MSR allows to detect the TSC modification before that and
eventually restore it. This is also important for SoCs which have no
watchdog clocksource and therefore TSC wreckage cannot be detected and
acted upon.
2) All threads in a package are required to have the same TSC_ADJUST
value. Broken BIOSes break that and as a result the TSC synchronization
check fails.
The TSC_ADJUST MSR allows to detect the deviation when a CPU comes
online. If detected set it to the value of an already online CPU in the
same package. This also allows to reduce the number of sync tests
because with that in place the test is only required for the first CPU
in a package.
In principle all CPUs in a system should have the same TSC_ADJUST value
even across packages, but with physical CPU hotplug this assumption is
not true because the TSC starts with power on, so physical hotplug has
to do some trickery to bring the TSC into sync with already running
packages, which requires to use an TSC_ADJUST value different from CPUs
which got powered earlier.
A final enhancement is the opportunity to compensate for unsynced TSCs
accross nodes at boot time and make the TSC usable that way. It won't
help for TSCs which run apart due to frequency skew between packages,
but this gets detected by the clocksource watchdog later.
The first step toward this is to store the TSC_ADJUST value of a starting
CPU and compare it with the value of an already online CPU in the same
package. If they differ, emit a warning and adjust it to the reference
value. The !SMP version just stores the boot value for later verification.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Link: http://lkml.kernel.org/r/20161119134017.655323776@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch adds xts helpers that use the skcipher interface rather
than blkcipher. This will be used by aesni_intel.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
In x86's include/asm/Kbuild three entries are appended to the genhdr-y make
variable:
genhdr-y += unistd_32.h
genhdr-y += unistd_64.h
genhdr-y += unistd_x32.h
The same entries are also appended to that variable in
include/uapi/asm/Kbuild. So commit:
10b63956fc ("UAPI: Plumb the UAPI Kbuilds into the user header installation and checking")
... removed these three entries from include/asm/Kbuild. But, apparently, some
merge conflict resolution re-added them.
The net effect is, in short, that the genhdr-y make variable contains these
file names twice and, as a consequence, that the corresponding headers get
installed twice. And so the build prints:
INSTALL usr/include/asm/ (65 files)
... while in reality only 62 files are installed in that directory.
Nothing breaks because of all that, but it's a good idea to finally remove
these unneeded entries nevertheless.
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1480077707-2837-1-git-send-email-pebolle@tiscali.nl
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Intel Turbo Boost Max Technology 3.0 (ITMT) feature
allows some cores to be boosted to higher turbo
frequency than others.
Add /proc/sys/kernel/sched_itmt_enabled so operator
can enable/disable scheduling of tasks that favor cores
with higher turbo boost frequency potential.
By default, system that is ITMT capable and single
socket has this feature turned on. It is more likely
to be lightly loaded and operates in Turbo range.
When there is a change in the ITMT scheduling operation
desired, a rebuild of the sched domain is initiated
so the scheduler can set up sched domains with appropriate
flag to enable/disable ITMT scheduling operations.
Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Co-developed-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Cc: linux-pm@vger.kernel.org
Cc: peterz@infradead.org
Cc: jolsa@redhat.com
Cc: rjw@rjwysocki.net
Cc: linux-acpi@vger.kernel.org
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: bp@suse.de
Link: http://lkml.kernel.org/r/07cc62426a28bad57b01ab16bb903a9c84fa5421.1479844244.git.tim.c.chen@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
On platforms supporting Intel Turbo Boost Max Technology 3.0, the maximum
turbo frequencies of some cores in a CPU package may be higher than for
the other cores in the same package. In that case, better performance
(and possibly lower energy consumption as well) can be achieved by
making the scheduler prefer to run tasks on the CPUs with higher max
turbo frequencies.
To that end, set up a core priority metric to abstract the core
preferences based on the maximum turbo frequency. In that metric,
the cores with higher maximum turbo frequencies are higher-priority
than the other cores in the same package and that causes the scheduler
to favor them when making load-balancing decisions using the asymmertic
packing approach. At the same time, the priority of SMT threads with a
higher CPU number is reduced so as to avoid scheduling tasks on all of
the threads that belong to a favored core before all of the other cores
have been given a task to run.
The priority metric will be initialized by the P-state driver with the
help of the sched_set_itmt_core_prio() function. The P-state driver
will also determine whether or not ITMT is supported by the platform
and will call sched_set_itmt_support() to indicate that.
Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Co-developed-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Cc: linux-pm@vger.kernel.org
Cc: peterz@infradead.org
Cc: jolsa@redhat.com
Cc: rjw@rjwysocki.net
Cc: linux-acpi@vger.kernel.org
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: bp@suse.de
Link: http://lkml.kernel.org/r/cd401ccdff88f88c8349314febdc25d51f7c48f7.1479844244.git.tim.c.chen@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Update the I/O interception support to add the kvm_fast_pio_in function
to speed up the in instruction similar to the out instruction.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
AMD hardware adds two additional bits to aid in nested page fault handling.
Bit 32 - NPF occurred while translating the guest's final physical address
Bit 33 - NPF occurred while translating the guest page tables
The guest page tables fault indicator can be used as an aid for nested
virtualization. Using V0 for the host, V1 for the first level guest and
V2 for the second level guest, when both V1 and V2 are using nested paging
there are currently a number of unnecessary instruction emulations. When
V2 is launched shadow paging is used in V1 for the nested tables of V2. As
a result, KVM marks these pages as RO in the host nested page tables. When
V2 exits and we resume V1, these pages are still marked RO.
Every nested walk for a guest page table is treated as a user-level write
access and this causes a lot of NPFs because the V1 page tables are marked
RO in the V0 nested tables. While executing V1, when these NPFs occur KVM
sees a write to a read-only page, emulates the V1 instruction and unprotects
the page (marking it RW). This patch looks for cases where we get a NPF due
to a guest page table walk where the page was marked RO. It immediately
unprotects the page and resumes the guest, leading to far fewer instruction
emulations when nested virtualization is used.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Commit:
90954e7b94 ("x86/coredump: Use pr_reg size, rather that TIF_IA32 flag")
changed the coredumping code to construct the elf coredump file according
to register set size - and that's good: if binary crashes with 32-bit code
selector, generate 32-bit ELF core, otherwise - 64-bit core.
That was made for restoring 32-bit applications on x86_64: we want
32-bit application after restore to generate 32-bit ELF dump on crash.
All was quite good and recently I started reworking 32-bit applications
dumping part of CRIU: now it has two parasites (32 and 64) for seizing
compat/native tasks, after rework it'll have one parasite, working in
64-bit mode, to which 32-bit prologue long-jumps during infection.
And while it has worked for my work machine, in VM with
!CONFIG_X86_X32_ABI during reworking I faced that segfault in 32-bit
binary, that has long-jumped to 64-bit mode results in dereference
of garbage:
32-victim[19266]: segfault at f775ef65 ip 00000000f775ef65 sp 00000000f776aa50 error 14
BUG: unable to handle kernel paging request at ffffffffffffffff
IP: [<ffffffff81332ce0>] strlen+0x0/0x20
[...]
Call Trace:
[] elf_core_dump+0x11a9/0x1480
[] do_coredump+0xa6b/0xe60
[] get_signal+0x1a8/0x5c0
[] do_signal+0x23/0x660
[] exit_to_usermode_loop+0x34/0x65
[] prepare_exit_to_usermode+0x2f/0x40
[] retint_user+0x8/0x10
That's because we have 64-bit registers set (with according total size)
and we're writing it to elf_thread_core_info which has smaller size
on !CONFIG_X86_X32_ABI. That lead to overwriting ELF notes part.
Tested on 32-, 64-bit ELF crashes and on 32-bit binaries that have
jumped with 64-bit code selector - all is readable with gdb.
Signed-off-by: Dmitry Safonov <dsafonov@virtuozzo.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Fixes: 90954e7b94 ("x86/coredump: Use pr_reg size, rather that TIF_IA32 flag")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Intel Xeons from Ivy Bridge onwards support a processor identification
number set in the factory. To the user this is a handy unique number to
identify a particular CPU. Intel can decode this to the fab/production
run to track errors. On systems that have it, include it in the machine
check record. I'm told that this would be helpful for users that run
large data centers with multi-socket servers to keep track of which CPUs
are seeing errors.
Boris:
* Add some clarifying comments and spacing.
* Mask out [63:2] in the disabled-but-not-locked case
* Call the MSR variable "val" for more readability.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: x86-ml <x86@kernel.org>
Link: http://lkml.kernel.org/r/20161123114855.njguoaygp3qnbkia@pd.tnic
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
- Remove VMX_EPT_EXTENT_INDIVIDUAL_ADDR, since there is no such type of
EPT invalidation
- Add missing VPID types names
Signed-off-by: Jan Dakinevich <jan.dakinevich@gmail.com>
Tested-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
The Unified Memory Controllers (UMCs) on Fam17h log a normalized address
in their MCA_ADDR registers. We need to convert that normalized address
to a system physical address in order to support a few facilities:
1) To offline poisoned pages in DRAM proactively in the deferred error
handler.
2) To print sysaddr and page info for DRAM ECC errors in EDAC.
[ Boris: fixes/cleanups ontop:
* hi_addr_offset = 0 - no need for that branch. Stick it all under the
HiAddrOffsetEn case. It confines hi_addr_offset's declaration too.
* Move variables to the innermost scope they're used at so that we save
on stack and not blow it up immediately on function entry.
* Do not modify *sys_addr prematurely - we want to not exit early and
have modified *sys_addr some, which callers get to see. We either
convert to a sys_addr or we don't do anything. And we signal that with
the retval of the function.
* Rename label out -> out_err - because it is the error path.
* No need to pr_err of the conversion failed case: imagine a
sparsely-populated machine with UMCs which don't have DIMMs. Callers
should look at the retval instead and issue a printk only when really
necessary. No need for useless info in dmesg.
* s/temp_reg/tmp/ and other variable names shortening => shorter code.
* Use BIT() everywhere.
* Make error messages more informative.
* Small build fix for the !CONFIG_X86_MCE_AMD case.
* ... and more minor cleanups.
]
Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Aravind Gopalakrishnan <aravindksg.lkml@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20161122111133.mjzpvzhf7o7yl2oa@pd.tnic
[ Typo fixes. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
NMI stack dumps are bracketed by the following tags:
<NMI>
...
<EOE>
The ending tag is kind of confusing if you don't already know what "EOE"
means (end of exception). The same ending tag is also used to mark the
end of all other exceptions' stacks. For example:
<#DF>
...
<EOE>
And similarly, "EOI" is used as the ending tag for interrupts:
<IRQ>
...
<EOI>
Change the tags to be more comprehensible by making them symmetrical and
more XML-esque:
<NMI>
...
</NMI>
<#DF>
...
</#DF>
<IRQ>
...
</IRQ>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/180196e3754572540b595bc56b947d43658979a7.1479491159.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The X86_FEATURE_TSC_RELIABLE flag in Linux kernel implies both reliable
(at runtime) and trustable (at calibration). But reliable running and
trustable calibration independent of each other.
Add a new flag X86_FEATURE_TSC_KNOWN_FREQ, which denotes that the frequency
is known (via MSR/CPUID). This flag is only meant to skip the long term
calibration on systems which have a known frequency.
Add X86_FEATURE_TSC_KNOWN_FREQ to the skip the delayed calibration and
leave X86_FEATURE_TSC_RELIABLE in place.
After converting the existing users of X86_FEATURE_TSC_RELIABLE to use
either both flags or just X86_FEATURE_TSC_KNOWN_FREQ we can seperate the
functionality.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Bin Gao <bin.gao@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1479241644-234277-2-git-send-email-bin.gao@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tvrtko needs
commit b3c11ac267
Author: Eric Engestrom <eric@engestrom.ch>
Date: Sat Nov 12 01:12:56 2016 +0000
drm: move allocation out of drm_get_format_name()
to be able to apply his patches without conflicts.
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
RDPID is a new instruction that reads MSR_TSC_AUX quickly. This
should be considerably faster than reading the GDT. Add a
cpufeature for it and use it from __vdso_getcpu() when available.
Tested-by: Megha Dey <megha.dey@intel.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/4f6c3a22012d10f1c65b9ca15800e01b42c7d39d.1479320367.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
No need to duplicate the same define everywhere. Since
the only user is stop-machine and the only provider is
s390, we can use a default implementation of cpu_relax_yield()
in sched.h.
Suggested-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Russell King <rmk+kernel@armlinux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Noam Camus <noamc@ezchip.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: kvm@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: linux-s390 <linux-s390@vger.kernel.org>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: sparclinux@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: xen-devel@lists.xenproject.org
Link: http://lkml.kernel.org/r/1479298985-191589-1-git-send-email-borntraeger@de.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add a few new AVX512 instruction groups/features for enumeration in
/proc/cpuinfo: AVX512IFMA and AVX512VBMI.
Clear the flags in fpu_xstate_clear_all_cpu_caps().
CPUID.(EAX=7,ECX=0):EBX[bit 21] AVX512IFMA
CPUID.(EAX=7,ECX=0):ECX[bit 1] AVX512VBMI
Detailed information of cpuid bits for the features can be found at
https://bugzilla.kernel.org/show_bug.cgi?id=187891
Signed-off-by: Gayatri Kammela <gayatri.kammela@intel.com>
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Ravi Shankar <ravi.v.shankar@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: mingo@elte.hu
Link: http://lkml.kernel.org/r/1479327060-18668-1-git-send-email-gayatri.kammela@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Some devices on Fam17h can only be accessed through the System Management
Network (SMN). The SMN is accessed by a pair of index/data registers in PCI
config space. Add a pair of functions to read from and write to the SMN.
The Data Fabric on Fam17h allows multiple devices to use the same register
space. The registers of a specific device are accessed indirectly using the
device's DF InstanceId. Currently, we only need to read from these devices,
so only define a read function for now.
Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: x86-ml <x86@kernel.org>
Link: http://lkml.kernel.org/r/1478812257-5424-5-git-send-email-Yazen.Ghannam@amd.com
[ Boris: make __amd_smn_rw() even more compact. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Hide amd_northbridges in amd_nb.c so that external callers will have to
use the exported accessor functions.
Also, fix some checkpatch.pl warnings.
Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: x86-ml <x86@kernel.org>
Link: http://lkml.kernel.org/r/1478812257-5424-2-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Sparse populated CPUID leafs are collected in a software provided leaf to
avoid bloat of the x86_capability array, but there is no way to rebuild the
real leafs (e.g. for KVM CPUID enumeration) other than rereading the CPUID
leaf from the CPU. While this is possible it is problematic as it does not
take software disabled features into account. If a feature is disabled on
the host it should not be exposed to a guest either.
Add get_scattered_cpuid_leaf() which rebuilds the leaf from the scattered
cpuid table information and the active CPU features.
[ tglx: Rewrote changelog ]
Signed-off-by: He Chen <he.chen@linux.intel.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Luwei Kang <luwei.kang@intel.com>
Cc: kvm@vger.kernel.org
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Piotr Luc <Piotr.Luc@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Link: http://lkml.kernel.org/r/1478856336-9388-3-git-send-email-he.chen@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
cpuid_regs is defined multiple times as structure and enum. Rename the enum
and move all of it to processor.h so we don't end up with more instances.
Rename the misnomed register enumeration from CR_* to the obvious CPUID_*.
[ tglx: Rewrote changelog ]
Signed-off-by: He Chen <he.chen@linux.intel.com>
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Luwei Kang <luwei.kang@intel.com>
Cc: kvm@vger.kernel.org
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Piotr Luc <Piotr.Luc@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Link: http://lkml.kernel.org/r/1478856336-9388-2-git-send-email-he.chen@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The per-cpu preempt count of x86 contains two values, the actual preempt
count and the inverted PREEMPT_NEED_RESCHED bit. If a corrupted preempt
count is detected the preempt_count_set() function is used to reset the
preempt count.
In case the inverted PREEMPT_NEED_RESCHED bit is zero at the time of the
reset, the preemption indication is lost. Use raw_cpu_cmpxchg_4() to reset
only the count part and leave the PREEMPT_NEED_RESCHED bit as it is.
This improves the kernel's behavior when it runs into preempt count leaks
and tries to fix them up.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1478523660-733-1-git-send-email-schwidefsky@de.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Make the MSR argument an unsigned int, both low and high u32, put
"notrace" last in the function signature. Reflow function signatures for
better readability and cleanup white space.
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
As there are no users left, we can remove cpu_relax_lowlatency()
implementations from every architecture.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Noam Camus <noamc@ezchip.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: virtualization@lists.linux-foundation.org
Cc: xen-devel@lists.xenproject.org
Cc: <linux-arch@vger.kernel.org>
Link: http://lkml.kernel.org/r/1477386195-32736-6-git-send-email-borntraeger@de.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
For spinning loops people do often use barrier() or cpu_relax().
For most architectures cpu_relax and barrier are the same, but on
some architectures cpu_relax can add some latency.
For example on power,sparc64 and arc, cpu_relax can shift the CPU
towards other hardware threads in an SMT environment.
On s390 cpu_relax does even more, it uses an hypercall to the
hypervisor to give up the timeslice.
In contrast to the SMT yielding this can result in larger latencies.
In some places this latency is unwanted, so another variant
"cpu_relax_lowlatency" was introduced. Before this is used in more
and more places, lets revert the logic and provide a cpu_relax_yield
that can be called in places where yielding is more important than
latency. By default this is the same as cpu_relax on all architectures.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Noam Camus <noamc@ezchip.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: virtualization@lists.linux-foundation.org
Cc: xen-devel@lists.xenproject.org
Link: http://lkml.kernel.org/r/1477386195-32736-2-git-send-email-borntraeger@de.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The threshold_cpu_callback callbacks looks like one of the notifier and
its arguments are almost the same. Split this out and have one ONLINE
and one DEAD callback. This will come handy later once the main code
gets changed to use the callback mechanism.
Also, handle threshold_cpu_callback_online() return value so we don't
continue if the function fails.
Boris Petkov removed the callback pointer and replaced it with proper
functions.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Borislav Petkov <bp@alien8.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: rt@linutronix.de
Cc: linux-edac@vger.kernel.org
Link: http://lkml.kernel.org/r/20161110174447.11848-5-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We already have a macro to invoke boot services which on x86 adapts
automatically to the bitness of the EFI firmware: efi_call_early().
The macro allows sharing of functions across arches and bitness variants
as long as those functions only call boot services. However in practice
functions in the EFI stub contain a mix of boot services calls and
protocol calls.
Add an efi_call_proto() macro for bitness-agnostic protocol calls to
allow sharing more code across arches as well as deduplicating 32 bit
and 64 bit code paths.
On x86, implement it using a new efi_table_attr() macro for bitness-
agnostic table lookups. Refactor efi_call_early() to make use of the
same macro. (The resulting object code remains identical.)
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Andreas Noever <andreas.noever@gmail.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Jones <pjones@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-efi@vger.kernel.org
Link: http://lkml.kernel.org/r/20161112213237.8804-8-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The calculation of the hwid_mcatype value in get_smca_bank_info()
became incorrect after applying the following commit:
1ce9cd7f9f ("x86/RAS: Simplify SMCA HWID descriptor struct")
This causes the function to not match a bank to its type.
Disassembly of hwid_mcatype calculation after change:
db: 8b 45 e0 mov -0x20(%rbp),%eax
de: 41 89 c4 mov %eax,%r12d
e1: 25 00 00 ff 0f and $0xfff0000,%eax
e6: 41 c1 ec 10 shr $0x10,%r12d
ea: 41 09 c4 or %eax,%r12d
Disassembly of hwid_mcatype calculation in original code:
286: 8b 45 d0 mov -0x30(%rbp),%eax
289: 41 89 c5 mov %eax,%r13d
28c: c1 e8 10 shr $0x10,%eax
28f: 41 81 e5 ff 0f 00 00 and $0xfff,%r13d
296: 41 c1 e5 10 shl $0x10,%r13d
29a: 41 09 c5 or %eax,%r13d
Grouping the arguments to the HWID_MCATYPE() macro fixes the issue.
( Boris suggested adding parentheses in the macro. )
Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
Cc: Aravind Gopalakrishnan <aravindksg.lkml@gmail.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-edac@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The following RCU lockdep warning led to adding irq_enter()/irq_exit() into
smp_reschedule_interrupt():
RCU used illegally from idle CPU!
rcu_scheduler_active = 1, debug_locks = 0
RCU used illegally from extended quiescent state!
no locks held by swapper/1/0.
do_trace_write_msr
native_write_msr
native_apic_msr_eoi_write
smp_reschedule_interrupt
reschedule_interrupt
As Peterz pointed out:
| So now we're making a very frequent interrupt slower because of debug
| code.
|
| The thing is, many many smp_reschedule_interrupt() invocations don't
| actually execute anything much at all and are only sent to tickle the
| return to user path (which does the actual preemption).
|
| Having to do the whole irq_enter/irq_exit dance just for this unlikely
| debug case totally blows.
Use the wrmsr_notrace() variant in native_apic_msr_write_eoi, annotate the
kvm variant with notrace and add a native_apic_eoi callback to the apic
structure so KVM guests are covered as well.
This allows to revert the irq_enter/irq_exit dance in
smp_reschedule_interrupt().
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: kvm@vger.kernel.org
Cc: Mike Galbraith <efault@gmx.de>
Cc: Borislav Petkov <bp@alien8.de>
Link: http://lkml.kernel.org/r/1478488420-5982-3-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Required to remove the extra irq_enter()/irq_exit() in
smp_reschedule_interrupt().
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: kvm@vger.kernel.org
Cc: Mike Galbraith <efault@gmx.de>
Link: http://lkml.kernel.org/r/1478488420-5982-2-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Add accessor functions and hide the smca_names array. Also, add a
sanity-check to bank HWID assignment in get_smca_bank_info().
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/20161104152317.5r276t35df53qk76@pd.tnic
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Make it differ more from struct smca_bank_name for better readability.
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: http://lkml.kernel.org/r/20161103125556.15482-3-bp@alien8.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Call it simply smca_hwid and call local variables "hwid". More readable.
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: http://lkml.kernel.org/r/20161103125556.15482-2-bp@alien8.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Call the struct simply smca_bank, it's instance ID can be simply ->id.
Makes the code much more readable.
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: http://lkml.kernel.org/r/20161103125556.15482-1-bp@alien8.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Commit cc7cc02bad ("PCI: Query platform firmware for device power
state") augmented struct pci_platform_pm_ops with a ->get_state hook and
implemented it for acpi_pci_platform_pm, the only pci_platform_pm_ops
existing till v4.7.
However v4.8 introduced another pci_platform_pm_ops for Intel Mobile
Internet Devices with commit 5823d0893e ("x86/platform/intel-mid: Add
Power Management Unit driver"). It is missing the ->get_state hook,
which is fatal since pci_set_platform_pm() enforces its presence. Andy
Shevchenko reports that without the present commit, such a device
"crashes without even a character printed out on serial console and
reboots (since watchdog)".
Retrofit mid_pci_platform_pm with the missing callback to fix the
breakage.
Acked-and-tested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Fixes: cc7cc02bad ("PCI: Query platform firmware for device power state")
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-pci@vger.kernel.org
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: http://lkml.kernel.org/r/7c1567d4c49303a4aada94ba16275cbf56b8976b.1477221514.git.lukas@wunner.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
... to fix this build warning:
In file included from arch/x86/kernel/cpu/intel_rdt.c:33:0:
./arch/x86/include/asm/intel_rdt.h:56:25: warning: ‘struct kernfs_open_file’ declared inside parameter list will not be visible outside of this definition or declaration
int (*seq_show)(struct kernfs_open_file *of,
^~~~~~~~~~~~~~~~
...
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Carrillo-Cisneros <davidcc@google.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <h.peter.anvin@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ravi V Shankar <ravi.v.shankar@intel.com>
Cc: Sai Prakhya <sai.praneeth.prakhya@intel.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Link: http://lkml.kernel.org/r/20161102165117.23545-1-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
during the merge window. The rest are fixes for MIPS, s390 and nested VMX.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJYG2H5AAoJEL/70l94x66DK/cH/0jEQ3ynuLAd5CKux7JxI/EP
msSJh1Xqr4+XhXZnuDpGQWrdsBlxoiqA6PsJrUTtyi4nQCDXlT8g+2MDuvqhWIHz
7vw58j/EMJDCVQzYAbN5VDUfk13uB5aSWTo3M9Rf09v0hU1Ql7z8u4CtKEdLpN5Y
LY9bT9fxUmXO7REKP7bdW6ZrDX/hUShYHgMqzXGFMyGBG3ym3a9bggXEzTCD6eNQ
ioogQIWqg+icdhta0iLNAwFClPlcKB2/xo4IUuNgrPwGoHFGJN/8+qxT4+sVbp2B
v8u1zOXlCFXBcskWE+yRRsGe72+mIzz6QScCyO+5HbhKYVfbE9H7KBlFX9rZZ2c=
=IbKx
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"One NULL pointer dereference, and two fixes for regressions introduced
during the merge window.
The rest are fixes for MIPS, s390 and nested VMX"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
kvm: x86: Check memopp before dereference (CVE-2016-8630)
kvm: nVMX: VMCLEAR an active shadow VMCS after last use
KVM: x86: drop TSC offsetting kvm_x86_ops to fix KVM_GET/SET_CLOCK
KVM: x86: fix wbinvd_dirty_mask use-after-free
kvm/x86: Show WRMSR data is in hex
kvm: nVMX: Fix kernel panics induced by illegal INVEPT/INVVPID types
KVM: document lock orders
KVM: fix OOPS on flush_work
KVM: s390: Fix STHYI buffer alignment for diag224
KVM: MIPS: Precalculate MMIO load resume PC
KVM: MIPS: Make ERET handle ERL before EXL
KVM: MIPS: Fix lazy user ASID regenerate for SMP
The user of page_track might needs extra information, so pass
the kvm_page_track_notifier_node to callbacks.
Signed-off-by: Jike Song <jike.song@intel.com>
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When a memory slot is being moved or removed users of page track
can be notified. So users can drop write-protection for the pages
in that memory slot.
This notifier type is needed by KVMGT to sync up its shadow page
table when memory slot is being moved or removed.
Register the notifier type track_flush_slot to receive memslot move
and remove event.
Reviewed-by: Xiao Guangrong <guangrong.xiao@intel.com>
Signed-off-by: Chen Xiaoguang <xiaoguang.chen@intel.com>
[Squashed commits to avoid bisection breakage and reworded the subject.]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
These are never used by the host, but they can still be reflected to
the guest.
Tested-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When a memory slot is being moved or removed users of page track
can be notified. So users can drop write-protection for the pages
in that memory slot.
This notifier type is needed by KVMGT to sync up its shadow page
table when memory slot is being moved or removed.
Register the notifier type track_flush_slot to receive memslot move
and remove event.
Reviewed-by: Xiao Guangrong <guangrong.xiao@intel.com>
Signed-off-by: Chen Xiaoguang <xiaoguang.chen@intel.com>
[Squashed commits to avoid bisection breakage and reworded the subject.]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Since commit a545ab6a00 ("kvm: x86: add tsc_offset field to struct
kvm_vcpu_arch", 2016-09-07) the offset between host and L1 TSC is
cached and need not be fished out of the VMCS or VMCB. This means
that we can implement adjust_tsc_offset_guest and read_l1_tsc
entirely in generic code. The simplification is particularly
significant for VMX code, where vmx->nested.vmcs01_tsc_offset
was duplicating what is now in vcpu->arch.tsc_offset. Therefore
the vmcs01_tsc_offset can be dropped completely.
More importantly, this fixes KVM_GET_CLOCK/KVM_SET_CLOCK
which, after commit 108b249c45 ("KVM: x86: introduce get_kvmclock_ns",
2016-09-01) called read_l1_tsc while the VMCS was not loaded.
It thus returned bogus values on Intel CPUs.
Fixes: 108b249c45
Reported-by: Roman Kagan <rkagan@virtuozzo.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The kernel doesn't use clts() any more. Remove it and all of its
paravirt infrastructure.
A careful reader may notice that xen_clts() appears to have been
buggy -- it didn't update xen_cr0_value.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm list <kvm@vger.kernel.org>
Link: http://lkml.kernel.org/r/3d3c8ca62f17579b9849a013d71e59a4d5d1b079.1477951965.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It has no callers any more, and it was always a bit confusing, as
there is no STTS instruction.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm list <kvm@vger.kernel.org>
Link: http://lkml.kernel.org/r/04247401710b230849e58bf2112ce4fd0b9840e1.1477951965.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that Linux never sets CR0.TS, lguest doesn't need to support it.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm list <kvm@vger.kernel.org>
Link: http://lkml.kernel.org/r/8a7bf2c11231c082258fd67705d0f275639b8475.1477951965.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that lazy FPU is gone, we don't use CR0.TS (except possibly in
KVM guest mode). Remove irq_ts_save(), irq_ts_restore(), and all of
their callers.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm list <kvm@vger.kernel.org>
Link: http://lkml.kernel.org/r/70b9b9e7ba70659bedcb08aba63d0f9214f338f2.1477951965.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Hook the x86 scheduler code to update closid based on whether the current
task is assigned to a specific closid or running on a CPU assigned to a
specific closid.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Cc: "Ravi V Shankar" <ravi.v.shankar@intel.com>
Cc: "Tony Luck" <tony.luck@intel.com>
Cc: "Shaohua Li" <shli@fb.com>
Cc: "Sai Prakhya" <sai.praneeth.prakhya@intel.com>
Cc: "Peter Zijlstra" <peterz@infradead.org>
Cc: "Stephane Eranian" <eranian@google.com>
Cc: "Dave Hansen" <dave.hansen@intel.com>
Cc: "David Carrillo-Cisneros" <davidcc@google.com>
Cc: "Nilay Vaish" <nilayvaish@gmail.com>
Cc: "Vikas Shivappa" <vikas.shivappa@linux.intel.com>
Cc: "Ingo Molnar" <mingo@elte.hu>
Cc: "Borislav Petkov" <bp@suse.de>
Cc: "H. Peter Anvin" <h.peter.anvin@intel.com>
Link: http://lkml.kernel.org/r/1477692289-37412-10-git-send-email-fenghua.yu@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Last of the per resource group files. Also mode 0644. This one shows
the resources available to the group. Syntax depends on whether the
"cdp" mount option was given. With code/data prioritization disabled
it is simply a list of masks for each cache domain. Initial value
allows access to all of the L3 cache on all domains. E.g. on a 2 socket
Broadwell:
L3:0=fffff;1=fffff
With CDP enabled, separate masks for data and instructions are provided:
L3DATA:0=fffff;1=fffff
L3CODE:0=fffff;1=fffff
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Cc: "Ravi V Shankar" <ravi.v.shankar@intel.com>
Cc: "Shaohua Li" <shli@fb.com>
Cc: "Sai Prakhya" <sai.praneeth.prakhya@intel.com>
Cc: "Peter Zijlstra" <peterz@infradead.org>
Cc: "Stephane Eranian" <eranian@google.com>
Cc: "Dave Hansen" <dave.hansen@intel.com>
Cc: "David Carrillo-Cisneros" <davidcc@google.com>
Cc: "Nilay Vaish" <nilayvaish@gmail.com>
Cc: "Vikas Shivappa" <vikas.shivappa@linux.intel.com>
Cc: "Ingo Molnar" <mingo@elte.hu>
Cc: "Borislav Petkov" <bp@suse.de>
Cc: "H. Peter Anvin" <h.peter.anvin@intel.com>
Link: http://lkml.kernel.org/r/1477692289-37412-9-git-send-email-fenghua.yu@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Now we populate each directory with a read/write (mode 0644) file
named "cpus". This is used to over-ride the resources available
to processes in the default resource group when running on specific
CPUs. Each "cpus" file reads as a cpumask showing which CPUs belong
to this resource group. Initially all online CPUs are assigned to
the default group. They can be added to other groups by writing a
cpumask to the "cpus" file in the directory for the resource group
(which will remove them from the previous group to which they were
assigned). CPU online/offline operations will delete CPUs that go
offline from whatever group they are in and add new CPUs to the
default group.
If there are CPUs assigned to a group when the directory is removed,
they are returned to the default group.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Cc: "Ravi V Shankar" <ravi.v.shankar@intel.com>
Cc: "Shaohua Li" <shli@fb.com>
Cc: "Sai Prakhya" <sai.praneeth.prakhya@intel.com>
Cc: "Peter Zijlstra" <peterz@infradead.org>
Cc: "Stephane Eranian" <eranian@google.com>
Cc: "Dave Hansen" <dave.hansen@intel.com>
Cc: "David Carrillo-Cisneros" <davidcc@google.com>
Cc: "Nilay Vaish" <nilayvaish@gmail.com>
Cc: "Vikas Shivappa" <vikas.shivappa@linux.intel.com>
Cc: "Ingo Molnar" <mingo@elte.hu>
Cc: "Borislav Petkov" <bp@suse.de>
Cc: "H. Peter Anvin" <h.peter.anvin@intel.com>
Link: http://lkml.kernel.org/r/1477692289-37412-7-git-send-email-fenghua.yu@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Resource control groups are represented as directories in the resctrl
file system. The root directory describes the default resources available
to tasks that have not been assigned specific resources. Other directories
can be created at the root level to make new resource groups. It is not
permitted to make directories within other directories.
Hardware uses a CLOSID (Class of service ID) to determine which resource
limits are currently in effect. The exact number available is enumerated
by CPUID leaf 0x10, but on current implementations it is a small number.
We implement a simple bitmask allocator for CLOSIDs.
Each resource control group uses one CLOSID, which limits the total number
of directories that can be created.
Resource groups can be removed using rmdir.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Cc: "Ravi V Shankar" <ravi.v.shankar@intel.com>
Cc: "Tony Luck" <tony.luck@intel.com>
Cc: "Shaohua Li" <shli@fb.com>
Cc: "Sai Prakhya" <sai.praneeth.prakhya@intel.com>
Cc: "Peter Zijlstra" <peterz@infradead.org>
Cc: "Stephane Eranian" <eranian@google.com>
Cc: "Dave Hansen" <dave.hansen@intel.com>
Cc: "David Carrillo-Cisneros" <davidcc@google.com>
Cc: "Nilay Vaish" <nilayvaish@gmail.com>
Cc: "Vikas Shivappa" <vikas.shivappa@linux.intel.com>
Cc: "Ingo Molnar" <mingo@elte.hu>
Cc: "Borislav Petkov" <bp@suse.de>
Cc: "H. Peter Anvin" <h.peter.anvin@intel.com>
Link: http://lkml.kernel.org/r/1477692289-37412-6-git-send-email-fenghua.yu@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Use kernfs as basis for our user interface filesystem. This patch
supports mount/umount, and one mount parameter "cdp" to enable code/data
prioritization (though all we do at this point is ensure that the system
can support CDP). The file system is not populated yet in this patch.
[ tglx: Fixed up a few nits and added cdp handling in case of error ]
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Cc: "Ravi V Shankar" <ravi.v.shankar@intel.com>
Cc: "Tony Luck" <tony.luck@intel.com>
Cc: "Shaohua Li" <shli@fb.com>
Cc: "Sai Prakhya" <sai.praneeth.prakhya@intel.com>
Cc: "Peter Zijlstra" <peterz@infradead.org>
Cc: "Stephane Eranian" <eranian@google.com>
Cc: "Dave Hansen" <dave.hansen@intel.com>
Cc: "David Carrillo-Cisneros" <davidcc@google.com>
Cc: "Nilay Vaish" <nilayvaish@gmail.com>
Cc: "Vikas Shivappa" <vikas.shivappa@linux.intel.com>
Cc: "Ingo Molnar" <mingo@elte.hu>
Cc: "Borislav Petkov" <bp@suse.de>
Cc: "H. Peter Anvin" <h.peter.anvin@intel.com>
Link: http://lkml.kernel.org/r/1477692289-37412-4-git-send-email-fenghua.yu@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We use the cpu hotplug notifier to catch each cpu in turn and look at
its cache topology w.r.t each of the resource groups. As we discover
new resources, we initialize the bitmask array for each to the default
(full access) value.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Cc: "Ravi V Shankar" <ravi.v.shankar@intel.com>
Cc: "Shaohua Li" <shli@fb.com>
Cc: "Sai Prakhya" <sai.praneeth.prakhya@intel.com>
Cc: "Peter Zijlstra" <peterz@infradead.org>
Cc: "Stephane Eranian" <eranian@google.com>
Cc: "Dave Hansen" <dave.hansen@intel.com>
Cc: "David Carrillo-Cisneros" <davidcc@google.com>
Cc: "Nilay Vaish" <nilayvaish@gmail.com>
Cc: "Vikas Shivappa" <vikas.shivappa@linux.intel.com>
Cc: "Ingo Molnar" <mingo@elte.hu>
Cc: "Borislav Petkov" <bp@suse.de>
Cc: "H. Peter Anvin" <h.peter.anvin@intel.com>
Link: http://lkml.kernel.org/r/1477692289-37412-3-git-send-email-fenghua.yu@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
PQR_ASSOC MSR contains the RMID used for preformance monitoring of cache
occupancy and memory bandwidth. The upper 32bit of this MSR contain the
CLOSID for cache allocation. So we need to share the information between
the two facilities.
Move the rdt data structure declaration into the shared header file and
make the per cpu data structure containing the MSR values global.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Cc: "Ravi V Shankar" <ravi.v.shankar@intel.com>
Cc: "Tony Luck" <tony.luck@intel.com>
Cc: "David Carrillo-Cisneros" <davidcc@google.com>
Cc: "Sai Prakhya" <sai.praneeth.prakhya@intel.com>
Cc: "Peter Zijlstra" <peterz@infradead.org>
Cc: "Stephane Eranian" <eranian@google.com>
Cc: "Dave Hansen" <dave.hansen@intel.com>
Cc: "Shaohua Li" <shli@fb.com>
Cc: "Nilay Vaish" <nilayvaish@gmail.com>
Cc: "Vikas Shivappa" <vikas.shivappa@linux.intel.com>
Cc: "Ingo Molnar" <mingo@elte.hu>
Cc: "Borislav Petkov" <bp@suse.de>
Cc: "H. Peter Anvin" <h.peter.anvin@intel.com>
Link: http://lkml.kernel.org/r/1477142405-32078-10-git-send-email-fenghua.yu@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Some Haswell generation CPUs support RDT, but they don't enumerate this via
CPUID. Use rdmsr_safe() and wrmsr_safe() to probe the MSRs on cpu model 63
(INTEL_FAM6_HASWELL_X)
Move the relevant defines into a common header file which is shared between
RDT/CQM and RDT/Allocation to avoid duplication.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Cc: "Ravi V Shankar" <ravi.v.shankar@intel.com>
Cc: "Tony Luck" <tony.luck@intel.com>
Cc: "David Carrillo-Cisneros" <davidcc@google.com>
Cc: "Sai Prakhya" <sai.praneeth.prakhya@intel.com>
Cc: "Peter Zijlstra" <peterz@infradead.org>
Cc: "Stephane Eranian" <eranian@google.com>
Cc: "Dave Hansen" <dave.hansen@intel.com>
Cc: "Shaohua Li" <shli@fb.com>
Cc: "Nilay Vaish" <nilayvaish@gmail.com>
Cc: "Vikas Shivappa" <vikas.shivappa@linux.intel.com>
Cc: "Ingo Molnar" <mingo@elte.hu>
Cc: "Borislav Petkov" <bp@suse.de>
Cc: "H. Peter Anvin" <h.peter.anvin@intel.com>
Link: http://lkml.kernel.org/r/1477142405-32078-8-git-send-email-fenghua.yu@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
A recent change to the mm code in:
87744ab383 mm: fix cache mode tracking in vm_insert_mixed()
started enforcing checking the memory type against the registered list for
amixed pfn insertion mappings. It happens that the drm drivers for a number
of gpus relied on this being broken. Currently the driver only inserted
VRAM mappings into the tracking table when they came from the kernel,
and userspace mappings never landed in the table. This led to a regression
where all the mapping end up as UC instead of WC now.
I've considered a number of solutions but since this needs to be fixed
in fixes and not next, and some of the solutions were going to introduce
overhead that hadn't been there before I didn't consider them viable at
this stage. These mainly concerned hooking into the TTM io reserve APIs,
but these API have a bunch of fast paths I didn't want to unwind to add
this to.
The solution I've decided on is to add a new API like the arch_phys_wc
APIs (these would have worked but wc_del didn't take a range), and
use them from the drivers to add a WC compatible mapping to the table
for all VRAM on those GPUs. This means we can then create userspace
mapping that won't get degraded to UC.
v1.1: use CONFIG_X86_PAT + add some comments in io.h
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: x86@kernel.org
Cc: mcgrof@suse.com
Cc: Dan Williams <dan.j.williams@intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Airlie <airlied@redhat.com>
For mostly historical reasons, the x86 oops dump shows the raw stack
values:
...
[registers]
Stack:
ffff880079af7350 ffff880079905400 0000000000000000 ffffc900008f3ae0
ffffffffa0196610 0000000000000001 00010000ffffffff 0000000087654321
0000000000000002 0000000000000000 0000000000000000 0000000000000000
Call Trace:
...
This seems to be an artifact from long ago, and probably isn't needed
anymore. It generally just adds noise to the dump, and it can be
actively harmful because it leaks kernel addresses.
Linus says:
"The stack dump actually goes back to forever, and it used to be
useful back in 1992 or so. But it used to be useful mainly because
stacks were simpler and we didn't have very good call traces anyway. I
definitely remember having used them - I just do not remember having
used them in the last ten+ years.
Of course, it's still true that if you can trigger an oops, you've
likely already lost the security game, but since the stack dump is so
useless, let's aim to just remove it and make games like the above
harder."
This also removes the related 'kstack=' cmdline option and the
'kstack_depth_to_print' sysctl.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/e83bd50df52d8fe88e94d2566426ae40d813bf8f.1477405374.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Printing kernel text addresses in stack dumps is of questionable value,
especially now that address randomization is becoming common.
It can be a security issue because it leaks kernel addresses. It also
affects the usefulness of the stack dump. Linus says:
"I actually spend time cleaning up commit messages in logs, because
useless data that isn't actually information (random hex numbers) is
actively detrimental.
It makes commit logs less legible.
It also makes it harder to parse dumps.
It's not useful. That makes it actively bad.
I probably look at more oops reports than most people. I have not
found the hex numbers useful for the last five years, because they are
just randomized crap.
The stack content thing just makes code scroll off the screen etc, for
example."
The only real downside to removing these addresses is that they can be
used to disambiguate duplicate symbol names. However such cases are
rare, and the context of the stack dump should be enough to be able to
figure it out.
There's now a 'faddr2line' script which can be used to convert a
function address to a file name and line:
$ ./scripts/faddr2line ~/k/vmlinux write_sysrq_trigger+0x51/0x60
write_sysrq_trigger+0x51/0x60:
write_sysrq_trigger at drivers/tty/sysrq.c:1098
Or gdb can be used:
$ echo "list *write_sysrq_trigger+0x51" |gdb ~/k/vmlinux |grep "is in"
(gdb) 0xffffffff815b5d83 is in driver_probe_device (/home/jpoimboe/git/linux/drivers/base/dd.c:378).
(But note that when there are duplicate symbol names, gdb will only show
the first symbol it finds. faddr2line is recommended over gdb because
it handles duplicates and it also does function size checking.)
Here's an example of what a stack dump looks like after this change:
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: sysrq_handle_crash+0x45/0x80
PGD 36bfa067 [ 29.650644] PUD 7aca3067
Oops: 0002 [#1] PREEMPT SMP
Modules linked in: ...
CPU: 1 PID: 786 Comm: bash Tainted: G E 4.9.0-rc1+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.1-1.fc24 04/01/2014
task: ffff880078582a40 task.stack: ffffc90000ba8000
RIP: 0010:sysrq_handle_crash+0x45/0x80
RSP: 0018:ffffc90000babdc8 EFLAGS: 00010296
RAX: ffff880078582a40 RBX: 0000000000000063 RCX: 0000000000000001
RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000292
RBP: ffffc90000babdc8 R08: 0000000b31866061 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000007 R14: ffffffff81ee8680 R15: 0000000000000000
FS: 00007ffb43869700(0000) GS:ffff88007d400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000007a3e9000 CR4: 00000000001406e0
Stack:
ffffc90000babe00 ffffffff81572d08 ffffffff81572bd5 0000000000000002
0000000000000000 ffff880079606600 00007ffb4386e000 ffffc90000babe20
ffffffff81573201 ffff880036a3fd00 fffffffffffffffb ffffc90000babe40
Call Trace:
__handle_sysrq+0x138/0x220
? __handle_sysrq+0x5/0x220
write_sysrq_trigger+0x51/0x60
proc_reg_write+0x42/0x70
__vfs_write+0x37/0x140
? preempt_count_sub+0xa1/0x100
? __sb_start_write+0xf5/0x210
? vfs_write+0x183/0x1a0
vfs_write+0xb8/0x1a0
SyS_write+0x58/0xc0
entry_SYSCALL_64_fastpath+0x1f/0xc2
RIP: 0033:0x7ffb42f55940
RSP: 002b:00007ffd33bb6b18 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000046 RCX: 00007ffb42f55940
RDX: 0000000000000002 RSI: 00007ffb4386e000 RDI: 0000000000000001
RBP: 0000000000000011 R08: 00007ffb4321ea40 R09: 00007ffb43869700
R10: 00007ffb43869700 R11: 0000000000000246 R12: 0000000000778a10
R13: 00007ffd33bb5c00 R14: 0000000000000007 R15: 0000000000000010
Code: 34 e8 d0 34 bc ff 48 c7 c2 3b 2b 57 81 be 01 00 00 00 48 c7 c7 e0 dd e5 81 e8 a8 55 ba ff c7 05 0e 3f de 00 01 00 00 00 0f ae f8 <c6> 04 25 00 00 00 00 01 5d c3 e8 4c 49 bc ff 84 c0 75 c3 48 c7
RIP: sysrq_handle_crash+0x45/0x80 RSP: ffffc90000babdc8
CR2: 0000000000000000
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/69329cb29b8f324bb5fcea14d61d224807fb6488.1477405374.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Yeah, I know, I know, this is a huuge patch and reviewing it is hard.
Sorry but this is the only way I could think of in which I can rewrite
the microcode patches loading procedure without breaking (knowingly) the
driver.
So maybe this patch is easier to review if one looks at the files after
the patch has been applied instead at the diff. Because then it becomes
pretty obvious:
* The BSP-loading path - load_ucode_bsp() is working independently from
the AP path now and it doesn't save any pointers or patches anymore -
it solely parses the builtin or initrd microcode and applies the patch.
That's it.
This fixes the CONFIG_RANDOMIZE_MEMORY offset fun more solidly.
* The AP-loading path - load_ucode_ap() then goes and scans
builtin/initrd *again* for the microcode patches but it caches them this
time so that we don't have to do that scan on each AP but only once.
This simplifies the code considerably.
Then, when we save the microcode from the initrd/builtin, we go and
add the relevant patches to our own cache. The AMD side did do that
and now the Intel side does it too. So no more pointer copying and
blabla, we save the microcode patches ourselves and are independent from
initrd/builtin.
This whole conversion gives us other benefits like unifying the
initrd parsing into a single function: find_microcode_in_initrd() is
used by both.
The diffstat speaks for itself: 456 insertions(+), 695 deletions(-)
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20161025095522.11964-12-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Its functions are used in intel.c only now, so get rid of it. Make
functions static.
No functionality change.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20161025095522.11964-11-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Make them all static as they're used in a single file now.
No functionality change.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20161025095522.11964-10-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Will be needed in a following patch.
No functionality change.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20161025095522.11964-7-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It will be used by both drivers so move it to core.c.
No functionality change.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20161025095522.11964-6-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move the function declaration to the other #ifdef CONFIG_MICROCODE
together with the other functions.
No functionality change.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20161025095522.11964-5-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Its all generic atomic_long_t stuff now.
Tested-by: Jason Low <jason.low2@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
With frame pointers, when a task is interrupted, its stack is no longer
completely reliable because the function could have been interrupted
before it had a chance to save the previous frame pointer on the stack.
So the caller of the interrupted function could get skipped by a stack
trace.
This is problematic for live patching, which needs to know whether a
stack trace of a sleeping task can be relied upon. There's currently no
way to detect if a sleeping task was interrupted by a page fault
exception or preemption before it went to sleep.
Another issue is that when dumping the stack of an interrupted task, the
unwinder has no way of knowing where the saved pt_regs registers are, so
it can't print them.
This solves those issues by encoding the pt_regs pointer in the frame
pointer on entry from an interrupt or an exception.
This patch also updates the unwinder to be able to decode it, because
otherwise the unwinder would be broken by this change.
Note that this causes a change in the behavior of the unwinder: each
instance of a pt_regs on the stack is now considered a "frame". So
callers of unwind_get_return_address() will now get an occasional
'regs->ip' address that would have previously been skipped over.
Suggested-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/8b9f84a21e39d249049e0547b559ff8da0df0988.1476973742.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The following commit:
c65eacbe29 ("sched/core: Allow putting thread_info into task_struct")
... made 'struct thread_info' a generic struct with only a
single ::flags member, if CONFIG_THREAD_INFO_IN_TASK_STRUCT=y is
selected.
This change however seems to be quite x86 centric, since at least the
generic preemption code (asm-generic/preempt.h) assumes that struct
thread_info also has a preempt_count member, which apparently was not
true for x86.
We could add a bit more #ifdefs to solve this problem too, but it seems
to be much simpler to make struct thread_info arch specific
again. This also makes the conversion to THREAD_INFO_IN_TASK_STRUCT a
bit easier for architectures that have a couple of arch specific stuff
in their thread_info definition.
The arch specific stuff _could_ be moved to thread_struct. However
keeping them in thread_info makes it easier: accessing thread_info
members is simple, since it is at the beginning of the task_struct,
while the thread_struct is at the end. At least on s390 the offsets
needed to access members of the thread_struct (with task_struct as
base) are too large for various asm instructions. This is not a
problem when keeping these members within thread_info.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: keescook@chromium.org
Cc: linux-arch@vger.kernel.org
Link: http://lkml.kernel.org/r/1476901693-8492-2-git-send-email-mark.rutland@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
AVX512_4VNNIW - Vector instructions for deep learning enhanced word
variable precision.
AVX512_4FMAPS - Vector instructions for deep learning floating-point
single precision.
These new instructions are to be used in future Intel Xeon & Xeon Phi
processors. The bits 2&3 of CPUID[level:0x07, EDX] inform that new
instructions are supported by a processor.
The spec can be found in the Intel Software Developer Manual (SDM) or in
the Instruction Set Extensions Programming Reference (ISE).
Define new feature flags to enumerate the new instructions in /proc/cpuinfo
accordingly to CPUID bits and add the required xsave extensions which are
required for proper operation.
Signed-off-by: Piotr Luc <piotr.luc@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20161018150111.29926-1-piotr.luc@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Arnd reported the following objtool warning:
kernel/locking/rwsem.o: warning: objtool: down_write_killable()+0x16: call without frame pointer save/setup
The warning means gcc placed the ____down_write() inline asm (and its
call instruction) before the frame pointer setup in
down_write_killable(), which breaks frame pointer convention and can
result in incorrect stack traces.
Force the stack frame to be created before the call instruction by
listing the stack pointer as an output operand in the inline asm
statement.
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1188b7015f04baf361e59de499ee2d7272c59dce.1476393828.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that eagerfpu= is gone, remove it from the docs and some
comments. Also sync the changes to tools/.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/cf430dd4481d41280e93ac6cf0def1007a67fc8e.1476740397.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>