A later patch will compare them with ACPI tables that get loaded at boot or
runtime and if criteria match, a stored one is loaded.
Signed-off-by: Thomas Renninger <trenn@suse.de>
Link: http://lkml.kernel.org/r/1349043837-22659-4-git-send-email-trenn@suse.de
Cc: Len Brown <lenb@kernel.org>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Eric Piel <eric.piel@tremplin-utc.net>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
This is needed for ACPI table overriding via initrd. Beside reserving
memblocks, X86 also requires to flag the memory area to E820_RESERVED or
E820_ACPI in the e820 mappings to be able to io(re)map it later.
Signed-off-by: Thomas Renninger <trenn@suse.de>
Link: http://lkml.kernel.org/r/1349043837-22659-3-git-send-email-trenn@suse.de
Cc: Len Brown <lenb@kernel.org>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Eric Piel <eric.piel@tremplin-utc.net>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
The ACPI BGRT driver accesses the BIOS logo image when it initializes.
However, ACPI 5.0 (which introduces the BGRT) recommends putting the
logo image in EFI boot services memory, so that the OS can reclaim that
memory. Production systems follow this recommendation, breaking the
ACPI BGRT driver.
Move the bulk of the BGRT code to run during a new EFI late
initialization phase, which occurs after switching EFI to virtual mode,
and after initializing ACPI, but before freeing boot services memory.
Copy the BIOS logo image to kernel memory at that point, and make it
accessible to the BGRT driver. Rework the existing ACPI BGRT driver to
act as a simple wrapper exposing that image (and the properties from the
BGRT) via sysfs.
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Link: http://lkml.kernel.org/r/93ce9f823f1c1f3bb88bdd662cce08eee7a17f5d.1348876882.git.josh@joshtriplett.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
The EFI initialization creates virtual mappings for EFI boot services
memory, so if a driver wants to access EFI boot services memory, it
cannot call ioremap itself; doing so will trip the WARN about mapping
RAM twice. Thus, a driver accessing EFI boot services memory must do so
via the existing mapping already created during EFI intiialization.
Since the EFI code already maintains a memory map for that memory, add a
function efi_lookup_mapped_addr to look up mappings in that memory map.
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Link: http://lkml.kernel.org/r/0eb48ae012797912874919110660ad420b90268b.1348876882.git.josh@joshtriplett.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Conflicts:
drivers/net/team/team.c
drivers/net/usb/qmi_wwan.c
net/batman-adv/bat_iv_ogm.c
net/ipv4/fib_frontend.c
net/ipv4/route.c
net/l2tp/l2tp_netlink.c
The team, fib_frontend, route, and l2tp_netlink conflicts were simply
overlapping changes.
qmi_wwan and bat_iv_ogm were of the "use HEAD" variety.
With help from Antonio Quartulli.
Signed-off-by: David S. Miller <davem@davemloft.net>
As TLB shootdown requests to other CPU cores are now using function call
interrupts, TLB shootdowns entry in /proc/interrupts is always shown as 0.
This behavior change was introduced by commit 52aec3308d ("x86/tlb:
replace INVALIDATE_TLB_VECTOR by CALL_FUNCTION_VECTOR").
This patch reverts TLB shootdowns entry in /proc/interrupts to count TLB
shootdowns separately from the other function call interrupts.
Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com>
Link: http://lkml.kernel.org/r/20120926021128.22212.20440.stgit@hpxw
Acked-by: Alex Shi <alex.shi@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Everyone who selects VIRTIO is also made to select VIRTIO_RING; just make
them synonymous, since we removed the indirection layer some time ago.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Use the mapping of Elf_[SPE]hdr, Elf_Addr, Elf_Sym, Elf_Dyn, Elf_Rel/Rela,
ELF_R_TYPE() and ELF_R_SYM() to either the 32-bit version or the 64-bit version
into asm-generic/module.h for all arches bar MIPS.
Also, use the generic definition mod_arch_specific where possible.
To this end, I've defined three new config bools:
(*) HAVE_MOD_ARCH_SPECIFIC
Arches define this if they don't want to use the empty generic
mod_arch_specific struct.
(*) MODULES_USE_ELF_RELA
Arches define this if their modules can contain RELA records. This causes
the Elf_Rela mapping to be emitted and allows apply_relocate_add() to be
defined by the arch rather than have the core emit an error message.
(*) MODULES_USE_ELF_REL
Arches define this if their modules can contain REL records. This causes
the Elf_Rel mapping to be emitted and allows apply_relocate() to be
defined by the arch rather than have the core emit an error message.
Note that it is possible to allow both REL and RELA records: m68k and mips are
two arches that do this.
With this, some arch asm/module.h files can be deleted entirely and replaced
with a generic-y marker in the arch Kbuild file.
Additionally, I have removed the bits from m32r and score that handle the
unsupported type of relocation record as that's now handled centrally.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The userspace part of UML uses the asm-offsets.h generator mechanism to
create definitions for UM_KERN_<LEVEL> that match the in-kernel
KERN_<LEVEL> constant definitions.
As of commit 04d2c8c83d ("printk: convert
the format for KERN_<LEVEL> to a 2 byte pattern"), KERN_<LEVEL> is no
longer expanded to the literal '"<LEVEL>"', but to '"\001" "LEVEL"', i.e.
it contains two parts.
However, the combo of DEFINE_STR() in
arch/x86/um/shared/sysdep/kernel-offsets.h and sed-y in Kbuild doesn't
support string literals consisting of multiple parts. Hence for all
UM_KERN_<LEVEL> definitions, only the SOH character is retained in the actual
definition, while the remainder ends up in the comment. E.g. in
include/generated/asm-offsets.h we get
#define UM_KERN_INFO "\001" /* "6" KERN_INFO */
instead of
#define UM_KERN_INFO "\001" "6" /* KERN_INFO */
This causes spurious '^A' output in some kernel messages:
Calibrating delay loop... 4640.76 BogoMIPS (lpj=23203840)
pid_max: default: 32768 minimum: 301
Mount-cache hash table entries: 256
^AChecking that host ptys support output SIGIO...Yes
^AChecking that host ptys support SIGIO on close...No, enabling workaround
^AUsing 2.6 host AIO
NET: Registered protocol family 16
bio: create slab <bio-0> at 0
Switching to clocksource itimer
To fix this:
- Move the mapping from UM_KERN_<LEVEL> to KERN_<LEVEL> from
arch/um/include/shared/common-offsets.h to
arch/um/include/shared/user.h, which is preincluded for all userspace
parts,
- Preinclude include/linux/kern_levels.h for all userspace parts, to
obtain the in-kernel KERN_<LEVEL> constant definitions. This doesn't
violate the kernel/userspace separation, as include/linux/kern_levels.h
is self-contained and doesn't expose any other kernel internals.
- Remove the now unused STR() and DEFINE_STR() macros.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Richard Weinberger <richard@nod.at>
commit c1d7e01d (ipc: use Kconfig options for __ARCH_WANT_[COMPAT_]IPC_PARSE_VERSION)
forgot UML and broke IPC on it.
Also UML has to select ARCH_WANT_IPC_PARSE_VERSION usin Kconfig.
Reported-and-tested-by: <Toralf Förster toralf.foerster@gmx.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
The ACPI spec doesn't provide for a way for the bios to pass down
recommended thresholds to the OS on a _per-bank_ basis. This patch adds
a new boot option, which if passed, tells Linux to use CMCI thresholds
set by the bios.
As fail-safe, we initialize threshold to 1 if some banks have not been
initialized by the bios and warn the user.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
There is no fundamental reason why we should switch SMEP and SMAP on
during early cpu initialization just to switch them off again. Now
with %eflags and %cr4 forced to be initialized to a clean state, we
only need the one-way enable. Also, make the functions inline to make
them (somewhat) harder to abuse.
This does mean that SMEP and SMAP do not get initialized anywhere near
as early. Even using early_param() instead of __setup() doesn't give
us control early enough to do this during the early cpu initialization
phase. This seems reasonable to me, because SMEP and SMAP should not
matter until we have userspace to protect ourselves from, but it does
potentially make it possible for a bug involving a "leak of
permissions to userspace" to get uncaught.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
we only use that to tell copy_thread() done by syscall from that
done by kernel_thread(). However, it's easier to do simply by
checking PF_KTHREAD in thread flags.
Merge sys_clone() guts for 32bit and 64bit, while we are at it...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
There were reports of users destroying their Fedora installs by a kernel
tarball that replaces the /lib -> /usr/lib symlink. Let's remove the
toplevel directories from the tarball to prevent this from happening.
Reported-by: Andi Kleen <andi@firstfloor.org>
Suggested-by: Ben Hutchings <ben@decadent.org.uk>
Cc: <stable@kernel.org>
Signed-off-by: Michal Marek <mmarek@suse.cz>
cast5/avx incorrectly XORs new IV over old IV at end of CBC encryption
function when it should store. This causes CBC encryption to give
incorrect output on multi-page encryption requests.
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
We already have a flag word to indicate the existence of MISC_ENABLES,
so use the same flag word to indicate existence of cr4 and EFER, and
always restore them if they exist. That way if something passes a
nonzero value when the value *should* be zero, we will still
initialize it.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Link: http://lkml.kernel.org/r/1348529239-17943-1-git-send-email-hpa@linux.intel.com
%cr4 is supposed to reflect a set of features into which the operating
system is opting in. If the BIOS or bootloader leaks bits here, this
is not desirable. Consider a bootloader passing in %cr4.pae set to a
legacy paging kernel, for example -- it will not have any immediate
effect, but the kernel would crash when turning paging on.
A similar argument applies to %eflags, and since we have to look for
%eflags.id being settable we can use a sequence which clears %eflags
as a side effect.
Note that we already do this for x86-64.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Link: http://lkml.kernel.org/r/1348529239-17943-1-git-send-email-hpa@linux.intel.com
* 'xenarm-for-linus' of git://xenbits.xen.org/people/sstabellini/linux-pvhvm:
arm: introduce a DTS for Xen unprivileged virtual machines
MAINTAINERS: add myself as Xen ARM maintainer
xen/arm: compile netback
xen/arm: compile blkfront and blkback
xen/arm: implement alloc/free_xenballooned_pages with alloc_pages/kfree
xen/arm: receive Xen events on ARM
xen/arm: initialize grant_table on ARM
xen/arm: get privilege status
xen/arm: introduce CONFIG_XEN on ARM
xen: do not compile manage, balloon, pci, acpi, pcpu and cpu_hotplug on ARM
xen/arm: Introduce xen_ulong_t for unsigned long
xen/arm: Xen detection and shared_info page mapping
docs: Xen ARM DT bindings
xen/arm: empty implementation of grant_table arch specific functions
xen/arm: sync_bitops
xen/arm: page.h definitions
xen/arm: hypercalls
arm: initial Xen support
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
do_notify_resume() may be called on irq or exception
exit. But at that time the exception has already called
rcu_user_enter() and the irq has already called rcu_irq_exit().
Since it can use RCU read side critical section, we must call
rcu_user_exit() before doing anything there. Then we must call
back rcu_user_enter() after this function because we know we are
going to userspace from there.
This complete support for userspace RCU extended quiescent state
in x86-64.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
This way we can exit the RCU extended quiescent state before
we schedule a new task from irq/exception exit.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Add necessary hooks to x86 exception for userspace
RCU extended quiescent state support.
This includes traps, page fault, debug exceptions, etc...
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There is some unnatural label based layout in this function.
Convert the unnecessary goto to readable conditional blocks.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Add syscall slow path hooks to notify syscall entry
and exit on CPUs that want to support userspace RCU
extended quiescent state.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Cleanup the label maze in this function. Having a
seperate function to first handle the traps that don't
generate a signal makes it easier to convert into
more readable conditional paths.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1348577479-2564-1-git-send-email-fweisbec@gmail.com
[ Fixed 32-bit build failure. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
GAS in binutils(2.16.91) could not parse parentheses within
macro parameters unless fully parenthesized, and this is a
workaround to make old gas work without generating below errors:
arch/x86/kernel/entry_64.S: Assembler messages:
arch/x86/kernel/entry_64.S:387: Error: too many positional arguments
arch/x86/kernel/entry_64.S:389: Error: too many positional arguments
[...]
Signed-off-by: Tao Guo <glorioustao@gmail.com>
Reluctantly-Acked-by: Jan Beulich <jbeulich@novell.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1348648102-12653-1-git-send-email-glorioustao@gmail.com
[ Jan argues that these old GAS versions are fragile - which is so, but lets give them a chance. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 31d2092eb0 ("x86: move
mp_register_lapic_address to boot.c") renamed mp_register_lapic
to acpi_register_lapic. But mp_register_lapic remains in a
comment. So the patch rename it.
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Len Brown <lenb@kernel.org>
Link: http://lkml.kernel.org/r/50625239.3050403@jp.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
With SMAP, the [f][x]rstor_checking() functions are no longer usable
for user-space pointers by applying a simple __force cast. Instead,
create new [f][x]rstor_user() functions which do the proper SMAP
magic.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1343171129-2747-3-git-send-email-suresh.b.siddha@intel.com
There is no known reason for this option to be unavailable on other
archs than x86. They just need to call enable_sched_clock_irqtime()
if they have a sufficiently finegrained clock to make it working.
Move it to the general option and let the user choose between
it and pure tick based or virtual cputime accounting.
Note that virtual cputime accounting already performs a finegrained
irqtime accounting. CONFIG_IRQ_TIME_ACCOUNTING is a kind of middle ground
between tick and virtual based accounting. So CONFIG_IRQ_TIME_ACCOUNTING
and CONFIG_VIRT_CPU_ACCOUNTING are mutually exclusive choices.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
and it can cause bootup crashes on certain AMD machines.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQEcBAABAgAGBQJQYGDkAAoJEFjIrFwIi8fJrR0H/2iRt3nH3KeG6S2l2UaSvJuH
BqtUSFFtMxKwAc9WC8gx9lc6y9HFig1SThzWuWJulNpF50QnBp38+OuMzEespoUN
JLJtIp/jlYFZL5w2K7soXq7e0elbWTainPWvz5qJE7RifcnclAOGGrf/0LEVf/FQ
xCjn9MLDq5WzbkwuA7TPMDb9RSD3ZJThMShj82kziwpTaniJCpl4YY0lVYUfknXo
t2NTV6Ze6mkzU5QvxSA2ZJt89vsFXQNpsvUK0WzZfLmzugXqnqQHstaVti64ovZc
e64PW61PIRhaZMCDuieclMXWbXFjkp4AsEQJLv3K/kojG20xJ/X08nmolnl64vw=
=t4AN
-----END PGP SIGNATURE-----
Merge tag 'stable/for-linus-3.6-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen
Pull a Xen fix from Konrad Rzeszutek Wilk:
"It is a bug-fix when we run the initial PV guest on a AMD K8 machine
and have CONFIG_AMD_NUMA enabled and detect the NUMA topology from the
Northbridge.
We end up in the situation where the initial domain gets too much
information and gets confused and crashes - the fix is to restrict the
domain to get the information - and we do it by just disabling NUMA on
the PV guest (the hypervisor is still able to do its proper NUMA
allocations of guests).
It is OK to disable the PV guest from accessing NUMA data as right now
we do not inject any NUMA node information to the PV guests. When we
do get to that point, then this patch will have to be reverted."
* Disable PV NUMA support as we do not do anything with it (yet) and it
can cause bootup crashes on certain AMD machines.
* tag 'stable/for-linus-3.6-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
xen/boot: Disable NUMA for PV guests.
This patch is a follow-up for patch "filter: add XOR instruction for use
with X/K" that implements BPF x86 JIT parts for the BPF XOR operation.
Signed-off-by: Daniel Borkmann <daniel.borkmann@tik.ee.ethz.ch>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
. Convert the trace builtins to use the growing evsel/evlist
tracepoint infrastructure, removing several open coded constructs
like switch like series of strcmp to dispatch events, etc.
Basically what had already been showcased in 'perf sched'.
. Add evsel constructor for tracepoints, that uses libtraceevent
just to parse the /format events file, use it in a new 'perf test'
to make sure the libtraceevent format parsing regressions can
be more readily caught.
. Some strange errors were happening in some builds, but not on the
next, reported by several people, problem was some parser related
files, generated during the build, didn't had proper make deps,
fix from Eric Sandeen.
. Fix some compiling errors on 32-bit, from Feng Tang.
. Don't use sscanf extension %as, not available on bionic, reimplementation
by Irina Tirdea.
. Fix bfd.h/libbfd detection with recent binutils, from Markus Trippelsdorf.
. Introduce struct and cache information about the environment where a
perf.data file was captured, from Namhyung Kim.
. Fix several error paths in libtraceevent, from Namhyung Kim.
Print event causing perf_event_open() to fail in 'perf record',
from Stephane Eranian.
. New 'kvm' analysis tool, from Xiao Guangrong.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)
iQIcBAABAgAGBQJQYIGnAAoJENZQFvNTUqpApNIP/2yUe3kfZ6Sz5rVwVdxyDBQN
Y0sv6IwtT31G6WA8o8fvIKylMiTn0XMt5TG3e57ebw01lHH/xQAkQs7EmpmGclIT
qOE5a1JwD8eyTMhBL92dt9LUHEq103pW/oBS1d6TeQ9XyEh+HThv+wVT/yRKneu6
IHV7b/8fC++v80bpsWLr8S3+p9OXMclsjR7IfDCeJnicxyHb/yWfRVbQiBwGk5/v
IfKc0uat60+1qAy7GwBM/nVDJqzekrPI76krNP8tVdftxBjatpXS+VBKiYfo352g
nr3Gpg2MWEY2oRzkN6jCj5yQAVTxa2OrAYY/I8dSJWkTHXL7UbzNzatNpVpvS0we
0wFP5gEumyuYE1YwjsR14ICSOJdMaO0pBYO4YMnqXsXGiS0JEM+2o+2bL2Nl9EDz
LEGWQGWVC23Tu3P1SaHgdb2YnQLZ18AVBwBljESrLvCf4leC/RWoA3HvJ4NOhook
08ee24r57SLLe7z8W3fBPRsslpmoZnhnLHES/N/qf2y5u4Ig9IkafMS2ARVSwgPX
6u0uVIf48YQGAlv8p3cNsYa2PITB3R04Nm5GB2KvTxplwuRD1jYqPpR4BaKrXe+s
SBVIxgDygglY+husTYLBWMBYswWhu5ECVobdfxtrQbQPt+v9hxf5S8UhAbu3AVcV
IocrCPg3REvuxPwwPmY9
=FgCY
-----END PGP SIGNATURE-----
Merge tag 'perf-core-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core
Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo:
* Convert the trace builtins to use the growing evsel/evlist
tracepoint infrastructure, removing several open coded constructs
like switch like series of strcmp to dispatch events, etc.
Basically what had already been showcased in 'perf sched'.
* Add evsel constructor for tracepoints, that uses libtraceevent
just to parse the /format events file, use it in a new 'perf test'
to make sure the libtraceevent format parsing regressions can
be more readily caught.
* Some strange errors were happening in some builds, but not on the
next, reported by several people, problem was some parser related
files, generated during the build, didn't had proper make deps,
fix from Eric Sandeen.
* Fix some compiling errors on 32-bit, from Feng Tang.
* Don't use sscanf extension %as, not available on bionic, reimplementation
by Irina Tirdea.
* Fix bfd.h/libbfd detection with recent binutils, from Markus Trippelsdorf.
* Introduce struct and cache information about the environment where a
perf.data file was captured, from Namhyung Kim.
* Fix several error paths in libtraceevent, from Namhyung Kim.
Print event causing perf_event_open() to fail in 'perf record',
from Stephane Eranian.
* New 'kvm' analysis tool, from Xiao Guangrong.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Switch x86_64 to using sub-ns precise vsyscall
Cc: Tony Luck <tony.luck@intel.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
To help migrate archtectures over to the new update_vsyscall method,
redfine CONFIG_GENERIC_TIME_VSYSCALL as CONFIG_GENERIC_TIME_VSYSCALL_OLD
Cc: Tony Luck <tony.luck@intel.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Since users will need to include timekeeper_internal.h, move
update_vsyscall definitions to timekeeper_internal.h.
Cc: Tony Luck <tony.luck@intel.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
CLOCK_TICK_RATE is used to accurately caclulate exactly how
a tick will be at a given HZ.
This is useful, because while we'd expect NSEC_PER_SEC/HZ,
the underlying hardware will have some granularity limit,
so we won't be able to have exactly HZ ticks per second.
This slight error can cause timekeeping quality problems
when using the jiffies or other jiffies driven clocksources.
Thus we currently use compile time CLOCK_TICK_RATE value to
generate SHIFTED_HZ and NSEC_PER_JIFFIES, which we then use
to adjust the jiffies clocksource to correct this error.
Unfortunately though, since CLOCK_TICK_RATE is a compile
time value, and the jiffies clocksource is registered very
early during boot, there are a number of cases where there
are different possible hardware timers that have different
tick rates. This causes problems in cases like ARM where
there are numerous different types of hardware, each having
their own compile-time CLOCK_TICK_RATE, making it hard to
accurately support different hardware with a single kernel.
For the most part, this doesn't matter all that much, as not
too many systems actually utilize the jiffies or jiffies driven
clocksource. Usually there are other highres clocksources
who's granularity error is negligable.
Even so, we have some complicated calcualtions that we do
everywhere to handle these edge cases.
This patch removes the compile time SHIFTED_HZ value, and
introduces a register_refined_jiffies() function. This results
in the default jiffies clock as being assumed a perfect HZ
freq, and allows archtectures that care about jiffies accuracy
to call register_refined_jiffies() with the tick rate, specified
dynamically at boot.
This allows us, where necessary, to not have a compile time
CLOCK_TICK_RATE constant, simplifies the jiffies code, and
still provides a way to have an accurate jiffies clock.
NOTE: Since this patch does not add register_refinied_jiffies()
calls for every arch, it may cause time quality regressions
in some cases. Its likely these will not be noticable, but
if they are an issue, adding the following to the end of
setup_arch() should resolve the regression:
register_refinied_jiffies(CLOCK_TICK_RATE)
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
In order to add xen EFI frambebuffer video support, it is required to add
xen-efi's new video type (XEN_VGATYPE_EFI_LFB) case and handle it in the
function xen_init_vga and set the video type to VIDEO_TYPE_EFI to enable
efi video mode.
The original patch from which this was broken out from:
http://marc.info/?i=4E099AA6020000780004A4C6@nat28.tlf.novell.com
Signed-off-by: Jan Beulich <JBeulich@novell.com>
Signed-off-by: Tang Liang <liang.tang@oracle.com>
[v2: The original author is Jan Beulich and Liang Tang ported it to upstream]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
The xen c/s 25873 allows the hypervisor to retrieve the NUMLOCK flag.
With this patch, the Linux kernel can get the state according to the
data in the BIOS.
Acked-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
The hypervisor is in charge of allocating the proper "NUMA" memory
and dealing with the CPU scheduler to keep them bound to the proper
NUMA node. The PV guests (and PVHVM) have no inkling of where they
run and do not need to know that right now. In the future we will
need to inject NUMA configuration data (if a guest spans two or more
NUMA nodes) so that the kernel can make the right choices. But those
patches are not yet present.
In the meantime, disable the NUMA capability in the PV guest, which
also fixes a bootup issue. Andre says:
"we see Dom0 crashes due to the kernel detecting the NUMA topology not
by ACPI, but directly from the northbridge (CONFIG_AMD_NUMA).
This will detect the actual NUMA config of the physical machine, but
will crash about the mismatch with Dom0's virtual memory. Variation of
the theme: Dom0 sees what it's not supposed to see.
This happens with the said config option enabled and on a machine where
this scanning is still enabled (K8 and Fam10h, not Bulldozer class)
We have this dump then:
NUMA: Warning: node ids are out of bound, from=-1 to=-1 distance=10
Scanning NUMA topology in Northbridge 24
Number of physical nodes 4
Node 0 MemBase 0000000000000000 Limit 0000000040000000
Node 1 MemBase 0000000040000000 Limit 0000000138000000
Node 2 MemBase 0000000138000000 Limit 00000001f8000000
Node 3 MemBase 00000001f8000000 Limit 0000000238000000
Initmem setup node 0 0000000000000000-0000000040000000
NODE_DATA [000000003ffd9000 - 000000003fffffff]
Initmem setup node 1 0000000040000000-0000000138000000
NODE_DATA [0000000137fd9000 - 0000000137ffffff]
Initmem setup node 2 0000000138000000-00000001f8000000
NODE_DATA [00000001f095e000 - 00000001f0984fff]
Initmem setup node 3 00000001f8000000-0000000238000000
Cannot find 159744 bytes in node 3
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff81d220e6>] __alloc_bootmem_node+0x43/0x96
Pid: 0, comm: swapper Not tainted 3.3.6 #1 AMD Dinar/Dinar
RIP: e030:[<ffffffff81d220e6>] [<ffffffff81d220e6>] __alloc_bootmem_node+0x43/0x96
.. snip..
[<ffffffff81d23024>] sparse_early_usemaps_alloc_node+0x64/0x178
[<ffffffff81d23348>] sparse_init+0xe4/0x25a
[<ffffffff81d16840>] paging_init+0x13/0x22
[<ffffffff81d07fbb>] setup_arch+0x9c6/0xa9b
[<ffffffff81683954>] ? printk+0x3c/0x3e
[<ffffffff81d01a38>] start_kernel+0xe5/0x468
[<ffffffff81d012cf>] x86_64_start_reservations+0xba/0xc1
[<ffffffff81007153>] ? xen_setup_runstate_info+0x2c/0x36
[<ffffffff81d050ee>] xen_start_kernel+0x565/0x56c
"
so we just disable NUMA scanning by setting numa_off=1.
CC: stable@vger.kernel.org
Reported-and-Tested-by: Andre Przywara <andre.przywara@amd.com>
Acked-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Pull kbuild fixes from Michal Marek:
"There are two more kbuild fixes for 3.6.
One fixes a race between x86's archscripts target and the rule
(re)building scripts/basic/fixdep. The second is a fix for the
previous attempt at fixing make firmware_install with make 3.82.
This new solution should work with any version of GNU make"
* 'rc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
x86/kbuild: archscripts depends on scripts_basic
firmware: fix directory creation rule matching with make 3.80
If arch/x86/kernel/cpuid.c is a module, a CPU might offline or online
between the for_each_online_cpu() loop and the call to
register_hotcpu_notifier in cpuid_init or the call to
unregister_hotcpu_notifier in cpuid_exit. The potential races can
lead to leaks/duplicates, attempts to destroy non-existant devices, or
random pointer dereferences.
For example, in cpuid_exit if:
for_each_online_cpu(cpu)
cpuid_device_destroy(cpu);
class_destroy(cpuid_class);
__unregister_chrdev(CPUID_MAJOR, 0, NR_CPUS, "cpu/cpuid");
<----- CPU onlines
unregister_hotcpu_notifier(&cpuid_class_cpu_notifier);
the hotcpu notifier will attempt to create a device for the
cpuid_class, which the module already destroyed.
This fix surrounds for_each_online_cpu and register_hotcpu_notifier or
unregister_hotcpu_notifier with get_online_cpus+put_online_cpus.
Tested on a VM.
Signed-off-by: Silas Boyd-Wickizer <sbw@mit.edu>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
If arch/x86/kernel/msr.c is a module, a CPU might offline or online
between the for_each_online_cpu(i) loop and the call to
register_hotcpu_notifier in msr_init or the call to
unregister_hotcpu_notifier in msr_exit. The potential races can lead
to leaks/duplicates, attempts to destroy non-existant devices, or
random pointer dereferences.
For example, in msr_init if:
for_each_online_cpu(i) {
err = msr_device_create(i);
if (err != 0)
goto out_class;
}
<----- CPU offlines
register_hotcpu_notifier(&msr_class_cpu_notifier);
and the CPU never onlines before msr_exit, then the module will never
call msr_device_destroy for the associated CPU.
This fix surrounds for_each_online_cpu and register_hotcpu_notifier or
unregister_hotcpu_notifier with get_online_cpus+put_online_cpus.
Tested on a VM.
Signed-off-by: Silas Boyd-Wickizer <sbw@mit.edu>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
If we reset a vcpu on INIT, we so far overwrote dr7 as provided by
KVM_SET_GUEST_DEBUG, and we also cleared switch_db_regs unconditionally.
Fix this by saving the dr7 used for guest debugging and calculating the
effective register value as well as switch_db_regs on any potential
change. This will change to focus of the set_guest_debug vendor op to
update_dp_bp_intercept.
Found while trying to stop on start_secondary.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
To emulate level triggered interrupts, add a resample option to
KVM_IRQFD. When specified, a new resamplefd is provided that notifies
the user when the irqchip has been resampled by the VM. This may, for
instance, indicate an EOI. Also in this mode, posting of an interrupt
through an irqfd only asserts the interrupt. On resampling, the
interrupt is automatically de-asserted prior to user notification.
This enables level triggered interrupts to be posted and re-enabled
from vfio with no userspace intervention.
All resampling irqfds can make use of a single irq source ID, so we
reserve a new one for this interface.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
* stable/late-swiotlb.v3.3:
xen/swiotlb: Fix compile warnings when using plain integer instead of NULL pointer.
xen/swiotlb: Remove functions not needed anymore.
xen/pcifront: Use Xen-SWIOTLB when initting if required.
xen/swiotlb: For early initialization, return zero on success.
xen/swiotlb: Use the swiotlb_late_init_with_tbl to init Xen-SWIOTLB late when PV PCI is used.
xen/swiotlb: Move the error strings to its own function.
xen/swiotlb: Move the nr_tbl determination in its own function.
swiotlb: add the late swiotlb initialization function with iotlb memory
xen/swiotlb: With more than 4GB on 64-bit, disable the native SWIOTLB.
xen/swiotlb: Simplify the logic.
Conflicts:
arch/x86/xen/pci-swiotlb-xen.c
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reason for merge:
x86/fpu changed the structure of some of the code that x86/smap
changes; mostly fpu-internal.h but also minor changes to the
signal code.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Resolved Conflicts:
arch/x86/ia32/ia32_signal.c
arch/x86/include/asm/fpu-internal.h
arch/x86/kernel/signal.c
Preemption is disabled between kernel_fpu_begin/end() and as such
it is not a good idea to use these routines in kvm_load/put_guest_fpu()
which can be very far apart.
kvm_load/put_guest_fpu() routines are already called with
preemption disabled and KVM already uses the preempt notifier to save
the guest fpu state using kvm_put_guest_fpu().
So introduce __kernel_fpu_begin/end() routines which don't touch
preemption and use them instead of kernel_fpu_begin/end()
for KVM's use model of saving/restoring guest FPU state.
Also with this change (and with eagerFPU model), fix the host cr0.TS vm-exit
state in the case of VMX. For eagerFPU case, host cr0.TS is always clear.
So no need to worry about it. For the traditional lazyFPU restore case,
change the cr0.TS bit for the host state during vm-exit to be always clear
and cr0.TS bit is set in the __vmx_load_host_state() when the FPU
(guest FPU or the host task's FPU) state is not active. This ensures
that the host/guest FPU state is properly saved, restored
during context-switch and with interrupts (using irq_fpu_usable()) not
stomping on the active FPU state.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1348164109.26695.338.camel@sbsiddha-desk.sc.intel.com
Cc: Avi Kivity <avi@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Pull x86 fixes from Ingo Molnar:
"Small fixlets"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm/init.c: Fix devmem_is_allowed() off by one
x86/kconfig: Remove outdated reference to Intel CPUs in CONFIG_SWIOTLB
The changes to entry_32.S got missed in checkin:
63bcff2a x86, smap: Add STAC and CLAC instructions to control user space access
The resulting kernel was largely functional but SMAP protection could
have been bypassed.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Link: http://lkml.kernel.org/r/1348256595-29119-9-git-send-email-hpa@linux.intel.com
Use kzalloc() so the struct resource doesn't contain garbage in
fields we don't initialize.
[bhelgaas: changelog]
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: x86@kernel.org
Signal handling contains a bunch of accesses to individual user space
items, which causes an excessive number of STAC and CLAC
instructions. Instead, let get/put_user_try ... get/put_user_catch()
contain the STAC and CLAC instructions.
This means that get/put_user_try no longer nests, and furthermore that
it is no longer legal to use user space access functions other than
__get/put_user_ex() inside those blocks. However, these macros are
x86-specific anyway and are only used in the signal-handling paths; a
simple reordering of moving the larger subroutine calls out of the
try...catch blocks resolves that problem.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Link: http://lkml.kernel.org/r/1348256595-29119-12-git-send-email-hpa@linux.intel.com
When Supervisor Mode Access Prevention (SMAP) is enabled, access to
userspace from the kernel is controlled by the AC flag. To make the
performance of manipulating that flag acceptable, there are two new
instructions, STAC and CLAC, to set and clear it.
This patch adds those instructions, via alternative(), when the SMAP
feature is enabled. It also adds X86_EFLAGS_AC unconditionally to the
SYSCALL entry mask; there is simply no reason to make that one
conditional.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Link: http://lkml.kernel.org/r/1348256595-29119-9-git-send-email-hpa@linux.intel.com
The STAC/CLAC instructions are only available with SMAP, but on the
other hand they aren't needed if SMAP is not available, or before we
start to run userspace, so construct them as alternatives which start
out as noops and are enabled by the alternatives mechanism.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Link: http://lkml.kernel.org/r/1348256595-29119-7-git-send-email-hpa@linux.intel.com
* Fix M2P batching re-using the incorrect structure field.
* Disable BIOS SMP MP table search.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQEcBAABAgAGBQJQXGfdAAoJEFjIrFwIi8fJbWcH/0FI2d/VyB+ZU0ng3R0Oa7mt
iR/x+Z+mfFdp2dXS6gs6DgJIZVA7i2K9pX4rOXjpDGGGyUeo1xoqjlQfsFWQGjZ/
p49RrDrM93c2GdRXk3iMSWfboQI7BXBs5rnyYZQL7kMxUSR75MxbeONvhPrMSO9I
3EBidWH08qjrn2HVF44F6xh5ONjpclo5AvGIzJ0eU4X0D0eqMnhvlAw8/UYJU2HV
heRvuxWF9l2jNpLhKhZy1730D1X/vKA5qKAcBW8rCOpEijyPpmtKbqapeUJg/9pH
NVquuwGutP5ozrSi7a/23+L+ezvQBmCPm5ZRG44PccBoZ/HVs8haT8UypSWSDzo=
=TwvM
-----END PGP SIGNATURE-----
Merge tag 'stable/for-linus-3.6-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen
Pull Xen bug-fixes from Konrad Rzeszutek Wilk:
- Fix M2P batching re-using the incorrect structure field.
In v3.5 we added batching for M2P override (Machine Frame Number ->
Physical Frame Number), but the original MFN was saved in an
incorrect structure - and we would oops/restore when restoring with
the old MFN.
- Disable BIOS SMP MP table search.
A bootup issue that we had ignored until we found that on DL380 G6 it
was needed.
* tag 'stable/for-linus-3.6-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
xen/boot: Disable BIOS SMP MP table search.
xen/m2p: do not reuse kmap_op->dev_bus_addr
Exporting KVM exit information to userspace to be consumed by perf.
Signed-off-by: Dong Hao <haodong@linux.vnet.ibm.com>
[ Dong Hao <haodong@linux.vnet.ibm.com>: rebase it on acme's git tree ]
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Acked-by: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: kvm@vger.kernel.org
Cc: Runzhen Wang <runzhen@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1347870675-31495-2-git-send-email-haodong@linux.vnet.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
While building the SUSE kernel packages, which build the scripts,
make clean, and then build everything, we have been running into spurious
build failures. We tracked them down to a simple dependency issue:
$ make mrproper
CLEAN arch/x86/tools
CLEAN scripts/basic
$ cp patches/config/x86_64/desktop .config
$ make archscripts
HOSTCC arch/x86/tools/relocs
/bin/sh: scripts/basic/fixdep: No such file or directory
make[3]: *** [arch/x86/tools/relocs] Error 1
make[2]: *** [archscripts] Error 2
make[1]: *** [sub-make] Error 2
make: *** [all] Error 2
This was introduced by commit
6520fe55 (x86, realmode: 16-bit real-mode code support for relocs),
which added the archscripts dependency to archprepare.
This patch adds the scripts_basic dependency to the x86 archscripts.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Michal Marek <mmarek@suse.cz>
we only use that to tell copy_thread() done by syscall from that
done by kernel_thread(). However, it's easier to do simply by
checking PF_KTHREAD in thread flags.
Merge sys_clone() guts for 32bit and 64bit, while we are at it...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
TIF_NOTIFY_RESUME will work in precisely the same way; all that
is achieved by TIF_IRET is appearing that there's some work to be
done, so we end up on the iret exit path. Just use NOTIFY_RESUME.
And for execve() do that in 32bit start_thread(), not sys_execve()
itself.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
I get this warning:
arch/x86/kernel/kprobes.c:544:23: warning: ‘skip_singlestep’ declared ‘static’ but never defined
on tip/auto-latest.
Put the skip_singlestep function declaration up, in
KPROBES_CAN_USE_FTRACE and drop the superfluous forward
declaration.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/1348145034-16603-1-git-send-email-bp@amd64.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Most interrupt are delivered to only one vcpu. Use pre-build tables to
find interrupt destination instead of looping through all vcpus. In case
of logical mode loop only through vcpus in a logical cluster irq is sent
to.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
'ac' essentially reconstructs the 'access' variable we already
have, except for the PFERR_PRESENT_MASK and PFERR_RSVD_MASK. As
these are not used by callees, just use 'access' directly.
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Keep track of accessed/dirty bits; if they are all set, do not
enter the accessed/dirty update loop.
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
'eperm' is no longer used in the walker loop, so we can eliminate it.
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Instead of branchy code depending on level, gpte.ps, and mmu configuration,
prepare everything in a bitmap during mode changes and look it up during
runtime.
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
The page table walk is coded as an infinite loop, with a special
case on the last pte.
Code it as an ordinary loop with a termination condition on the last
pte (large page or walk length exhausted), and put the last pte handling
code after the loop where it belongs.
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
walk_addr_generic() permission checks are a maze of branchy code, which is
performed four times per lookup. It depends on the type of access, efer.nxe,
cr0.wp, cr4.smep, and in the near future, cr4.smap.
Optimize this away by precalculating all variants and storing them in a
bitmap. The bitmap is recalculated when rarely-changing variables change
(cr0, cr4) and is indexed by the often-changing variables (page fault error
code, pte access permissions).
The permission check is moved to the end of the loop, otherwise an SMEP
fault could be reported as a false positive, when PDE.U=1 but PTE.U=0.
Noted by Xiao Guangrong.
The result is short, branch-free code.
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
While unspecified, the behaviour of Intel processors is to first
perform the page table walk, then, if the walk was successful, to
atomically update the accessed and dirty bits of walked paging elements.
While we are not required to follow this exactly, doing so will allow us
to perform the access permissions check after the walk is complete, rather
than after each walk step.
(the tricky case is SMEP: a zero in any pte's U bit makes the referenced
page a supervisor page, so we can't fault on a one bit during the walk
itself).
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
We no longer rely on paging_tmpl.h defines; so we can move the function
to mmu.c.
Rely on zero extension to 64 bits to get the correct nx behaviour.
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
If nx is disabled, then is gpte[63] is set we will hit a reserved
bit set fault before checking permissions; so we can ignore the
setting of efer.nxe.
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
gpte_access() computes the access permissions of a guest pte and also
write-protects clean gptes. This is wrong when we are servicing a
write fault (since we'll be setting the dirty bit momentarily) but
correct when instantiating a speculative spte, or when servicing a
read fault (since we'll want to trap a following write in order to
set the dirty bit).
It doesn't seem to hurt in practice, but in order to make the code
readable, push the write protection out of gpte_access() and into
a new protect_clean_gpte() which is called explicitly when needed.
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
As the initial domain we are able to search/map certain regions
of memory to harvest configuration data. For all low-level we
use ACPI tables - for interrupts we use exclusively ACPI _PRT
(so DSDT) and MADT for INT_SRC_OVR.
The SMP MP table is not used at all. As a matter of fact we do
not even support machines that only have SMP MP but no ACPI tables.
Lets follow how Moorestown does it and just disable searching
for BIOS SMP tables.
This also fixes an issue on HP Proliant BL680c G5 and DL380 G6:
9f->100 for 1:1 PTE
Freeing 9f-100 pfn range: 97 pages freed
1-1 mapping on 9f->100
.. snip..
e820: BIOS-provided physical RAM map:
Xen: [mem 0x0000000000000000-0x000000000009efff] usable
Xen: [mem 0x000000000009f400-0x00000000000fffff] reserved
Xen: [mem 0x0000000000100000-0x00000000cfd1dfff] usable
.. snip..
Scan for SMP in [mem 0x00000000-0x000003ff]
Scan for SMP in [mem 0x0009fc00-0x0009ffff]
Scan for SMP in [mem 0x000f0000-0x000fffff]
found SMP MP-table at [mem 0x000f4fa0-0x000f4faf] mapped at [ffff8800000f4fa0]
(XEN) mm.c:908:d0 Error getting mfn 100 (pfn 5555555555555555) from L1 entry 0000000000100461 for l1e_owner=0, pg_owner=0
(XEN) mm.c:4995:d0 ptwr_emulate: could not get_page_from_l1e()
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff81ac07e2>] xen_set_pte_init+0x66/0x71
. snip..
Pid: 0, comm: swapper Not tainted 3.6.0-rc6upstream-00188-gb6fb969-dirty #2 HP ProLiant BL680c G5
.. snip..
Call Trace:
[<ffffffff81ad31c6>] __early_ioremap+0x18a/0x248
[<ffffffff81624731>] ? printk+0x48/0x4a
[<ffffffff81ad32ac>] early_ioremap+0x13/0x15
[<ffffffff81acc140>] get_mpc_size+0x2f/0x67
[<ffffffff81acc284>] smp_scan_config+0x10c/0x136
[<ffffffff81acc2e4>] default_find_smp_config+0x36/0x5a
[<ffffffff81ac3085>] setup_arch+0x5b3/0xb5b
[<ffffffff81624731>] ? printk+0x48/0x4a
[<ffffffff81abca7f>] start_kernel+0x90/0x390
[<ffffffff81abc356>] x86_64_start_reservations+0x131/0x136
[<ffffffff81abfa83>] xen_start_kernel+0x65f/0x661
(XEN) Domain 0 crashed: 'noreboot' set - not rebooting.
which is that ioremap would end up mapping 0xff using _PAGE_IOMAP
(which is what early_ioremap sticks as a flag) - which meant
we would get MFN 0xFF (pte ff461, which is OK), and then it would
also map 0x100 (b/c ioremap tries to get page aligned request, and
it was trying to map 0xf4fa0 + PAGE_SIZE - so it mapped the next page)
as _PAGE_IOMAP. Since 0x100 is actually a RAM page, and the _PAGE_IOMAP
bypasses the P2M lookup we would happily set the PTE to 1000461.
Xen would deny the request since we do not have access to the
Machine Frame Number (MFN) of 0x100. The P2M[0x100] is for example
0x80140.
CC: stable@vger.kernel.org
Fixes-Oracle-Bugzilla: https://bugzilla.oracle.com/bugzilla/show_bug.cgi?id=13665
Acked-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
list_for_each_entry_reverse() dereferences the iterator, but we already
freed it. I don't see a reason that this has to be done in reverse order
so change it to use list_for_each_entry_safe().
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
This patch updates the existing Intel IvyBridge (model 58)
support with proper PEBS event constraints. It cannot reuse
the same as SandyBridge because some events (0xd3) are
specific to IvyBridge.
Also there is no UOPS_DISPATCHED.THREAD on IVB, so do not
populate the PERF_COUNT_HW_STALLED_CYCLES_BACKEND mapping.
Signed-off-by: Stephane Eranian <eranian@google.com>
Cc: peterz@infradead.org
Cc: ak@linux.intel.com
Link: http://lkml.kernel.org/r/20120910230701.GA5898@quad
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Make "REP BSF" unconditional, as per the suggestion of hpa
and Linus, this removes the insane BSF_PREFIX conditional
and simplifies the logic.
Suggested-by: "H. Peter Anvin" <hpa@zytor.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Link: http://lkml.kernel.org/r/5058741E020000780009C014@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When acting on a user bug report, we find ourselves constantly
asking for /proc/cpuinfo in order to know the exact family,
model, stepping of the CPU in question.
Instead of having to ask this, add this to dmesg so that it is
visible and no ambiguities can ensue from looking at the
official name string of the CPU coming from CPUID and trying
to map it to f/m/s.
Output then looks like this:
[ 0.146041] smpboot: CPU0: AMD FX(tm)-8100 Eight-Core Processor (fam: 15, model: 01, stepping: 02)
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Link: http://lkml.kernel.org/r/1347640666-13638-1-git-send-email-bp@amd64.org
[ tweaked it minimally to add commas. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The test should be >= ARRAY_SIZE() instead of > ARRAY_SIZE().
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: http://lkml.kernel.org/r/20120905123126.GC6128@elgon.mountain
Signed-off-by: Ingo Molnar <mingo@kernel.org>
where CONFIG_X86_MCE is disabled and this is a bad idea so turning it on
by default makes sense to me. The second one is a trivial cleanup.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJQV2BDAAoJEBLB8Bhh3lVKbfcP/AkfNP5rbK5P5bkoD8Y4jc0r
XFfTd//96sjN9SvcZyO/dVnO3S8600AJXfjUs+Amd9horaUOdzf1X8xfOKmvupbL
MqG9WKOaIdgCdXfXvh6S2vNySNErybLWLaxpm8n/7xjQhumVzo1TU41mMzcU5Ctt
OmhGD2hjF7DXSVDIirDs8XBGyr3YbVRN1Dj/FQDZ1rjOEjAMNsQnZVthFdhHOmfN
YKaGdP+WbI38Jm3RGEDi3mcnv63A4jL6G4IRDEnsFg45ull1HjTN1tF6lz8tbwDc
M36jZRM3sqMYDk5P6a9qtnuNTe51Sl9HkbA9Ym0+5hrWua/c16Z6Bt5+NgoDO61O
LlCU0FZNtmJujqC2oUdVWovpr0z1gnZwlbcvVIPEKVnln5iZC5PkPSxZg7Xn6K7e
1mMpWA6y9vsF32qxZuD2XAF32/muVgz0zFbVQRgKH6YS064PB2HWgr9ZU0XtVLW9
9Tfrafqg0CQO/o74oAnbjjg87O4C5p53elpSa/4+4Hwn9E2M+6lKjVyAWNzKv2r3
+POT4O1IgOE7f2WnPsR1w1F3P00e65xE3IC0JnZyRYh1P8P1FEnVWklGxi/y+dNN
qVzNDXE5ZL+fKEs2WbP9gr6lTDB0riGc93p9iycDQKVpVq5mePe34CZk/SEZzoGV
CH/lZMXMrkYqef5wNaAT
=ywsW
-----END PGP SIGNATURE-----
Merge tag 'ras_queue_for_3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras into x86/mce
Pull MCE changes from Borislav Petkov:
" Patch 1/2 which enables MCA by default because I still see bugreports
where CONFIG_X86_MCE is disabled and this is a bad idea so turning it on
by default makes sense to me. The second one is a trivial cleanup. "
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Most architectures implement this in exactly the same way. Instead of
having each architecture duplicate this function, provide a single
implementation in the core and make it a weak symbol so that it can be
overridden on architectures where it is required.
Signed-off-by: Thierry Reding <thierry.reding@avionic-design.de>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Remove the __init annotations in order to keep pci_fixup_irqs() around
after init (e.g. for hotplug). This requires the same change for the
implementation of pcibios_update_irq() on all architectures. While at
it, all __devinit annotations are removed as well, since they will be
useless now that HOTPLUG is always on.
Signed-off-by: Thierry Reding <thierry.reding@avionic-design.de>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
CPUs with FXSAVE but no XMM/MXCSR (Pentium II from Intel,
Crusoe/TM-3xxx/5xxx from Transmeta, and presumably some of the K6
generation from AMD) ever looked at the mxcsr field during
fxrstor/fxsave. So remove the cpu_has_xmm check in the fx_finit()
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1347300665-6209-6-git-send-email-suresh.b.siddha@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Add the "eagerfpu=auto" (that selects the default scheme in
enabling eagerfpu) which can override compiled-in boot parameters
like "eagerfpu=on/off" (that force enable/disable eagerfpu).
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1347300665-6209-5-git-send-email-suresh.b.siddha@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
xsaveopt/xrstor support optimized state save/restore by tracking the
INIT state and MODIFIED state during context-switch.
Enable eagerfpu by default for processors supporting xsaveopt.
Can be disabled by passing "eagerfpu=off" boot parameter.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1347300665-6209-3-git-send-email-suresh.b.siddha@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Decouple non-lazy/eager fpu restore policy from the existence of the xsave
feature. Introduce a synthetic CPUID flag to represent the eagerfpu
policy. "eagerfpu=on" boot paramter will enable the policy.
Requested-by: H. Peter Anvin <hpa@zytor.com>
Requested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1347300665-6209-2-git-send-email-suresh.b.siddha@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Fundamental model of the current Linux kernel is to lazily init and
restore FPU instead of restoring the task state during context switch.
This changes that fundamental lazy model to the non-lazy model for
the processors supporting xsave feature.
Reasons driving this model change are:
i. Newer processors support optimized state save/restore using xsaveopt and
xrstor by tracking the INIT state and MODIFIED state during context-switch.
This is faster than modifying the cr0.TS bit which has serializing semantics.
ii. Newer glibc versions use SSE for some of the optimized copy/clear routines.
With certain workloads (like boot, kernel-compilation etc), application
completes its work with in the first 5 task switches, thus taking upto 5 #DNA
traps with the kernel not getting a chance to apply the above mentioned
pre-load heuristic.
iii. Some xstate features (like AMD's LWP feature) don't honor the cr0.TS bit
and thus will not work correctly in the presence of lazy restore. Non-lazy
state restore is needed for enabling such features.
Some data on a two socket SNB system:
* Saved 20K DNA exceptions during boot on a two socket SNB system.
* Saved 50K DNA exceptions during kernel-compilation workload.
* Improved throughput of the AVX based checksumming function inside the
kernel by ~15% as xsave/xrstor is faster than the serializing clts/stts
pair.
Also now kernel_fpu_begin/end() relies on the patched
alternative instructions. So move check_fpu() which uses the
kernel_fpu_begin/end() after alternative_instructions().
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1345842782-24175-7-git-send-email-suresh.b.siddha@intel.com
Merge 32-bit boot fix from,
Link: http://lkml.kernel.org/r/1347300665-6209-4-git-send-email-suresh.b.siddha@intel.com
Cc: Jim Kukunas <james.t.kukunas@linux.intel.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Avi Kivity <avi@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
use kernel_fpu_begin/end() instead of unconditionally accessing cr0 and
saving/restoring just the few used xmm/ymm registers.
This has some advantages like:
* If the task's FPU state is already active, then kernel_fpu_begin()
will just save the user-state and avoiding the read/write of cr0.
In general, cr0 accesses are much slower.
* Manual save/restore of xmm/ymm registers will affect the 'modified' and
the 'init' optimizations brought in the by xsaveopt/xrstor
infrastructure.
* Foward compatibility with future vector register extensions will be a
problem if the xmm/ymm registers are manually saved and restored
(corrupting the extended state of those vector registers).
With this patch, there was no significant difference in the xor throughput
using AVX, measured during boot.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1345842782-24175-5-git-send-email-suresh.b.siddha@intel.com
Cc: Jim Kukunas <james.t.kukunas@linux.intel.com>
Cc: NeilBrown <neilb@suse.de>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
kvm's guest fpu save/restore should be wrapped around
kernel_fpu_begin/end(). This will avoid for example taking a DNA
in kvm_load_guest_fpu() when it tries to load the fpu immediately
after doing unlazy_fpu() on the host side.
More importantly this will prevent the host process fpu from being
corrupted.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1345842782-24175-4-git-send-email-suresh.b.siddha@intel.com
Cc: Avi Kivity <avi@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Few lines below we do drop_fpu() which is more safer. Remove the
unnecessary user_fpu_end() in save_xstate_sig(), which allows
the drop_fpu() to ignore any pending exceptions from the user-space
and drop the current fpu.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1345842782-24175-3-git-send-email-suresh.b.siddha@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
No need to save the state with unlazy_fpu(), that is about to get overwritten
by the state from the signal frame. Instead use drop_fpu() and continue
to restore the new state.
Also fold the stop_fpu_preload() into drop_fpu().
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1345842782-24175-2-git-send-email-suresh.b.siddha@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Currently for x86 and x86_32 binaries, fpstate in the user sigframe is copied
to/from the fpstate in the task struct.
And in the case of signal delivery for x86_64 binaries, if the fpstate is live
in the CPU registers, then the live state is copied directly to the user
sigframe. Otherwise fpstate in the task struct is copied to the user sigframe.
During restore, fpstate in the user sigframe is restored directly to the live
CPU registers.
Historically, different code paths led to different bugs. For example,
x86_64 code path was not preemption safe till recently. Also there is lot
of code duplication for support of new features like xsave etc.
Unify signal handling code paths for x86 and x86_64 kernels.
New strategy is as follows:
Signal delivery: Both for 32/64-bit frames, align the core math frame area to
64bytes as needed by xsave (this where the main fpu/extended state gets copied
to and excludes the legacy compatibility fsave header for the 32-bit [f]xsave
frames). If the state is live, copy the register state directly to the user
frame. If not live, copy the state in the thread struct to the user frame. And
for 32-bit [f]xsave frames, construct the fsave header separately before
the actual [f]xsave area.
Signal return: As the 32-bit frames with [f]xstate has an additional
'fsave' header, copy everything back from the user sigframe to the
fpstate in the task structure and reconstruct the fxstate from the 'fsave'
header (Also user passed pointers may not be correctly aligned for
any attempt to directly restore any partial state). At the next fpstate usage,
everything will be restored to the live CPU registers.
For all the 64-bit frames and the 32-bit fsave frame, restore the state from
the user sigframe directly to the live CPU registers. 64-bit signals always
restored the math frame directly, so we can expect the math frame pointer
to be correctly aligned. For 32-bit fsave frames, there are no alignment
requirements, so we can restore the state directly.
"lat_sig catch" microbenchmark numbers (for x86, x86_64, x86_32 binaries) are
with in the noise range with this change.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1343171129-2747-4-git-send-email-suresh.b.siddha@intel.com
[ Merged in compilation fix ]
Link: http://lkml.kernel.org/r/1344544736.8326.17.camel@sbsiddha-desk.sc.intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
* pm-cpufreq:
cpufreq: OMAP: remove loops_per_jiffy recalculate for smp
sections: fix section conflicts in drivers/cpufreq
cpufreq: conservative: update frequency when limits are relaxed
cpufreq / ondemand: update frequency when limits are relaxed
cpufreq: Add a generic cpufreq-cpu0 driver
PM / OPP: Initialize OPP table from device tree
ARM: add cpufreq transiton notifier to adjust loops_per_jiffy for smp
cpufreq: Remove support for hardware P-state chips from powernow-k8
acpi-cpufreq: Add compatibility for legacy AMD cpb sysfs knob
acpi-cpufreq: Add support for disabling dynamic overclocking
ACPI: Add fixups for AMD P-state figures
powernow-k8: delay info messages until initialization has succeeded
cpufreq: Add warning message to powernow-k8
acpi-cpufreq: Add quirk to disable _PSD usage on all AMD CPUs
acpi-cpufreq: Add support for modern AMD CPUs
cpufreq / powernow-k8: Fixup missing _PSS objects message
PM / cpufreq: Initialise the cpu field during conservative governor start
MCA is the basic support for hardware error logging and reporting, and
it is majorly unwise to run without it so enable machine check software
support by default on x86.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Acked-by: Tony Luck <tony.luck@intel.com>
arch/x86/xen/pci-swiotlb-xen.c:96:1: warning: Using plain integer as NULL pointer
arch/x86/xen/pci-swiotlb-xen.c:96:1: warning: Using plain integer as NULL pointer
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
With this patch we provide the functionality to initialize the
Xen-SWIOTLB late in the bootup cycle - specifically for
Xen PCI-frontend. We still will work if the user had
supplied 'iommu=soft' on the Linux command line.
Note: We cannot depend on after_bootmem to automatically
determine whether this is early or not. This is because
when PCI IOMMUs are initialized it is after after_bootmem but
before a lot of "other" subsystems are initialized.
CC: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
[v1: Fix smatch warnings]
[v2: Added check for xen_swiotlb]
[v3: Rebased with new xen-swiotlb changes]
[v4: squashed xen/swiotlb: Depending on after_bootmem is not correct in]
Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
vcpu mutex can be held for unlimited time so
taking it with mutex_lock on an ioctl is wrong:
one process could be passed a vcpu fd and
call this ioctl on the vcpu used by another process,
it will then be unkillable until the owner exits.
Call mutex_lock_killable instead and return status.
Note: mutex_lock_interruptible would be even nicer,
but I am not sure all users are prepared to handle EINTR
from these ioctls. They might misinterpret it as an error.
Cleanup paths expect a vcpu that can't be used by
any userspace so this will always succeed - catch bugs
by calling BUG_ON.
Catch callers that don't check return state by adding
__must_check.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
This patch adds a cpumask file to the uncore pmu sysfs directory. The
cpumask file contains one active cpu for every socket.
Signed-off-by: "Yan, Zheng" <zheng.z.yan@intel.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Cc: "Yan, Zheng" <zheng.z.yan@intel.com>
Link: http://lkml.kernel.org/r/1347263631-23175-2-git-send-email-zheng.z.yan@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Use macros for bitness-insensitive register names, instead of
rolling our own.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Use macros for bitness-insensitive register names, instead of
rolling our own.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
LTO (link-time optimization) doesn't like local labels to be referred to
from a different function, since the two functions may be built in separate
compilation units. Use an external variable instead.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Seth Forshee reported that his system was reporting that the EFI framebuffer
stretched from 0x90010000-0xb0010000 despite the GPU's BAR only covering
0x90000000-0x9ffffff. It's safer to calculate this value from the pixel
stride and screen height (values we already depend on) rather than face
potential problems with resource allocation later on.
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Tested-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
The majority of the DMI checks in efifb are for cases where the bootloader
has provided invalid information. However, on some machines the overrides
may do more harm than good due to configuration differences between machines
with the same machine identifier. It turns out that it's possible for the
bootloader to get the correct information on GOP-based systems, but we
can't guarantee that the kernel's being booted with one that's been updated
to do so. Add support for a capabilities flag that can be set by the
bootloader, and skip the DMI checks in that case. Additionally, set this
flag in the UEFI stub code.
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Acked-by: Peter Jones <pjones@redhat.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
A value of efi.runtime_version is checked before calling
update_capsule()/query_variable_info() as follows.
But it isn't initialized anywhere.
<snip>
static efi_status_t virt_efi_query_variable_info(u32 attr,
u64 *storage_space,
u64 *remaining_space,
u64 *max_variable_size)
{
if (efi.runtime_version < EFI_2_00_SYSTEM_TABLE_REVISION)
return EFI_UNSUPPORTED;
<snip>
This patch initializes a value of efi.runtime_version at boot time.
Signed-off-by: Seiji Aguchi <seiji.aguchi@hds.com>
Acked-by: Matthew Garrett <mjg@redhat.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
We can't assume the presence of the red zone while we're still in a boot
services environment, so we should build with -fno-red-zone to avoid
problems. Change the size of wchar at the same time to make string handling
simpler.
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
We currently use the PCI IO protocol as a proxy for a functional GOP. This
is less than ideal, since some platforms will put the GOP on output devices
rather than the GPU itself. Move to using the conout protocol. This is not
guaranteed per-spec, but is part of the consplitter implementation that
causes this problem in the first place and so should be reliable.
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Conflicts:
net/netfilter/nfnetlink_log.c
net/netfilter/xt_LOG.c
Rather easy conflict resolution, the 'net' tree had bug fixes to make
sure we checked if a socket is a time-wait one or not and elide the
logging code if so.
Whereas on the 'net-next' side we are calculating the UID and GID from
the creds using different interfaces due to the user namespace changes
from Eric Biederman.
Signed-off-by: David S. Miller <davem@davemloft.net>
Make arch_uprobe_task->saved_trap_nr "unsigned int" and move it down
after ->saved_scratch_register, this changes sizeof() from 24 to 16.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
arch_uprobe_disable_step() should also take UTASK_SSTEP_TRAPPED into
account. In this case the probed insn was not executed, we need to
clear X86_EFLAGS_TF if it was set by us and that is all.
Again, this code will look more clean when we move it into
arch_uprobe_post_xol() and arch_uprobe_abort_xol().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
arch_uprobe_disable_step() correctly preserves X86_EFLAGS_TF and
returns to user-mode. But this means the application gets SIGTRAP
only after the next insn.
This means that UPROBE_CLEAR_TF logic is not really right. _enable
should only record the state of X86_EFLAGS_TF, and _disable should
check it separately from UPROBE_FIX_SETF.
Remove arch_uprobe_task->restore_flags, add ->saved_tf instead, and
change enable/disable accordingly. This assumes that the probed insn
was not trapped, see the next patch.
arch_uprobe_skip_sstep() logic has the same problem, change it to
check X86_EFLAGS_TF and send SIGTRAP as well. We will cleanup this
all after we fold enable/disable_step into pre/post_hol hooks.
Note: send_sig(SIGTRAP) is not actually right, we need send_sigtrap().
But this needs more changes, handle_swbp() does the same and this is
equally wrong.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
user_enable/disable_single_step() was designed for ptrace, it assumes
a single user and does unnecessary and wrong things for uprobes. For
example:
- arch_uprobe_enable_step() can't trust TIF_SINGLESTEP, an
application itself can set X86_EFLAGS_TF which must be
preserved after arch_uprobe_disable_step().
- we do not want to set TIF_SINGLESTEP/TIF_FORCED_TF in
arch_uprobe_enable_step(), this only makes sense for ptrace.
- otoh we leak TIF_SINGLESTEP if arch_uprobe_disable_step()
doesn't do user_disable_single_step(), the application will
be killed after the next syscall.
- arch_uprobe_enable_step() does access_process_vm() we do
not need/want.
Change arch_uprobe_enable/disable_step() to set/clear X86_EFLAGS_TF
directly, this is much simpler and more correct. However, we need to
clear TIF_BLOCKSTEP/DEBUGCTLMSR_BTF before executing the probed insn,
add set_task_blockstep(false).
Note: with or without this patch, there is another (hopefully minor)
problem. A probed "pushf" insn can see the wrong X86_EFLAGS_TF set by
uprobes. Perhaps we should change _disable to update the stack, or
teach arch_uprobe_skip_sstep() to emulate this insn.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Afaics the usage of update_debugctlmsr() and TIF_BLOCKSTEP in
step.c was always very wrong.
1. update_debugctlmsr() was simply unneeded. The child sleeps
TASK_TRACED, __switch_to_xtra(next_p => child) should notice
TIF_BLOCKSTEP and set/clear DEBUGCTLMSR_BTF after resume if
needed.
2. It is wrong. The state of DEBUGCTLMSR_BTF bit in CPU register
should always match the state of current's TIF_BLOCKSTEP bit.
3. Even get_debugctlmsr() + update_debugctlmsr() itself does not
look right. Irq can change other bits in MSR_IA32_DEBUGCTLMSR
register or the caller can be preempted in between.
4. It is not safe to play with TIF_BLOCKSTEP if task != current.
DEBUGCTLMSR_BTF and TIF_BLOCKSTEP should always match each
other if the task is running. The tracee is stopped but it
can be SIGKILL'ed right before set/clear_tsk_thread_flag().
However, now that uprobes uses user_enable_single_step(current)
we can't simply remove update_debugctlmsr(). So this patch adds
the additional "task == current" check and disables irqs to avoid
the race with interrupts/preemption.
Unfortunately this patch doesn't solve the last problem, we need
another fix. Probably we should teach ptrace_stop() to set/clear
single/block stepping after resume.
And afaics there is yet another problem: perf can play with
MSR_IA32_DEBUGCTLMSR from nmi, this obviously means that even
__switch_to_xtra() has problems.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
No functional changes, preparation for the next fix and for uprobes
single-step fixes.
Move the code playing with TIF_BLOCKSTEP/DEBUGCTLMSR_BTF into the
new helper, set_task_blockstep().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
The arch specific implementation behaves like user_enable_single_step()
except that it does not disable single stepping if it was already
enabled by ptrace. This allows the debugger to single step over an
uprobe. The state of block stepping is not restored. It makes only sense
together with TF and if that was enabled then the debugger is notified.
Note: this is still not correct. For example, TIF_SINGLESTEP check
is not right, the application itself can set X86_EFLAGS_TF. And otoh
we leak TIF_SINGLESTEP (set by enable) if the probed insn is "popf".
See the next patches, we need the changes in arch/x86/kernel/step.c
first.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Pull perf fixes from Ingo Molnar:
"This tree includes various fixes"
Ingo really needs to improve on the whole "explain git pull" part.
"Various fixes" indeed.
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/hwpb: Invoke __perf_event_disable() if interrupts are already disabled
perf/x86: Enable Intel Cedarview Atom suppport
perf_event: Switch to internal refcount, fix race with close()
oprofile, s390: Fix uninitialized memory access when writing to oprofilefs
perf/x86: Fix microcode revision check for SNB-PEBS
Fix kprobes/x86 to support jprobes on ftrace-based kprobes.
Because of -mfentry support of ftrace, ftrace is now put
on the beginning of function where jprobes are put.
Originally ftrace-based kprobes doesn't support jprobe
because it will change regs->ip and ftrace doesn't support
changing IP and ftrace itself doesn't conflict jprobe.
However, ftrace -mfentry support moves mcount call on the
top of functions where jprobes are put. This means that
jprobe always conflicts with ftrace-based kprobe and fails.
This patch allows ftrace-based kprobes to support jprobes
by allowing to modify regs->ip and kprobes breakpoint
handler also allows to skip singlestepping because there
is a ftrace call (not an original instruction).
Link: http://lkml.kernel.org/r/20120905143125.10329.90836.stgit@localhost.localdomain
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Allow ftrace handlers to change RIP register (regs->ip)
in handlers. This will allow handlers to call another
function instead of original function.
Link: http://lkml.kernel.org/r/20120905143118.10329.5078.stgit@localhost.localdomain
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Current kprobe_ftrace_handler expects regs->ip == ip, but it is
incorrect (originally on x86-64). Actually, ftrace handler sets
regs->ip = ip + MCOUNT_INSN_SIZE.
kprobe_ftrace_handler must take care for that.
Link: http://lkml.kernel.org/r/20120905143112.10329.72069.stgit@localhost.localdomain
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Adjust x86 regs.ip to ip + MCOUNT_INSN_SIZE as like as
on x86-64. This helps us to consolidate codes which use
regs->ip on both of x86/x86-64.
Link: http://lkml.kernel.org/r/20120905143100.10329.60109.stgit@localhost.localdomain
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The main goal here is to have the resulting .config no carry any
options that aren't enabled and can't be (i.e such where the
default is "no" and can't be changed), so that if any such
option later gets a user visible prompt, the user will actually
be prompted on a "make ...oldconfig" rather than keeping the
previously invisible option disabled.
There's a little bit of other trivial cleanup mixed in here.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/504DEE19020000780009A285@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Following a relatively recent compiler change, make use of the
fact that for non-zero input BSF and TZCNT produce the same
result, and that CPUs not knowing of TZCNT will treat the
instruction as BSF (i.e. ignore what looks like a REP prefix to
them). The assumption here is that TZCNT would never have worse
performance than BSF.
For the moment, only do this when the respective generic-CPU
option is selected (as there are no specific-CPU options
covering the CPUs supporting TZCNT), and don't do that when size
optimization was requested.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/504DEA1B020000780009A277@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The 64-bit special cases of the former two (the thrird one is
64-bit only anyway) don't need to use "long" temporaries, as the
result will always fit in a 32-bit variable, and the functions
return plain "int". This avoids a few REX prefixes, i.e.
minimally reduces code size.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/504DE550020000780009A258@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
On 64 bit x86 we save the current eflags in cpu_init for use in
ret_from_fork. Strictly speaking reserved bits in EFLAGS should
be read as written but in practise it is unlikely that EFLAGS
could ever be extended in this way and the kernel alread clears
any undefined flags early on.
The equivalent 32 bit code simply hard codes 0x0202 as the new
EFLAGS.
This change makes 64 bit use the same mechanism to setup the
initial EFLAGS on fork. Note that 64 bit resets EFLAGS before
calling schedule_tail() as opposed to 32 bit which calls
schedule_tail() first. Therefore the correct value for EFLAGS
has opposite IF bit.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: "H. Peter Anvin" <hpa@zytor.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/20120824195847.GA31628@moon
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Steve Rostedt asked for the merge of a single commit, into both
the RCU and the perf/tracing tree:
| Josh made a change to the tracing code that affects both the
| work Paul McKenney and I are currently doing. At the last
| Kernel Summit back in August, Linus said when such a case
| exists, it is best to make a separate branch based off of his
| tree and place the change there. This way, the repositories
| that need to share the change can both pull them in and the
| SHA1 will match for both. Whichever branch is pulled in first
| by Linus will also pull in the necessary change for the other
| branch as well.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Current implementation simply ignores attribute flags. Thus, there is
no notification to userland of unsupported features. Check syscall's
attribute flags to let userland know if a feature is supported by the
kernel. This is also needed to distinguish between future kernels what
might support a feature.
Cc: <stable@vger.kernel.org> v3.5..
Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120910093018.GO8285@erda.amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch exports the clockticks event and its encoding to user level.
The clockticks event was exported for Nehalem/Westmere but not for Sandy
Bridge (client). Given that it uses a special encoding, it needs to be
exported to user tools, so users can do:
# perf stat -a -C 0 -e uncore_cbox_0/clockticks/ sleep 1
Signed-off-by: Stephane Eranian <eranian@google.com>
Acked-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120829130122.GA32336@quad
Signed-off-by: Ingo Molnar <mingo@kernel.org>
find_highest_vector() and count_vectors():
- Instead of using magic values, define and use proper macros.
find_highest_vector():
- Remove likely() which is there only for historical reasons and not
doing correct branch predictions anymore. Using such heuristics
to optimize this function is not worth it now. Let CPUs predict
things instead.
- Stop checking word[0] separately. This was only needed for doing
likely() optimization.
- Use for loop, not while, to iterate over the register array to make
the code clearer.
Note that we actually confirmed that the likely() did wrong predictions
by inserting debug code.
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
If the caller passes a valid kmap_op to m2p_add_override, we use
kmap_op->dev_bus_addr to store the original mfn, but dev_bus_addr is
part of the interface with Xen and if we are batching the hypercalls it
might not have been written by the hypervisor yet. That means that later
on Xen will write to it and we'll think that the original mfn is
actually what Xen has written to it.
Rather than "stealing" struct members from kmap_op, keep using
page->index to store the original mfn and add another parameter to
m2p_remove_override to get the corresponding kmap_op instead.
It is now responsibility of the caller to keep track of which kmap_op
corresponds to a particular page in the m2p_override (gntdev, the only
user of this interface that passes a valid kmap_op, is already doing that).
CC: stable@kernel.org
Reported-and-Tested-By: Sander Eikelenboom <linux@eikelenboom.it>
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
* stable/128gb.v5.1:
xen/mmu: If the revector fails, don't attempt to revector anything else.
xen/p2m: When revectoring deal with holes in the P2M array.
xen/mmu: Release just the MFN list, not MFN list and part of pagetables.
xen/mmu: Remove from __ka space PMD entries for pagetables.
xen/mmu: Copy and revector the P2M tree.
xen/p2m: Add logic to revector a P2M tree to use __va leafs.
xen/mmu: Recycle the Xen provided L4, L3, and L2 pages
xen/mmu: For 64-bit do not call xen_map_identity_early
xen/mmu: use copy_page instead of memcpy.
xen/mmu: Provide comments describing the _ka and _va aliasing issue
xen/mmu: The xen_setup_kernel_pagetable doesn't need to return anything.
Revert "xen/x86: Workaround 64-bit hypervisor and 32-bit initial domain." and "xen/x86: Use memblock_reserve for sensitive areas."
xen/x86: Workaround 64-bit hypervisor and 32-bit initial domain.
xen/x86: Use memblock_reserve for sensitive areas.
xen/p2m: Fix the comment describing the P2M tree.
Conflicts:
arch/x86/xen/mmu.c
The pagetable_init is the old xen_pagetable_setup_done and xen_pagetable_setup_start
rolled in one.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
* 'x86/platform' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (9690 commits)
x86: Document x86_init.paging.pagetable_init()
x86: xen: Cleanup and remove x86_init.paging.pagetable_setup_done()
x86: Move paging_init() call to x86_init.paging.pagetable_init()
x86: Rename pagetable_setup_start() to pagetable_init()
x86: Remove base argument from x86_init.paging.pagetable_setup_start
Linux 3.6-rc5
HID: tpkbd: work even if the new Lenovo Keyboard driver is not configured
Remove user-triggerable BUG from mpol_to_str
xen/pciback: Fix proper FLR steps.
uml: fix compile error in deliver_alarm()
dj: memory scribble in logi_dj
Fix order of arguments to compat_put_time[spec|val]
xen: Use correct masking in xen_swiotlb_alloc_coherent.
xen: fix logical error in tlb flushing
xen/p2m: Fix one-off error in checking the P2M tree directory.
powerpc: Don't use __put_user() in patch_instruction
powerpc: Make sure IPI handlers see data written by IPI senders
powerpc: Restore correct DSCR in context switch
powerpc: Fix DSCR inheritance in copy_thread()
powerpc: Keep thread.dscr and thread.dscr_inherit in sync
...
At this stage x86_init.paging.pagetable_setup_done is only used in the
XEN case. Move its content in the x86_init.paging.pagetable_init setup
function and remove the now unused x86_init.paging.pagetable_setup_done
remaining infrastructure.
Signed-off-by: Attilio Rao <attilio.rao@citrix.com>
Acked-by: <konrad.wilk@oracle.com>
Cc: <Ian.Campbell@citrix.com>
Cc: <Stefano.Stabellini@eu.citrix.com>
Cc: <xen-devel@lists.xensource.com>
Link: http://lkml.kernel.org/r/1345580561-8506-5-git-send-email-attilio.rao@citrix.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Move the paging_init() call to the platform specific pagetable_init()
function, so we can get rid of the extra pagetable_setup_done()
function pointer.
Signed-off-by: Attilio Rao <attilio.rao@citrix.com>
Acked-by: <konrad.wilk@oracle.com>
Cc: <Ian.Campbell@citrix.com>
Cc: <Stefano.Stabellini@eu.citrix.com>
Cc: <xen-devel@lists.xensource.com>
Link: http://lkml.kernel.org/r/1345580561-8506-4-git-send-email-attilio.rao@citrix.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In preparation for unifying the pagetable_setup_start() and
pagetable_setup_done() setup functions, rename appropriately all the
infrastructure related to pagetable_setup_start().
Signed-off-by: Attilio Rao <attilio.rao@citrix.com>
Ackedd-by: <konrad.wilk@oracle.com>
Cc: <Ian.Campbell@citrix.com>
Cc: <Stefano.Stabellini@eu.citrix.com>
Cc: <xen-devel@lists.xensource.com>
Link: http://lkml.kernel.org/r/1345580561-8506-3-git-send-email-attilio.rao@citrix.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We either use swapper_pg_dir or the argument is unused. Preparatory
patch to simplify platform pagetable setup further.
Signed-off-by: Attilio Rao <attilio.rao@citrix.com>
Ackedb-by: <konrad.wilk@oracle.com>
Cc: <Ian.Campbell@citrix.com>
Cc: <Stefano.Stabellini@eu.citrix.com>
Cc: <xen-devel@lists.xensource.com>
Link: http://lkml.kernel.org/r/1345580561-8506-2-git-send-email-attilio.rao@citrix.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJQTgR7AAoJEI7yEDeUysxlLnsP/jsGydL5WWsujpK9jdEbNa+r
Y2D8aPgKRe6Z6ZmWpe7jxymXA4haXh1kKO8nIVgfFOgtvN+Po8rZMOQE0TpuulFy
ANYa2N9Mi4+tpi9pqjcVgafjNa7xxRer04jZr5a789ry0MXjKspoUv/rFPpOsxrL
XlaPnRu1MLnCWmnLv3nvgatth3fIjo4X31FGmUwl4dRc59NIQREek7z3gAV0x4td
+JK6qvwzYOz0LhwV/Ssk84Sm97OOfXVaKWvxfC8y0H6nt0nXqmfGI/d5tWL8F9xz
92c0lelvbm6AMgx5yY5onRYDE6SkdlnAS563umIB58jMp44Fj+0MMPSfPQM1rBXC
NZ7mh2BJJ8Dsfi2NVneWW089o+uxSfI9HkuXGEiTzcu2mjISjuJ5ih7efAAmr7xc
e8dMM6fVTpNTazd6VGvsgno8f/Hr+D6qOZwZ8DsH4QLX0NJC41Dm6loHCE3saR2x
tjGwcbJEYqnboXEd09YOGqVOExR111cBkhVmf1qvOB71ENXoMnJ1L3e+AWojc1xN
klaMM8SQd0MsgjqTx4AOpPMKkzmONN1v44sFuQfARR1iwyY6dmLcJTqMQtDG9vJP
YTpZdkmswQrAnE2xVyMWkUd3vFj2d3JRU+aELYfIcVEuxx3hsRZdR8ezcrTRZLST
s4ii8XXmb9MGlggwrmk5
=BPP1
-----END PGP SIGNATURE-----
Merge tag 'kvm-3.6-2' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Avi Kivity:
"A trio of KVM fixes: incorrect lookup of guest cpuid, an uninitialized
variable fix, and error path cleanup fix."
* tag 'kvm-3.6-2' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: fix error paths for failed gfn_to_page() calls
KVM: x86: Check INVPCID feature bit in EBX of leaf 7
KVM: PIC: fix use of uninitialised variable.
commit b6069a9570 (filter: add MOD operation) added generic
support for modulus operation in BPF.
This patch brings JIT support for x86_64
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: George Bakos <gbakos@alpinista.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This bug was triggered:
[ 4220.198458] BUG: unable to handle kernel paging request at fffffffffffffffe
[ 4220.203907] IP: [<ffffffff81104d85>] put_page+0xf/0x34
......
[ 4220.237326] Call Trace:
[ 4220.237361] [<ffffffffa03830d0>] kvm_arch_destroy_vm+0xf9/0x101 [kvm]
[ 4220.237382] [<ffffffffa036fe53>] kvm_put_kvm+0xcc/0x127 [kvm]
[ 4220.237401] [<ffffffffa03702bc>] kvm_vcpu_release+0x18/0x1c [kvm]
[ 4220.237407] [<ffffffff81145425>] __fput+0x111/0x1ed
[ 4220.237411] [<ffffffff8114550f>] ____fput+0xe/0x10
[ 4220.237418] [<ffffffff81063511>] task_work_run+0x5d/0x88
[ 4220.237424] [<ffffffff8104c3f7>] do_exit+0x2bf/0x7ca
The test case:
printf(fmt, ##args); \
exit(-1);} while (0)
static int create_vm(void)
{
int sys_fd, vm_fd;
sys_fd = open("/dev/kvm", O_RDWR);
if (sys_fd < 0)
die("open /dev/kvm fail.\n");
vm_fd = ioctl(sys_fd, KVM_CREATE_VM, 0);
if (vm_fd < 0)
die("KVM_CREATE_VM fail.\n");
return vm_fd;
}
static int create_vcpu(int vm_fd)
{
int vcpu_fd;
vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, 0);
if (vcpu_fd < 0)
die("KVM_CREATE_VCPU ioctl.\n");
printf("Create vcpu.\n");
return vcpu_fd;
}
static void *vcpu_thread(void *arg)
{
int vm_fd = (int)(long)arg;
create_vcpu(vm_fd);
return NULL;
}
int main(int argc, char *argv[])
{
pthread_t thread;
int vm_fd;
(void)argc;
(void)argv;
vm_fd = create_vm();
pthread_create(&thread, NULL, vcpu_thread, (void *)(long)vm_fd);
printf("Exit.\n");
return 0;
}
It caused by release kvm->arch.ept_identity_map_addr which is the
error page.
The parent thread can send KILL signal to the vcpu thread when it was
exiting which stops faulting pages and potentially allocating memory.
So gfn_to_pfn/gfn_to_page may fail at this time
Fixed by checking the page before it is used
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Checking the return of kvm_mmu_get_page is unnecessary since it is
guaranteed by memory cache
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
KVM lapic timer and tsc deadline timer based on hrtimer,
setting a leftmost node to rb tree and then do hrtimer reprogram.
If hrtimer not configured as high resolution, hrtimer_enqueue_reprogram
do nothing and then make kvm lapic timer and tsc deadline timer fail.
Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Some AMD systems may round the frequencies in ACPI tables to 100MHz
boundaries. We can obtain the real frequencies from MSRs, so add a quirk
to fix these frequencies up on AMD systems.
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
The programming model for P-states on modern AMD CPUs is very similar to
that of Intel and VIA. It makes sense to consolidate this support into one
driver rather than duplicating functionality between two of them. This
patch adds support for AMDs with hardware P-state control to acpi-cpufreq.
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Checks and operations on the INVPCID feature bit should use EBX
of CPUID leaf 7 instead of ECX.
Signed-off-by: Junjie Mao <junjie.mao@intel.com>
Signed-off-by: Yongjie Ren <yongjien.ren@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Since the shift count settable there is used for shifting values
of type "unsigned long", its value must not match or exceed
BITS_PER_LONG (otherwise the shift operations are undefined).
Similarly, the value must not be negative (but -1 must be
permitted, as that's the value used to distinguish the case of
the fine grained flushing being disabled).
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: Alex Shi <alex.shi@intel.com>
Link: http://lkml.kernel.org/r/5049B65C020000780009990C@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
* Fix for TLB flushing introduced in v3.6
* Fix Xen-SWIOTLB not using proper DMA mask - device had 64bit but
in a 32-bit kernel we need to allocate for coherent pages from a
32-bit pool.
* When trying to re-use P2M nodes we had a one-off error and triggered
a BUG_ON check with specific CONFIG_ option.
* When doing FLR in Xen-PCI-backend we would first do FLR then save the
PCI configuration space. We needed to do it the other way around.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQEcBAABAgAGBQJQSS7TAAoJEFjIrFwIi8fJOBQH/03JBKBbFvewhop8T5Jww2c9
SWmMgzDm5HkxeWj5XnM+zlLoHrYFFu7tyuRLiny4weE0LdRl4adJ1TVpStxap/b6
MSQKe+tZevslaReBOsMpbCk3z7fEWNlAcpm6wMp1xYmLoHcr0JMpOCmzigbf7dwM
F4UWULheih9ME3UeqDAU8qgvfv6ccZ9vempO4TDWKjxfxfWODCNMRx+Ny+C7NNRk
QeoInHJUqcRkg0q0OIciF/YYDmn8hIH7HgfqomuMb6rEv2LOieLnWVuniEPWM75s
lECxro7v2Z9s1Std0WWKcFp8VZpI2scnQYU8aL8TpewHcbzKHQYyipuKPu7FiEQ=
=GfLE
-----END PGP SIGNATURE-----
Merge tag 'stable/for-linus-3.6-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen
Pull Xen bug-fixes from Konrad Rzeszutek Wilk:
* Fix for TLB flushing introduced in v3.6
* Fix Xen-SWIOTLB not using proper DMA mask - device had 64bit but
in a 32-bit kernel we need to allocate for coherent pages from a
32-bit pool.
* When trying to re-use P2M nodes we had a one-off error and triggered
a BUG_ON check with specific CONFIG_ option.
* When doing FLR in Xen-PCI-backend we would first do FLR then save the
PCI configuration space. We needed to do it the other way around.
* tag 'stable/for-linus-3.6-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
xen/pciback: Fix proper FLR steps.
xen: Use correct masking in xen_swiotlb_alloc_coherent.
xen: fix logical error in tlb flushing
xen/p2m: Fix one-off error in checking the P2M tree directory.
Fix "constant 0xXXXXXXXXXXXXXXXX is so big it's unsigned long" sparse warnings.
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Patch replaces 'movb' instructions with 'movzbl' to break false register
dependencies, interleaves instructions better for out-of-order scheduling
and merges constant 16-bit rotation with round-key variable rotation.
tcrypt ECB results:
Intel Core i5-2450M:
size old-vs-new new-vs-generic old-vs-generic
enc dec enc dec enc dec
256 1.13x 1.19x 2.05x 2.17x 1.82x 1.82x
1k 1.18x 1.21x 2.26x 2.33x 1.93x 1.93x
8k 1.19x 1.19x 2.32x 2.33x 1.95x 1.95x
[v2]
- Do instruction interleaving another way to avoid adding new FPU<=>CPU
register moves as these cause performance drop on Bulldozer.
- Improvements to round-key variable rotation handling.
- Further interleaving improvements for better out-of-order scheduling.
Cc: Johannes Goetzfried <Johannes.Goetzfried@informatik.stud.uni-erlangen.de>
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Patch replaces 'movb' instructions with 'movzbl' to break false register
dependencies, interleaves instructions better for out-of-order scheduling
and merges constant 16-bit rotation with round-key variable rotation.
tcrypt ECB results (128bit key):
Intel Core i5-2450M:
size old-vs-new new-vs-generic old-vs-generic
enc dec enc dec enc dec
256 1.18x 1.18x 2.45x 2.47x 2.08x 2.10x
1k 1.20x 1.20x 2.73x 2.73x 2.28x 2.28x
8k 1.20x 1.19x 2.73x 2.73x 2.28x 2.29x
[v2]
- Do instruction interleaving another way to avoid adding new FPU<=>CPU
register moves as these cause performance drop on Bulldozer.
- Improvements to round-key variable rotation handling.
- Further interleaving improvements for better out-of-order scheduling.
Cc: Johannes Goetzfried <Johannes.Goetzfried@informatik.stud.uni-erlangen.de>
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
interrupt_bitmap is KVM_NR_INTERRUPTS bits in size,
so just use that instead of hard-coded constants
and math.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Optimize "rep ins" by allowing emulator to write back more than one
datum at a time. Introduce new operand type OP_MEM_STR which tells
writeback() that dst contains pointer to an array that should be written
back as opposite to just one data element.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Remove unneeded segment argument. Address structure already has correct
segment which was put there during decode.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Current code assumes that IO exit was due to instruction emulation
and handles execution back to emulator directly. This patch adds new
userspace IO exit completion callback that can be set by any other code
that caused IO exit to userspace.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Other arches do not need this.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
v2: fix incorrect deletion of mmio sptes on gpa move (noticed by Takuya)
Signed-off-by: Avi Kivity <avi@redhat.com>
Introducing kvm_arch_flush_shadow_memslot, to invalidate the
translations of a single memory slot.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Callers of xen_remap_domain_range() need to know if the remap failed
because frame is currently paged out. So they can retry the remap
later on. Return -ENOENT in this case.
This assumes that the error codes returned by Xen are a subset of
those used by the kernel. It is unclear if this is defined as part of
the hypercall ABI.
Acked-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
While TLB_FLUSH_ALL gets passed as 'end' argument to
flush_tlb_others(), the Xen code was made to check its 'start'
parameter. That may give a incorrect op.cmd to MMUEXT_INVLPG_MULTI
instead of MMUEXT_TLB_FLUSH_MULTI. Then it causes some page can not
be flushed from TLB.
This patch fixed this issue.
Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Alex Shi <alex.shi@intel.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Yongjie Ren <yongjie.ren@intel.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
* commit '4cb38750d49010ae72e718d46605ac9ba5a851b4': (6849 commits)
bcma: fix invalid PMU chip control masks
[libata] pata_cmd64x: whitespace cleanup
libata-acpi: fix up for acpi_pm_device_sleep_state API
sata_dwc_460ex: device tree may specify dma_channel
ahci, trivial: fixed coding style issues related to braces
ahci_platform: add hibernation callbacks
libata-eh.c: local functions should not be exposed globally
libata-transport.c: local functions should not be exposed globally
sata_dwc_460ex: support hardreset
ata: use module_pci_driver
drivers/ata/pata_pcmcia.c: adjust suspicious bit operation
pata_imx: Convert to clk_prepare_enable/clk_disable_unprepare
ahci: Enable SB600 64bit DMA on MSI K9AGM2 (MS-7327) v2
[libata] Prevent interface errors with Seagate FreeAgent GoFlex
drivers/acpi/glue: revert accidental license-related 6b66d95895 bits
libata-acpi: add missing inlines in libata.h
i2c-omap: Add support for I2C_M_STOP message flag
i2c: Fall back to emulated SMBus if the operation isn't supported natively
i2c: Add SCCB support
i2c-tiny-usb: Add support for the Robofuzz OSIF USB/I2C converter
...
We would traverse the full P2M top directory (from 0->MAX_DOMAIN_PAGES
inclusive) when trying to figure out whether we can re-use some of the
P2M middle leafs.
Which meant that if the kernel was compiled with MAX_DOMAIN_PAGES=512
we would try to use the 512th entry. Fortunately for us the p2m_top_index
has a check for this:
BUG_ON(pfn >= MAX_P2M_PFN);
which we hit and saw this:
(XEN) domain_crash_sync called from entry.S
(XEN) Domain 0 (vcpu#0) crashed on cpu#0:
(XEN) ----[ Xen-4.1.2-OVM x86_64 debug=n Tainted: C ]----
(XEN) CPU: 0
(XEN) RIP: e033:[<ffffffff819cadeb>]
(XEN) RFLAGS: 0000000000000212 EM: 1 CONTEXT: pv guest
(XEN) rax: ffffffff81db5000 rbx: ffffffff81db4000 rcx: 0000000000000000
(XEN) rdx: 0000000000480211 rsi: 0000000000000000 rdi: ffffffff81db4000
(XEN) rbp: ffffffff81793db8 rsp: ffffffff81793d38 r8: 0000000008000000
(XEN) r9: 4000000000000000 r10: 0000000000000000 r11: ffffffff81db7000
(XEN) r12: 0000000000000ff8 r13: ffffffff81df1ff8 r14: ffffffff81db6000
(XEN) r15: 0000000000000ff8 cr0: 000000008005003b cr4: 00000000000026f0
(XEN) cr3: 0000000661795000 cr2: 0000000000000000
Fixes-Oracle-Bug: 14570662
CC: stable@vger.kernel.org # only for v3.5
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
We never modify direct_access_msrs[], msrpm_ranges[],
svm_exit_handlers[] or x86_intercept_map[] at runtime.
Mark them r/o.
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
We use vmcs_field_to_offset_table[], kvm_vmx_segment_fields[] and
kvm_vmx_exit_handlers[] as lookup tables only -- make them r/o.
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
We never change emulate_ops[] at runtime so it should be r/o.
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
The opcode tables never change at runtime, therefor mark them const.
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
As the the compiler ensures that the memory operand is always aligned
to a 16 byte memory location, use the aligned variant of MOVDQ for
read_sse_reg() and write_sse_reg().
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Some fields can be constified and/or made static to reduce code and data
size.
Numbers for a 32 bit build:
text data bss dec hex filename
before: 3351 80 0 3431 d67 cpuid.o
after: 3391 0 0 3391 d3f cpuid.o
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Deleted the no longer valid example of which x86 CPUs lack a
hardware IOMMU, and moved the "If unsure..." statement to a new
line to follow the style of surrounding options.
Signed-off-by: Joe Millenbach <jmillenbach@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Cc: team-fjord@googlegroups.com
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Link: http://lkml.kernel.org/r/1346632700-29113-1-git-send-email-jmillenbach@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
IOMMU_INIT_POST and IOMMU_INIT_POST_FINISH pass the plain value
0 instead of NULL to __IOMMU_INIT. Fix this and make sparse
happy by doing so.
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Joerg Roedel <joerg.roedel@amd.com>
Link: http://lkml.kernel.org/r/1346621506-30857-8-git-send-email-minipli@googlemail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Don't remove the __user annotation of the fpstate pointer, but
drop the superfluous void * cast instead.
This fixes the following sparse warnings:
xsave.c:135:15: warning: cast removes address space of expression
xsave.c:135:15: warning: incorrect type in argument 1 (different address spaces)
xsave.c:135:15: expected void const volatile [noderef] <asn:1>*<noident>
[...]
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1346621506-30857-6-git-send-email-minipli@googlemail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The address calculated by VDSO32_SYMBOL() is a pointer into
userland. Add the __user annotation to fix related sparse
warnings in its users.
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Andy Lutomirski <luto@MIT.EDU>
Link: http://lkml.kernel.org/r/1346621506-30857-3-git-send-email-minipli@googlemail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fix the following sparse warnings:
sys_ia32.c:293:38: warning: incorrect type in argument 2 (different address spaces)
sys_ia32.c:293:38: expected unsigned int [noderef] [usertype] <asn:1>*stat_addr
sys_ia32.c:293:38: got unsigned int *stat_addr
Ironically, sys_ia32.h was introduced to fix sparse warnings but
missed that one.
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Link: http://lkml.kernel.org/r/1346621506-30857-2-git-send-email-minipli@googlemail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The x86 defconfigs include exactly one module: test_nx.ko, a
special-purpose module which just exists to do evil things like
executing code off the stack to see if the kernel has enabled NX
support. Anyone who actually uses that module can easily enable
it themselves, but the vast majority of kernel builds don't need
it; disable it by default.
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Link: http://lkml.kernel.org/r/e72faf875e1172fb1cbec5e6d3cd4122df508a97.1346649518.git.josh@joshtriplett.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The vast majority of systems either use initramfs or mount a
root filesystem directly from the kernel. Distros have
defaulted to initramfs for years. Only highly specialized
systems would use an actual filesystem-image initrd at this
point, and such systems don't rely on defconfig anyway. Drop
initrd support (and specifically RAM block device support) from
the defconfigs.
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/2521e983a63595cd7a331236d929577660f89c72.1346649518.git.josh@joshtriplett.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
CONFIG_CRC_T10DIF explicitly states that it exists only for use
by out-of-tree modules; anything in-kernel that needs it selects
it. Thus, compile it out by default.
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/3aaff7a0af1320427952d411a21b8ded29747a1f.1346649518.git.josh@joshtriplett.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The current x86 and x86-64 defconfigs do not enable ext4, which
most current distributions default to. Switch the defconfigs to
ext4, so they will boot on current systems without additional
configuration.
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/bd8a359506b7e1287c680823de16d67608ec52fe.1346649518.git.josh@joshtriplett.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The x86 defconfigs have become somewhat out of date compared to
the current result of "make savedefconfig". Update them to the
current output, as a prelude to further defconfig changes, to
avoid unrelated noise in those further changes.
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/80c8a5fbeaf6cdb72fb78a016013427efee52668.1346649518.git.josh@joshtriplett.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch enables perf_events support for Intel Cedarview
Atom (model 54) processors. Support includes PEBS and LBR.
Tested on my Atom N2600 netbook.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120820092421.GA11284@quad
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit aea218f3cb (KVM: PIC: call ack notifiers for irqs that are
dropped form irr) used an uninitialised variable to track whether an
appropriate apic had been found. This could result in calling the ack
notifier incorrectly.
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Avi Kivity <avi@redhat.com>
Signed-off-by: Jamie Iles <jamie@jamieiles.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
kvm_pic_reset() is not used anywhere. Move reset logic from
pic_ioport_write() there.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
HPET_ID_VENDOR_8086 is defined but never used. It would be a redefine
of PCI_VENDOR_ID_INTEL if it was ever used.
Signed-off-by: Jon Mason <jdmason@kudzu.us>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
We will enter the guest with G and D cleared; as real hardware ignores D in
real mode, and G is taken care of by the limit test, we allow more code to
run in vm86 mode.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
While this is undocumented, real processors do not reload the segment
limit and access rights when loading a segment register in real mode.
Real programs rely on it so we need to comply with this behaviour.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
emulate_invalid_guest_state=1 doesn't mean we don't munge the segments in the
vmcs; we do. So we need to return the real ones (maintained by vmx_set_segment).
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Segment limits are verified in real mode, not just protected mode.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
When loading a segment in real mode, only the base and selector must
be modified. The limit needs to be left alone, otherwise big real mode
users will hit a #GP due to limit checking (currently this is suppressed
because we don't check limits in real mode).
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Usually, big real mode uses large (4GB) segments. Currently we don't
virtualize this; if any segment has a limit other than 0xffff, we emulate.
But if we set the vmx-visible limit to 0xffff, we can use vm86 to virtualize
real mode; if an access overruns the segment limit, the guest will #GP, which
we will trap and forward to the emulator. This results in significantly
faster execution, and less risk of hitting an unemulated instruction.
If the limit is less than 0xffff, we retain the existing behaviour.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Real mode is always entered from protected mode with dpl=0. Since
the dpl doesn't affect execution, and we already override it to 3
in the vmcs (as vmx requires), we can allow execution in that state.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Real processors don't change segment limits and attributes while in
real mode. Mimic that behaviour.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Instead of using struct kvm_save_segment, use struct kvm_segment, which is what
the other APIs use. This leads to some simplification.
We replace save_rmode_seg() with a call to vmx_save_segment(). Since this depends
on rmode.vm86_active, we move the call to before setting the flag.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
fix_pmode_dataseg() looks up S in ->base instead of ->ar_bytes.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Commit b246dd5df1 ("KVM: VMX: Fix KVM_SET_SREGS with big real mode
segments") moved fix_rmode_seg() to vmx_set_segment(), so that it is
applied not just on transitions to real mode, but also on KVM_SET_SREGS
(migration). However fix_rmode_seg() not only munges the vmcs segments,
it also sets up the save area for us to restore when returning to
protected mode or to return in vmx_get_segment().
Move saving the segment into a new function, save_rmode_seg(), and
call it just during the transition.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Instead of populating the entire register file, read in registers
as they are accessed, and write back only the modified ones. This
saves a VMREAD and VMWRITE on Intel (for rsp, since it is not usually
used during emulation), and a two 128-byte copies for the registers.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
KVM_GET_MSR was missing support for PV EOI,
which is needed for migration.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
The following patch makes the microcode update code path
actually invoke the perf_check_microcode() function and
thus potentially renabling SNB PEBS.
By default, CONFIG_MICROCODE_OLD_INTERFACE is
forced to Y in arch/x86/Kconfig. There is no
way to disable this. That means that the code
path used in arch/x86/kernel/microcode_core.c
did not include the call to perf_check_microcode().
Thus, even though the microcode was updated to a
version that fixes the SNB PEBS problem, perf_event
would still return EOPNOTSUPP when enabling precise
sampling.
This patch simply adds a call to perf_check_microcode()
in the call path used when OLD_INTERFACE=y.
Signed-off-by: Stephane Eranian <eranian@google.com>
Acked-by: Borislav Petkov <borislav.petkov@amd.com>
Cc: peterz@infradead.org
Cc: andi@firstfloor.org
Link: http://lkml.kernel.org/r/20120824133434.GA8014@quad
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merging critical fixes from upstream required for development.
* upstream/master: (809 commits)
libata: Add a space to " 2GB ATA Flash Disk" DMA blacklist entry
Revert "powerpc: Update g5_defconfig"
powerpc/perf: Use pmc_overflow() to detect rolled back events
powerpc: Fix VMX in interrupt check in POWER7 copy loops
powerpc: POWER7 copy_to_user/copy_from_user patch applied twice
powerpc: Fix personality handling in ppc64_personality()
powerpc/dma-iommu: Fix IOMMU window check
powerpc: Remove unnecessary ifdefs
powerpc/kgdb: Restore current_thread_info properly
powerpc/kgdb: Bail out of KGDB when we've been triggered
powerpc/kgdb: Do not set kgdb_single_step on ppc
powerpc/mpic_msgr: Add missing includes
powerpc: Fix null pointer deref in perf hardware breakpoints
powerpc: Fixup whitespace in xmon
powerpc: Fix xmon dl command for new printk implementation
xfs: check for possible overflow in xfs_ioc_trim
xfs: unlock the AGI buffer when looping in xfs_dialloc
xfs: fix uninitialised variable in xfs_rtbuf_get()
powerpc/fsl: fix "Failed to mount /dev: No such device" errors
powerpc/fsl: update defconfigs
...
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
- Revert the kexec fix which caused on non-kexec shutdowns a race.
- Reuse existing P2M leafs - instead of requiring to allocate a large
area of bootup virtual address estate.
- Fix a one-off error when adding PFNs for balloon pages.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQEcBAABAgAGBQJQNppKAAoJEFjIrFwIi8fJU/oH/jdWdRqJgC5mCnu9LwrIemEj
gPTAcKw01A/2vbOY5rfXx7rCpgeU5ZM/XSt0byz/J5q0bmjjKVM106Smq1s7EaQx
OjsdLglWoZYzKJjXH/FEKRPD39f/hd+KNJu3aGEJM8UZ0htvxlg6ACGzVPJa83Pf
yrRXSycxvEevbGbuwWdNubxD5WKMMmbzi/HGGfdtL4256d0xIgxMrYgskLek96cR
cg11llC5QLzH8mX+M5iX0lchASvMITyERXyEKK2opFN8a/766yi16agP75RKZdkP
kWXp0vyOMrpy9UnOs2V1XLc/ufqNwHLcPVfecScXhz8xZWrZYOBdJQf7HAWxvLE=
=MgvT
-----END PGP SIGNATURE-----
Merge tag 'stable/for-linus-3.6-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen
Pull three xen bug-fixes from Konrad Rzeszutek Wilk:
- Revert the kexec fix which caused on non-kexec shutdowns a race.
- Reuse existing P2M leafs - instead of requiring to allocate a large
area of bootup virtual address estate.
- Fix a one-off error when adding PFNs for balloon pages.
* tag 'stable/for-linus-3.6-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
xen/setup: Fix one-off error when adding for-balloon PFNs to the P2M.
xen/p2m: Reuse existing P2M leafs if they are filled with 1:1 PFNs or INVALID.
Revert "xen PVonHVM: move shared_info to MMIO before kexec"
Pull kvm fixes from Marcelo Tosatti.
* git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86 emulator: use stack size attribute to mask rsp in stack ops
KVM: MMU: Fix mmu_shrink() so that it can free mmu pages as intended
ppc: e500_tlb memset clears nothing
KVM: PPC: Add cache flush on page map
KVM: PPC: Book3S HV: Fix incorrect branch in H_CEDE code
KVM: x86: update KVM_SAVE_MSRS_BEGIN to correct value
If the P2M revectoring would fail, we would try to continue on by
cleaning the PMD for L1 (PTE) page-tables. The xen_cleanhighmap
is greedy and erases the PMD on both boundaries. Since the P2M
array can share the PMD, we would wipe out part of the __ka
that is still used in the P2M tree to point to P2M leafs.
This fixes it by bypassing the revectoring and continuing on.
If the revector fails, a nice WARN is printed so we can still
troubleshoot this.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
When we free the PFNs and then subsequently populate them back
during bootup:
Freeing 20000-20200 pfn range: 512 pages freed
1-1 mapping on 20000->20200
Freeing 40000-40200 pfn range: 512 pages freed
1-1 mapping on 40000->40200
Freeing bad80-badf4 pfn range: 116 pages freed
1-1 mapping on bad80->badf4
Freeing badf6-bae7f pfn range: 137 pages freed
1-1 mapping on badf6->bae7f
Freeing bb000-100000 pfn range: 282624 pages freed
1-1 mapping on bb000->100000
Released 283999 pages of unused memory
Set 283999 page(s) to 1-1 mapping
Populating 1acb8a-1f20e9 pfn range: 283999 pages added
We end up having the P2M array (that is the one that was
grafted on the P2M tree) filled with IDENTITY_FRAME or
INVALID_P2M_ENTRY) entries. The patch titled
"xen/p2m: Reuse existing P2M leafs if they are filled with 1:1 PFNs or INVALID."
recycles said slots and replaces the P2M tree leaf's with
&mfn_list[xx] with p2m_identity or p2m_missing.
And re-uses the P2M array sections for other P2M tree leaf's.
For the above mentioned bootup excerpt, the PFNs at
0x20000->0x20200 are going to be IDENTITY based:
P2M[0][256][0] -> P2M[0][257][0] get turned in IDENTITY_FRAME.
We can re-use that and replace P2M[0][256] to point to p2m_identity.
The "old" page (the grafted P2M array provided by Xen) that was at
P2M[0][256] gets put somewhere else. Specifically at P2M[6][358],
b/c when we populate back:
Populating 1acb8a-1f20e9 pfn range: 283999 pages added
we fill P2M[6][358][0] (and P2M[6][358], P2M[6][359], ...) with
the new MFNs.
That is all OK, except when we revector we assume that the PFN
count would be the same in the grafted P2M array and in the
newly allocated. Since that is no longer the case, as we have
holes in the P2M that point to p2m_missing or p2m_identity we
have to take that into account.
[v2: Check for overflow]
[v3: Move within the __va check]
[v4: Fix the computation]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
We call memblock_reserve for [start of mfn list] -> [PMD aligned end
of mfn list] instead of <start of mfn list> -> <page aligned end of mfn list].
This has the disastrous effect that if at bootup the end of mfn_list is
not PMD aligned we end up returning to memblock parts of the region
past the mfn_list array. And those parts are the PTE tables with
the disastrous effect of seeing this at bootup:
Write protecting the kernel read-only data: 10240k
Freeing unused kernel memory: 1860k freed
Freeing unused kernel memory: 200k freed
(XEN) mm.c:2429:d0 Bad type (saw 1400000000000002 != exp 7000000000000000) for mfn 116a80 (pfn 14e26)
...
(XEN) mm.c:908:d0 Error getting mfn 116a83 (pfn 14e2a) from L1 entry 8000000116a83067 for l1e_owner=0, pg_owner=0
(XEN) mm.c:908:d0 Error getting mfn 4040 (pfn 5555555555555555) from L1 entry 0000000004040601 for l1e_owner=0, pg_owner=0
.. and so on.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Please first read the description in "xen/mmu: Copy and revector the
P2M tree."
At this stage, the __ka address space (which is what the old
P2M tree was using) is partially disassembled. The cleanup_highmap
has removed the PMD entries from 0-16MB and anything past _brk_end
up to the max_pfn_mapped (which is the end of the ramdisk).
The xen_remove_p2m_tree and code around has ripped out the __ka for
the old P2M array.
Here we continue on doing it to where the Xen page-tables were.
It is safe to do it, as the page-tables are addressed using __va.
For good measure we delete anything that is within MODULES_VADDR
and up to the end of the PMD.
At this point the __ka only contains PMD entries for the start
of the kernel up to __brk.
[v1: Per Stefano's suggestion wrapped the MODULES_VADDR in debug]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Please first read the description in "xen/p2m: Add logic to revector a
P2M tree to use __va leafs" patch.
The 'xen_revector_p2m_tree()' function allocates a new P2M tree
copies the contents of the old one in it, and returns the new one.
At this stage, the __ka address space (which is what the old
P2M tree was using) is partially disassembled. The cleanup_highmap
has removed the PMD entries from 0-16MB and anything past _brk_end
up to the max_pfn_mapped (which is the end of the ramdisk).
We have revectored the P2M tree (and the one for save/restore as well)
to use new shiny __va address to new MFNs. The xen_start_info
has been taken care of already in 'xen_setup_kernel_pagetable()' and
xen_start_info->shared_info in 'xen_setup_shared_info()', so
we are free to roam and delete PMD entries - which is exactly what
we are going to do. We rip out the __ka for the old P2M array.
[v1: Fix smatch warnings]
[v2: memset was doing 0 instead of 0xff]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
During bootup Xen supplies us with a P2M array. It sticks
it right after the ramdisk, as can be seen with a 128GB PV guest:
(certain parts removed for clarity):
xc_dom_build_image: called
xc_dom_alloc_segment: kernel : 0xffffffff81000000 -> 0xffffffff81e43000 (pfn 0x1000 + 0xe43 pages)
xc_dom_pfn_to_ptr: domU mapping: pfn 0x1000+0xe43 at 0x7f097d8bf000
xc_dom_alloc_segment: ramdisk : 0xffffffff81e43000 -> 0xffffffff925c7000 (pfn 0x1e43 + 0x10784 pages)
xc_dom_pfn_to_ptr: domU mapping: pfn 0x1e43+0x10784 at 0x7f0952dd2000
xc_dom_alloc_segment: phys2mach : 0xffffffff925c7000 -> 0xffffffffa25c7000 (pfn 0x125c7 + 0x10000 pages)
xc_dom_pfn_to_ptr: domU mapping: pfn 0x125c7+0x10000 at 0x7f0942dd2000
xc_dom_alloc_page : start info : 0xffffffffa25c7000 (pfn 0x225c7)
xc_dom_alloc_page : xenstore : 0xffffffffa25c8000 (pfn 0x225c8)
xc_dom_alloc_page : console : 0xffffffffa25c9000 (pfn 0x225c9)
nr_page_tables: 0x0000ffffffffffff/48: 0xffff000000000000 -> 0xffffffffffffffff, 1 table(s)
nr_page_tables: 0x0000007fffffffff/39: 0xffffff8000000000 -> 0xffffffffffffffff, 1 table(s)
nr_page_tables: 0x000000003fffffff/30: 0xffffffff80000000 -> 0xffffffffbfffffff, 1 table(s)
nr_page_tables: 0x00000000001fffff/21: 0xffffffff80000000 -> 0xffffffffa27fffff, 276 table(s)
xc_dom_alloc_segment: page tables : 0xffffffffa25ca000 -> 0xffffffffa26e1000 (pfn 0x225ca + 0x117 pages)
xc_dom_pfn_to_ptr: domU mapping: pfn 0x225ca+0x117 at 0x7f097d7a8000
xc_dom_alloc_page : boot stack : 0xffffffffa26e1000 (pfn 0x226e1)
xc_dom_build_image : virt_alloc_end : 0xffffffffa26e2000
xc_dom_build_image : virt_pgtab_end : 0xffffffffa2800000
So the physical memory and virtual (using __START_KERNEL_map addresses)
layout looks as so:
phys __ka
/------------\ /-------------------\
| 0 | empty | 0xffffffff80000000|
| .. | | .. |
| 16MB | <= kernel starts | 0xffffffff81000000|
| .. | | |
| 30MB | <= kernel ends => | 0xffffffff81e43000|
| .. | & ramdisk starts | .. |
| 293MB | <= ramdisk ends=> | 0xffffffff925c7000|
| .. | & P2M starts | .. |
| .. | | .. |
| 549MB | <= P2M ends => | 0xffffffffa25c7000|
| .. | start_info | 0xffffffffa25c7000|
| .. | xenstore | 0xffffffffa25c8000|
| .. | cosole | 0xffffffffa25c9000|
| 549MB | <= page tables => | 0xffffffffa25ca000|
| .. | | |
| 550MB | <= PGT end => | 0xffffffffa26e1000|
| .. | boot stack | |
\------------/ \-------------------/
As can be seen, the ramdisk, P2M and pagetables are taking
a bit of __ka addresses space. Which is a problem since the
MODULES_VADDR starts at 0xffffffffa0000000 - and P2M sits
right in there! This results during bootup with the inability to
load modules, with this error:
------------[ cut here ]------------
WARNING: at /home/konrad/ssd/linux/mm/vmalloc.c:106 vmap_page_range_noflush+0x2d9/0x370()
Call Trace:
[<ffffffff810719fa>] warn_slowpath_common+0x7a/0xb0
[<ffffffff81030279>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
[<ffffffff81071a45>] warn_slowpath_null+0x15/0x20
[<ffffffff81130b89>] vmap_page_range_noflush+0x2d9/0x370
[<ffffffff81130c4d>] map_vm_area+0x2d/0x50
[<ffffffff811326d0>] __vmalloc_node_range+0x160/0x250
[<ffffffff810c5369>] ? module_alloc_update_bounds+0x19/0x80
[<ffffffff810c6186>] ? load_module+0x66/0x19c0
[<ffffffff8105cadc>] module_alloc+0x5c/0x60
[<ffffffff810c5369>] ? module_alloc_update_bounds+0x19/0x80
[<ffffffff810c5369>] module_alloc_update_bounds+0x19/0x80
[<ffffffff810c70c3>] load_module+0xfa3/0x19c0
[<ffffffff812491f6>] ? security_file_permission+0x86/0x90
[<ffffffff810c7b3a>] sys_init_module+0x5a/0x220
[<ffffffff815ce339>] system_call_fastpath+0x16/0x1b
---[ end trace fd8f7704fdea0291 ]---
vmalloc: allocation failure, allocated 16384 of 20480 bytes
modprobe: page allocation failure: order:0, mode:0xd2
Since the __va and __ka are 1:1 up to MODULES_VADDR and
cleanup_highmap rids __ka of the ramdisk mapping, what
we want to do is similar - get rid of the P2M in the __ka
address space. There are two ways of fixing this:
1) All P2M lookups instead of using the __ka address would
use the __va address. This means we can safely erase from
__ka space the PMD pointers that point to the PFNs for
P2M array and be OK.
2). Allocate a new array, copy the existing P2M into it,
revector the P2M tree to use that, and return the old
P2M to the memory allocate. This has the advantage that
it sets the stage for using XEN_ELF_NOTE_INIT_P2M
feature. That feature allows us to set the exact virtual
address space we want for the P2M - and allows us to
boot as initial domain on large machines.
So we pick option 2).
This patch only lays the groundwork in the P2M code. The patch
that modifies the MMU is called "xen/mmu: Copy and revector the P2M tree."
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
As we are not using them. We end up only using the L1 pagetables
and grafting those to our page-tables.
[v1: Per Stefano's suggestion squashed two commits]
[v2: Per Stefano's suggestion simplified loop]
[v3: Fix smatch warnings]
[v4: Add more comments]
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
B/c we do not need it. During the startup the Xen provides
us with all the initial memory mapped that we need to function.
The initial memory mapped is up to the bootstack, which means
we can reference using __ka up to 4.f):
(from xen/interface/xen.h):
4. This the order of bootstrap elements in the initial virtual region:
a. relocated kernel image
b. initial ram disk [mod_start, mod_len]
c. list of allocated page frames [mfn_list, nr_pages]
d. start_info_t structure [register ESI (x86)]
e. bootstrap page tables [pt_base, CR3 (x86)]
f. bootstrap stack [register ESP (x86)]
(initial ram disk may be ommitted).
[v1: More comments in git commit]
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Which is that the level2_kernel_pgt (__ka virtual addresses)
and level2_ident_pgt (__va virtual address) contain the same
PMD entries. So if you modify a PTE in __ka, it will be reflected
in __va (and vice-versa).
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
We don't need to return the new PGD - as we do not use it.
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
This reverts commit 806c312e50 and
commit 59b294403e.
And also documents setup.c and why we want to do it that way, which
is that we tried to make the the memblock_reserve more selective so
that it would be clear what region is reserved. Sadly we ran
in the problem wherein on a 64-bit hypervisor with a 32-bit
initial domain, the pt_base has the cr3 value which is not
neccessarily where the pagetable starts! As Jan put it: "
Actually, the adjustment turns out to be correct: The page
tables for a 32-on-64 dom0 get allocated in the order "first L1",
"first L2", "first L3", so the offset to the page table base is
indeed 2. When reading xen/include/public/xen.h's comment
very strictly, this is not a violation (since there nothing is said
that the first thing in the page table space is pointed to by
pt_base; I admit that this seems to be implied though, namely
do I think that it is implied that the page table space is the
range [pt_base, pt_base + nt_pt_frames), whereas that
range here indeed is [pt_base - 2, pt_base - 2 + nt_pt_frames),
which - without a priori knowledge - the kernel would have
difficulty to figure out)." - so lets just fall back to the
easy way and reserve the whole region.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
If the kernel is compiled with gcc 4.6.0 which supports -mfentry,
then use that instead of mcount.
With mcount, frame pointers are forced with the -pg option and we
get something like:
<can_vma_merge_before>:
55 push %rbp
48 89 e5 mov %rsp,%rbp
53 push %rbx
41 51 push %r9
e8 fe 6a 39 00 callq ffffffff81483d00 <mcount>
31 c0 xor %eax,%eax
48 89 fb mov %rdi,%rbx
48 89 d7 mov %rdx,%rdi
48 33 73 30 xor 0x30(%rbx),%rsi
48 f7 c6 ff ff ff f7 test $0xfffffffff7ffffff,%rsi
With -mfentry, frame pointers are no longer forced and the call looks
like this:
<can_vma_merge_before>:
e8 33 af 37 00 callq ffffffff81461b40 <__fentry__>
53 push %rbx
48 89 fb mov %rdi,%rbx
31 c0 xor %eax,%eax
48 89 d7 mov %rdx,%rdi
41 51 push %r9
48 33 73 30 xor 0x30(%rbx),%rsi
48 f7 c6 ff ff ff f7 test $0xfffffffff7ffffff,%rsi
This adds the ftrace hook at the beginning of the function before a
frame is set up, and allows the function callbacks to be able to access
parameters. As kprobes now can use function tracing (at least on x86)
this speeds up the kprobe hooks that are at the beginning of the
function.
Link: http://lkml.kernel.org/r/20120807194100.130477900@goodmis.org
Acked-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
arch/x86/xen/pci-swiotlb-xen.c:96:1: warning: Using plain integer as NULL pointer
arch/x86/xen/pci-swiotlb-xen.c:96:1: warning: Using plain integer as NULL pointer
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
This patch removes the "return -ENOSYS" for auto_translated_physmap
guests from privcmd_mmap, thus it allows ARM guests to issue privcmd
mmap calls. However privcmd mmap calls are still going to fail for HVM
and hybrid guests on x86 because the xen_remap_domain_mfn_range
implementation is currently PV only.
Changes in v2:
- better commit message;
- return -EINVAL from xen_remap_domain_mfn_range if
auto_translated_physmap.
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
All the original Xen headers have xen_pfn_t as mfn and pfn type, however
when they have been imported in Linux, xen_pfn_t has been replaced with
unsigned long. That might work for x86 and ia64 but it does not for arm.
Bring back xen_pfn_t and let each architecture define xen_pfn_t as they
see fit.
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
When we are finished with return PFNs to the hypervisor, then
populate it back, and also mark the E820 MMIO and E820 gaps
as IDENTITY_FRAMEs, we then call P2M to set areas that can
be used for ballooning. We were off by one, and ended up
over-writting a P2M entry that most likely was an IDENTITY_FRAME.
For example:
1-1 mapping on 40000->40200
1-1 mapping on bc558->bc5ac
1-1 mapping on bc5b4->bc8c5
1-1 mapping on bc8c6->bcb7c
1-1 mapping on bcd00->100000
Released 614 pages of unused memory
Set 277889 page(s) to 1-1 mapping
Populating 40200-40466 pfn range: 614 pages added
=> here we set from 40466 up to bc559 P2M tree to be
INVALID_P2M_ENTRY. We should have done it up to bc558.
The end result is that if anybody is trying to construct
a PTE for PFN bc558 they end up with ~PAGE_PRESENT.
CC: stable@vger.kernel.org
Reported-by-and-Tested-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
We still patch SMP instructions to UP variants if we boot with a
single CPU, but not at any other time. In particular, not if we
unplug CPUs to return to a single cpu.
Paul McKenney points out:
mean offline overhead is 6251/48=130.2 milliseconds.
If I remove the alternatives_smp_switch() from the offline
path [...] the mean offline overhead is 550/42=13.1 milliseconds
Basically, we're never going to get those 120ms back, and the
code is pretty messy.
We get rid of:
1) The "smp-alt-once" boot option. It's actually "smp-alt-boot", the
documentation is wrong. It's now the default.
2) The skip_smp_alternatives flag used by suspend.
3) arch_disable_nonboot_cpus_begin() and arch_disable_nonboot_cpus_end()
which were only used to set this one flag.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Paul McKenney <paul.mckenney@us.ibm.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/87vcgwwive.fsf@rustcorp.com.au
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The distinction between CONFIG_KVM_CLOCK and CONFIG_KVM_GUEST is
not so clear anymore, as demonstrated by recent bugs caused by poor
handling of on/off combinations of these options.
Merge CONFIG_KVM_CLOCK into CONFIG_KVM_GUEST.
Reported-By: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Limit the access to userspace only on the BSP where we load the
container, verify the patches in it and put them in the patch cache.
Then, at application time, we lookup the correct patch in the cache and
use it.
When we need to reload the userspace container, we do that over the
reload interface:
echo 1 > /sys/devices/system/cpu/microcode/reload
which reloads (a possibly newer) container from userspace and applies
then the newest patches from there.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/1344361461-10076-13-git-send-email-bp@amd64.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
This is a trivial cache which collects all ucode patches for the current
family of CPUs on the system. If a newer patch appears due to the
container file being updated in userspace, we replace our cached version
with the new one.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/1344361461-10076-12-git-send-email-bp@amd64.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
We search the equivalence table using the CPUID(1) signature of the
CPU in order to get the equivalence ID of the patch which we need to
apply. Add a function which does the reverse - it will be needed in
later patches.
While at it, pull the other equiv table function up in the file so that
it can be used by other functionality without forward declarations.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/1344361461-10076-11-git-send-email-bp@amd64.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
This is done in preparation for teaching the ucode driver to either load
a new ucode patches container from userspace or use an already cached
version. No functionality change in this patch.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/1344361461-10076-10-git-send-email-bp@amd64.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Read the CPUID(1).EAX leaf at the correct cpu and use it to search the
equivalence table for matching microcode patch. No functionality change.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/1344361461-10076-9-git-send-email-bp@amd64.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Make sure we're actually applying a microcode patch to a core which
really needs it.
This brings only a very very very minor slowdown on F10:
0.032218828 sec vs 0.056010626 sec with this patch.
And small speedup on F15:
0.487089449 sec vs 0.180551162 sec (from perf output).
Also, fixup comments while at it.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/1344361461-10076-8-git-send-email-bp@amd64.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
get_ucode_data was a trivial memcpy wrapper. Remove it so as not to
obfuscate code unnecessarily with no obvious gain.
No functional change.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/1344361461-10076-7-git-send-email-bp@amd64.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Mask out CPU_TASKS_FROZEN bit so that all _FROZEN cases can be dropped.
Also, add some more comments as to why CPU_ONLINE falls through to
CPU_DOWN_FAILED (no break), and for the CPU_DEAD case. Realign debug
printks better.
Idea blatantly stolen from a tglx patch:
http://marc.info/?l=linux-kernel&m=134267779513862
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/1344361461-10076-5-git-send-email-bp@amd64.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Remove the uci->mc check on the cpu resume path because the low-level
drivers do that anyway.
More importantly, though, this fixes a contrived and obscure but still
important case. Imagine the following:
* boot machine, no new microcode in /lib/firmware
* a subset of the CPUs is offlined
* in the meantime, user puts new fresh microcode container into
/lib/firmware and reloads it by doing
$ echo 1 > /sys/devices/system/cpu/microcode/reload
* offlined cores come back online and they don't get the newer microcode
applied due to this check.
Later patches take care of the issue on AMD.
While at it, cleanup code around it.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/1344361461-10076-4-git-send-email-bp@amd64.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Invert the uci->valid check so that the later block can be aligned on
the first indentation level of the function. No functional change.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/1344361461-10076-3-git-send-email-bp@amd64.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
This issue was recently observed on an AMD C-50 CPU where a patch of
maximum size was applied.
Commit be62adb492 ("x86, microcode, AMD: Simplify ucode verification")
added current_size in get_matching_microcode(). This is calculated as
size of the ucode patch + 8 (ie. size of the header). Later this is
compared against the maximum possible ucode patch size for a CPU family.
And of course this fails if the patch has already maximum size.
Cc: <stable@vger.kernel.org> [3.3+]
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/1344361461-10076-1-git-send-email-bp@amd64.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
The sub-register used to access the stack (sp, esp, or rsp) is not
determined by the address size attribute like other memory references,
but by the stack segment's B bit (if not in x86_64 mode).
Fix by using the existing stack_mask() to figure out the correct mask.
This long-existing bug was exposed by a combination of a27685c33a
(emulate invalid guest state by default), which causes many more
instructions to be emulated, and a seabios change (possibly a bug) which
causes the high 16 bits of esp to become polluted across calls to real
mode software interrupts.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Although the possible race described in
commit 85b7059169
KVM: MMU: fix shrinking page from the empty mmu
was correct, the real cause of that issue was a more trivial bug of
mmu_shrink() introduced by
commit 1952639665
KVM: MMU: do not iterate over all VMs in mmu_shrink()
Here is the bug:
if (kvm->arch.n_used_mmu_pages > 0) {
if (!nr_to_scan--)
break;
continue;
}
We skip VMs whose n_used_mmu_pages is not zero and try to shrink others:
in other words we try to shrink empty ones by mistake.
This patch reverses the logic so that mmu_shrink() can free pages from
the first VM whose n_used_mmu_pages is not zero. Note that we also add
comments explaining the role of nr_to_scan which is not practically
important now, hoping this will be improved in the future.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
In current code, if we map a readonly memory space from host to guest
and the page is not currently mapped in the host, we will get a fault
pfn and async is not allowed, then the vm will crash
We introduce readonly memory region to map ROM/ROMD to the guest, read access
is happy for readonly memslot, write access on readonly memslot will cause
KVM_EXIT_MMIO exit
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Currently, we reexecute all unhandleable instructions if they do not
access on the mmio, however, it can not work if host map the readonly
memory to guest. If the instruction try to write this kind of memory,
it will fault again when guest retry it, then we will goto a infinite
loop: retry instruction -> write #PF -> emulation fail ->
retry instruction -> ...
Fix it by retrying the instruction only when it faults on the writable
memory
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Probably a leftover from the early days of self-patching, p6nops
are marked __initconst_or_module, which causes them to be
discarded in a non-modular kernel. If something later triggers
patching, it will overwrite kernel code with garbage.
Reported-by: Tomas Racek <tracek@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Cc: Michael Tokarev <mjt@tls.msk.ru>
Cc: Borislav Petkov <borislav.petkov@amd.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: qemu-devel@nongnu.org
Cc: Anthony Liguori <anthony@codemonkey.ws>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Alan Cox <alan@linux.intel.com>
Link: http://lkml.kernel.org/r/5034AE84.90708@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When one CPU is going down and this CPU is the last one in irq
affinity, current code is setting cpu_all_mask as the new
affinity for that irq.
But for some systems (such as in Medfield Android mobile) the
firmware sends the interrupt to each CPU in the irq affinity
mask, averaged, and cpu_all_mask includes all potential CPUs,
i.e. offline ones as well.
So replace cpu_all_mask with cpu_online_mask.
Signed-off-by: liu chuansheng <chuansheng.liu@intel.com>
Acked-by: Yanmin Zhang <yanmin_zhang@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/27240C0AC20F114CBF8149A2696CBE4A137286@SHSMSX101.ccr.corp.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This comment is no longer true. We support up to 2^16 CPUs
because __ticket_t is an u16 if NR_CPUS is larger than 256.
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Each page mapped in a process's address space must be correctly
accounted for in _mapcount. Normally the rules for this are
straightforward but hugetlbfs page table sharing is different. The page
table pages at the PMD level are reference counted while the mapcount
remains the same.
If this accounting is wrong, it causes bugs like this one reported by
Larry Woodman:
kernel BUG at mm/filemap.c:135!
invalid opcode: 0000 [#1] SMP
CPU 22
Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi]
Pid: 18001, comm: mpitest Tainted: G W 3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2
RIP: 0010:[<ffffffff8112cfed>] [<ffffffff8112cfed>] __delete_from_page_cache+0x15d/0x170
Process mpitest (pid: 18001, threadinfo ffff880428972000, task ffff880428b5cc20)
Call Trace:
delete_from_page_cache+0x40/0x80
truncate_hugepages+0x115/0x1f0
hugetlbfs_evict_inode+0x18/0x30
evict+0x9f/0x1b0
iput_final+0xe3/0x1e0
iput+0x3e/0x50
d_kill+0xf8/0x110
dput+0xe2/0x1b0
__fput+0x162/0x240
During fork(), copy_hugetlb_page_range() detects if huge_pte_alloc()
shared page tables with the check dst_pte == src_pte. The logic is if
the PMD page is the same, they must be shared. This assumes that the
sharing is between the parent and child. However, if the sharing is
with a different process entirely then this check fails as in this
diagram:
parent
|
------------>pmd
src_pte----------> data page
^
other--------->pmd--------------------|
^
child-----------|
dst_pte
For this situation to occur, it must be possible for Parent and Other to
have faulted and failed to share page tables with each other. This is
possible due to the following style of race.
PROC A PROC B
copy_hugetlb_page_range copy_hugetlb_page_range
src_pte == huge_pte_offset src_pte == huge_pte_offset
!src_pte so no sharing !src_pte so no sharing
(time passes)
hugetlb_fault hugetlb_fault
huge_pte_alloc huge_pte_alloc
huge_pmd_share huge_pmd_share
LOCK(i_mmap_mutex)
find nothing, no sharing
UNLOCK(i_mmap_mutex)
LOCK(i_mmap_mutex)
find nothing, no sharing
UNLOCK(i_mmap_mutex)
pmd_alloc pmd_alloc
LOCK(instantiation_mutex)
fault
UNLOCK(instantiation_mutex)
LOCK(instantiation_mutex)
fault
UNLOCK(instantiation_mutex)
These two processes are not poing to the same data page but are not
sharing page tables because the opportunity was missed. When either
process later forks, the src_pte == dst pte is potentially insufficient.
As the check falls through, the wrong PTE information is copied in
(harmless but wrong) and the mapcount is bumped for a page mapped by a
shared page table leading to the BUG_ON.
This patch addresses the issue by moving pmd_alloc into huge_pmd_share
which guarantees that the shared pud is populated in the same critical
section as pmd. This also means that huge_pte_offset test in
huge_pmd_share is serialized correctly now which in turn means that the
success of the sharing will be higher as the racing tasks see the pud
and pmd populated together.
Race identified and changelog written mostly by Mel Gorman.
{akpm@linux-foundation.org: attempt to make the huge_pmd_share() comment comprehensible, clean up coding style]
Reported-by: Larry Woodman <lwoodman@redhat.com>
Tested-by: Larry Woodman <lwoodman@redhat.com>
Reviewed-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Ken Chen <kenchen@google.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The former conversion to irq_domain_add_legacy() did not fully work
since we miss the irq decs for NR_IRQS_LEGACY+.
Ideally we could use irq_domain_add_simple() or the no-map variant (and
program the virq <-> line mapping directly into ioapic) but this would
require a different irq lookup in "do_IRQ()" and won't work with ACPI
without changes. So this is probably easiest for everyone.
Tested-by: Thierry Reding <thierry.reding@avionic-design.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Grant Likely <grant.likely@secretlab.ca>
Link: http://lkml.kernel.org/r/20120813202304.GA3529@breakpoint.cc
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
There is no need for those functions/variables to be visible. Make them
static and also fix the compile warnings of this sort:
drivers/xen/<some file>.c: warning: symbol '<blah>' was not declared. Should it be static?
Some of them just require including the header file that
declares the functions.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
If a PV guest is booted the native SWIOTLB should not be
turned on. It does not help us (we don't have any PCI devices)
and it eats 64MB of good memory. In the case of PV guests
with PCI devices we need the Xen-SWIOTLB one.
[v1: Rewrite comment per Stefano's suggestion]
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Its pretty easy:
1). We only check to see if we need Xen SWIOTLB for PV guests.
2). If swiotlb=force or iommu=soft is set, then Xen SWIOTLB will
be enabled.
3). If it is an initial domain, then Xen SWIOTLB will be enabled.
4). Native SWIOTLB must be disabled for PV guests.
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
If a 64-bit hypervisor is booted with a 32-bit initial domain,
the hypervisor deals with the initial domain as "compat" and
does some extra adjustments (like pagetables are 4 bytes instead
of 8). It also adjusts the xen_start_info->pt_base incorrectly.
When booted with a 32-bit hypervisor (32-bit initial domain):
..
(XEN) Start info: cf831000->cf83147c
(XEN) Page tables: cf832000->cf8b5000
..
[ 0.000000] PT: cf832000 (f832000)
[ 0.000000] Reserving PT: f832000->f8b5000
And with a 64-bit hypervisor:
(XEN) Start info: 00000000cf831000->00000000cf8314b4
(XEN) Page tables: 00000000cf832000->00000000cf8b6000
[ 0.000000] PT: cf834000 (f834000)
[ 0.000000] Reserving PT: f834000->f8b8000
To deal with this, we keep keep track of the highest physical
address we have reserved via memblock_reserve. If that address
does not overlap with pt_base, we have a gap which we reserve.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
instead of a big memblock_reserve. This way we can be more
selective in freeing regions (and it also makes it easier
to understand where is what).
[v1: Move the auto_translate_physmap to proper line]
[v2: Per Stefano suggestion add more comments]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
It mixed up the p2m_mid_missing with p2m_missing. Also
remove some extra spaces.
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
. Fix include order for bison/flex-generated C files, from Ben Hutchings
. Build fixes and documentation corrections from David Ahern
. Group parsing support, from Jiri Olsa
. UI/gtk refactorings and improvements from Namhyung Kim
. NULL deref fix for perf script, from Namhyung Kim
. Assorted cleanups from Robert Richter
. Let O= makes handle relative paths, from Steven Rostedt
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)
iQIcBAABAgAGBQJQMkGhAAoJENZQFvNTUqpAqjsQAJE5iD1LFogC8o/WjvRHz0TY
Y0x+sR/XfW61KYpeq5g+UaKuFU3P44ijCoyks3y5sza97DkYgUwMpEHlLXFSM8Pp
sNOapqY57s24nq3MLrhH1V9w+cSE+m2u/Gi5fGLCQekio9gkOBwYxNGk7vpKri/n
LBRsMozBu/mZjMy20uWOb7Uk8xsAToh+TFaAtjyQ9Snn9nNJj49NUAp37uN888H/
ducMLq32HN5v/6Zd3q6IWdDWgZsHLkIa3R5FIs/GNe3Dih07gtYLmDol4ktPbTFm
yoaWpP5wbtu/62EZlJwE393vMuoeqN/96394ZZQGFafhHVxN4+rcBhXbejBs0T2b
wk/0CzntW8bbUAI/cl3SB9aui//FWOxcjG9aDQ7PsmHzPw1Q4VD0F9Mcod4p+dRX
PsA9q/tST1eAiwzWYthDtj81U7iChINcXKhoZn2xn6+0+aMH+6FFNBmCH8MR5aCU
BvrXhTJjvau/Ym/sILl4Tf4wfssTq49yMsn/YKCwLJ0hg0XlTObWfQRy2MOayXH9
NJvUE+9GSXoTEKhmr1AfTYEG9vObaXZyFwAI74xvPPwUYojCb4ZjEKmG0egW+VGk
IJKFCaJZwwVsGau4aIbFAMP12/L8Qs/Ox91ddCJ0j5TIlSGMaqW5lbV1N1crzlTT
a0GsN49NvhbFttBXrcNX
=0a2X
-----END PGP SIGNATURE-----
Merge tag 'perf-core-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core
Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo:
* Fix include order for bison/flex-generated C files, from Ben Hutchings
* Build fixes and documentation corrections from David Ahern
* Group parsing support, from Jiri Olsa
* UI/gtk refactorings and improvements from Namhyung Kim
* NULL deref fix for perf script, from Namhyung Kim
* Assorted cleanups from Robert Richter
* Let O= makes handle relative paths, from Steven Rostedt
* perf script python fixes, from Feng Tang.
* Improve 'perf lock' error message when the needed tracepoints
are not present, from David Ahern.
* Initial bash completion support, from Frederic Weisbecker
* Allow building without libelf, from Namhyung Kim.
* Support DWARF CFI based unwind to have callchains when %bp
based unwinding is not possible, from Jiri Olsa.
* Symbol resolution fixes, while fixing support PPC64 files with an .opt ELF
section was the end goal, several fixes for code that handles all
architectures and cleanups are included, from Cody Schafer.
* Add a description for the JIT interface, from Andi Kleen.
* Assorted fixes for Documentation and build in 32 bit, from Robert Richter
* Add support for non-tracepoint events in perf script python, from Feng Tang
* Cache the libtraceevent event_format associated to each evsel early, so that we
avoid relookups, i.e. calling pevent_find_event repeatedly when processing
tracepoint events.
[ This is to reduce the surface contact with libtraceevents and make clear what
is that the perf tools needs from that lib: so far parsing the common and per
event fields. ]
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull ftrace updates from Steve Rostedt:
" This patch series extends ftrace function tracing utility to be
more dynamic for its users. It allows for data passing to the callback
functions, as well as reading regs as if a breakpoint were to trigger
at function entry.
The main goal of this patch series was to allow kprobes to use ftrace
as an optimized probe point when a probe is placed on an ftrace nop.
With lots of help from Masami Hiramatsu, and going through lots of
iterations, we finally came up with a good solution. "
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The named constant RAMDISK is unused. It used to set the (obsolete)
kernel boot header field ram_size, but its usage for that purpose got
dropped in commit 5e47c478b0 ("x86: remove
zImage support"). Now remove this constant too.
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Link: http://lkml.kernel.org/r/1345396003.1771.9.camel@x61.thuisdomein
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Pull x86 fixes from Ingo Molnar.
A x32 socket ABI fix with a -stable backport tag among other fixes.
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x32: Use compat shims for {g,s}etsockopt
Revert "x86-64/efi: Use EFI to deal with platform wall clock"
x86, apic: fix broken legacy interrupts in the logical apic mode
x86, build: Globally set -fno-pic
x86, avx: don't use avx instructions with "noxsave" boot param
Some of the arguments to {g,s}etsockopt are passed in userland pointers.
If we try to use the 64bit entry point, we end up sometimes failing.
For example, dhcpcd doesn't run in x32:
# dhcpcd eth0
dhcpcd[1979]: version 5.5.6 starting
dhcpcd[1979]: eth0: broadcasting for a lease
dhcpcd[1979]: eth0: open_socket: Invalid argument
dhcpcd[1979]: eth0: send_raw_packet: Bad file descriptor
The code in particular is getting back EINVAL when doing:
struct sock_fprog pf;
setsockopt(s, SOL_SOCKET, SO_ATTACH_FILTER, &pf, sizeof(pf));
Diving into the kernel code, we can see:
include/linux/filter.h:
struct sock_fprog {
unsigned short len;
struct sock_filter __user *filter;
};
net/core/sock.c:
case SO_ATTACH_FILTER:
ret = -EINVAL;
if (optlen == sizeof(struct sock_fprog)) {
struct sock_fprog fprog;
ret = -EFAULT;
if (copy_from_user(&fprog, optval, sizeof(fprog)))
break;
ret = sk_attach_filter(&fprog, sk);
}
break;
arch/x86/syscalls/syscall_64.tbl:
54 common setsockopt sys_setsockopt
55 common getsockopt sys_getsockopt
So for x64, sizeof(sock_fprog) is 16 bytes. For x86/x32, it's 8 bytes.
This comes down to the pointer being 32bit for x32, which means we need
to do structure size translation. But since x32 comes in directly to
sys_setsockopt, it doesn't get translated like x86.
After changing the syscall table and rebuilding glibc with the new kernel
headers, dhcp runs fine in an x32 userland.
Oddly, it seems like Linus noted the same thing during the initial port,
but I guess that was missed/lost along the way:
https://lkml.org/lkml/2011/8/26/452
[ hpa: tagging for -stable since this is an ABI fix. ]
Bugzilla: https://bugs.gentoo.org/423649
Reported-by: Mads <mads@ab3.no>
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Link: http://lkml.kernel.org/r/1345320697-15713-1-git-send-email-vapier@gentoo.org
Cc: H. J. Lu <hjl.tools@gmail.com>
Cc: <stable@vger.kernel.org> v3.4..v3.5
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
If P2M leaf is completly packed with INVALID_P2M_ENTRY or with
1:1 PFNs (so IDENTITY_FRAME type PFNs), we can swap the P2M leaf
with either a p2m_missing or p2m_identity respectively. The old
page (which was created via extend_brk or was grafted on from the
mfn_list) can be re-used for setting new PFNs.
This also means we can remove git commit:
5bc6f9888d
xen/p2m: Reserve 8MB of _brk space for P2M leafs when populating back
which tried to fix this.
and make the amount that is required to be reserved much smaller.
CC: stable@vger.kernel.org # for 3.5 only.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
* On machines with large MMIO/PCI E820 spaces we fail to boot b/c
we failed to pre-allocate large enough virtual space for extend_brk.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQEcBAABAgAGBQJQKlV9AAoJEFjIrFwIi8fJZh4H/0ZlRrgG+8mqwCM+pcyYY+2a
zqnOrfYUO/aO26oqiOQUrn4quLAElhBuJK19uSj8fckMMZ+sr5rTJTaXmT6b7F7N
pgTXsKQCYAJ2NNGHVSQ73KYjOUeEW3woDSQZo0y/GRzOjiQsxpoFc8PS94ZieUNT
G6a8ECZBRv3fz8nAuJlhGV/suqHGOLJ0pwum1gHGOzaH3ZoZVtaQv5LhGYctJspU
yF5bdeD0qjCbseVtJ72tyxzLxMwLpJtdy2MbSwIv5JGuszj0nRmL4oa7Vc4vYdyv
p+FrNmbDAZ1j61z1PhBZPmgzwba2LTXtIWhR2zsGJgqlJNzMUtlNkff1kT3NeE0=
=Gl6V
-----END PGP SIGNATURE-----
Merge tag 'stable/for-linus-3.6-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen
Pull Xen fix from Konrad Rzeszutek Wilk:
"Way back in v3.5 we added a mechanism to populate back pages that were
released (they overlapped with MMIO regions), but neglected to reserve
the proper amount of virtual space for extend_brk to work properly.
Coincidentally some other commit aligned the _brk space to larger area
so I didn't trigger this until it was run on a machine with more than
2GB of MMIO space."
* On machines with large MMIO/PCI E820 spaces we fail to boot b/c
we failed to pre-allocate large enough virtual space for extend_brk.
* tag 'stable/for-linus-3.6-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
xen/p2m: Reserve 8MB of _brk space for P2M leafs when populating back.
This reverts commit 00e37bdb01.
During shutdown of PVHVM guests with more than 2VCPUs on certain
machines we can hit the race where the replaced shared_info is not
replaced fast enough and the PV time clock retries reading the same
area over and over without any any success and is stuck in an
infinite loop.
Acked-by: Olaf Hering <olaf@aepfle.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
else, host continues to update stealtime after reboot,
which can corrupt e.g. initramfs area.
found when tracking down initramfs unpack error on initial reboot
(with qemu-kvm -smp 2, no problem with single-core).
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
We validate irq pin number when routing is setup, so
code handling illegal irq # in pic and ioapic on each injection
is never called.
Drop it, replace with BUG_ON to catch out of bounds access bugs.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
This reverts commit bacef661ac.
This commit has been found to cause serious regressions on a number of
ASUS machines at the least. We probably need to provide a 1:1 map in
addition to the EFI virtual memory map in order for this to work.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Reported-and-bisected-by: Jérôme Carretero <cJ-ko@zougloub.eu>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Matt Fleming <matt.fleming@intel.com>
Cc: Matthew Garrett <mjg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120805172903.5f8bb24c@zougloub.eu
Recent commit 332afa656e cleaned up
a workaround that updates irq_cfg domain for legacy irq's that
are handled by the IO-APIC. This was assuming that the recent
changes in assign_irq_vector() were sufficient to remove the workaround.
But this broke couple of AMD platforms. One of them seems to be
sending interrupts to the offline cpu's, resulting in spurious
"No irq handler for vector xx (irq -1)" messages when those cpu's come online.
And the other platform seems to always send the interrupt to the last logical
CPU (cpu-7). Recent changes had an unintended side effect of using only logical
cpu-0 in the IO-APIC RTE (during boot for the legacy interrupts) and this
broke the legacy interrupts not getting routed to the cpu-7 on the AMD
platform, resulting in a boot hang.
For now, reintroduce the removed workaround, (essentially not allowing the
vector to change for legacy irq's when io-apic starts to handle the irq. Which
also addressed the uninteded sife effect of just specifying cpu-0 in the
IO-APIC RTE for those irq's during boot).
Reported-and-tested-by: Robert Richter <robert.richter@amd.com>
Reported-and-tested-by: Borislav Petkov <bp@amd64.org>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1344453412.29170.5.camel@sbsiddha-desk.sc.intel.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
All processors that support VMX have that feature, and guests (Xen) depend on
it. As we already implement it, advertize it to the guest.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
MSR_IA32_DEBUGCTLMSR is zeroed on VMEXIT. Restore it to the correct
value.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
kvm_guest_time_update unconditionally clears hv_clock.flags field,
so the notification never reaches the guest.
Fix it by allowing PVCLOCK_GUEST_STOPPED to passthrough.
Reviewed-by: Eric B Munson <emunson@mgebm.net>
Reviewed-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
If PMU counter has PEBS enabled it is not enough to disable counter
on a guest entry since PEBS memory write can overshoot guest entry
and corrupt guest memory. Disabling PEBS during guest entry solves
the problem.
Tested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120809085234.GI3341@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The Westmere-EX uncore is similar to the Nehalem-EX uncore. The
differences are:
- Westmere-EX uncore has 10 instances of Cbox. The MSRs for Cbox8
and Cbox9 in the Westmere-EX aren't contiguous with Cbox 0~7.
- The fvid field in the ZDP_CTL_FVC register in the Mbox is
different. It's 5 bits in the Nehalem-EX, 6 bits in the
Westmere-EX.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1344229882-3907-3-git-send-email-zheng.z.yan@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch includes following fixes and update:
- Only some events in the Sbox and Mbox can use the match/mask
registers, add code to check this.
- The format definitions for xbr_mm_cfg and xbr_match registers
in the Rbox are wrong, xbr_mm_cfg should use 32 bits, xbr_match
should use 64 bits.
- Cleanup the Rbox code. Compute the addresses extra registers in
the enable_event function instead of the hw_config function.
This simplifies the code in nhmex_rbox_alter_er().
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1344229882-3907-2-git-send-email-zheng.z.yan@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Fix the following section mismatch:
WARNING: arch/x86/kernel/cpu/built-in.o(.text+0x7ad9): Section mismatch in reference from the function uncore_types_exit() to the function .init.text:uncore_type_exit()
The function uncore_types_exit() references the function __init
uncore_type_exit(). This is often because uncore_types_exit lacks a
__init annotation or the annotation of uncore_type_exit is wrong.
caused by 14371cce03 ("perf: Add generic PCI uncore PMU device
support").
Cc: Zheng Yan <zheng.z.yan@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1339741902-8449-8-git-send-email-zheng.z.yan@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
GCC built with nonstandard options can enable -fpic by default.
We never want this for 32-bit kernels and it will break the build.
[ hpa: Notably the Android toolchain apparently does this. ]
Change-Id: Iaab7d66e598b1c65ac4a4f0229eca2cd3d0d2898
Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
Link: http://lkml.kernel.org/r/1344624546-29691-1-git-send-email-andrew.p.boie@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Introducing PERF_SAMPLE_STACK_USER sample type bit to trigger the dump
of the user level stack on sample. The size of the dump is specified by
sample_stack_user value.
Being able to dump parts of the user stack, starting from the stack
pointer, will be useful to make a post mortem dwarf CFI based stack
unwinding.
Added HAVE_PERF_USER_STACK_DUMP config option to determine if the
architecture provides user stack dump on perf event samples. This needs
access to the user stack pointer which is not unified across
architectures. Enabling this for x86 architecture.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Original-patch-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: "Frank Ch. Eigler" <fche@redhat.com>
Cc: Arun Sharma <asharma@fb.com>
Cc: Benjamin Redelings <benjamin.redelings@nescent.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Frank Ch. Eigler <fche@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Ulrich Drepper <drepper@gmail.com>
Link: http://lkml.kernel.org/r/1344345647-11536-6-git-send-email-jolsa@redhat.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Adding a generic way to use __output_copy function with specific copy
function via DEFINE_PERF_OUTPUT_COPY macro.
Using this to add new __output_copy_user function, that provides output
copy from user pointers. For x86 the copy_from_user_nmi function is used
and __copy_from_user_inatomic for the rest of the architectures.
This new function will be used in user stack dump on sample, coming in
next patches.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Cc: "Frank Ch. Eigler" <fche@redhat.com>
Cc: Arun Sharma <asharma@fb.com>
Cc: Benjamin Redelings <benjamin.redelings@nescent.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Frank Ch. Eigler <fche@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Ulrich Drepper <drepper@gmail.com>
Link: http://lkml.kernel.org/r/1344345647-11536-4-git-send-email-jolsa@redhat.com
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Introducing PERF_SAMPLE_REGS_USER sample type bit to trigger the dump of
user level registers on sample. Registers we want to dump are specified
by sample_regs_user bitmask.
Only user level registers are dumped at the moment. Meaning the register
values of the user space context as it was before the user entered the
kernel for whatever reason (syscall, irq, exception, or a PMI happening
in userspace).
The layout of the sample_regs_user bitmap is described in
asm/perf_regs.h for archs that support register dump.
This is going to be useful to bring Dwarf CFI based stack unwinding on
top of samples.
Original-patch-by: Frederic Weisbecker <fweisbec@gmail.com>
[ Dump registers ABI specification. ]
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Suggested-by: Stephane Eranian <eranian@google.com>
Cc: "Frank Ch. Eigler" <fche@redhat.com>
Cc: Arun Sharma <asharma@fb.com>
Cc: Benjamin Redelings <benjamin.redelings@nescent.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Frank Ch. Eigler <fche@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Ulrich Drepper <drepper@gmail.com>
Link: http://lkml.kernel.org/r/1344345647-11536-3-git-send-email-jolsa@redhat.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
This brings a new API to help the selective dump of registers on event
sampling, and its implementation for x86 arch.
Added HAVE_PERF_REGS config option to determine if the architecture
provides perf registers ABI.
The information about desired registers will be passed in u64 mask.
It's up to the architecture to map the registers into the mask bits.
For the x86 arch implementation, both 32 and 64 bit registers bits are
defined within single enum to ensure 64 bit system can provide register
dump for compat task if needed in the future.
Original-patch-by: Frederic Weisbecker <fweisbec@gmail.com>
[ Added missing linux/errno.h include ]
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Cc: "Frank Ch. Eigler" <fche@redhat.com>
Cc: Arun Sharma <asharma@fb.com>
Cc: Benjamin Redelings <benjamin.redelings@nescent.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Frank Ch. Eigler <fche@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Ulrich Drepper <drepper@gmail.com>
Link: http://lkml.kernel.org/r/1344345647-11536-2-git-send-email-jolsa@redhat.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
On Intel systems corrected machine check interrupts (CMCI) may be sent to
multiple logical processors; possibly to all processors on the affected
socket (SDM Volume 3B "15.5.1 CMCI Local APIC Interface"). This means
that a persistent error (such as a stuck bit in ECC memory) may cause
a storm of interrupts that greatly hinders or prevents forward progress
(probably on many processors).
To solve this we keep track of the rate at which each processor sees
CMCI. If we exceed a threshold, we disable CMCI delivery and switch to
polling the machine check banks. If the storm subsides (none of the
affected processors see any more errors for a complete poll interval) we
re-enable CMCI.
[Tony: Added console messages when storm begins/ends and increased storm
threshold from 5 to 15 so we have a few more logged entries before we
disable interrupts and start dropping reports]
Signed-off-by: Chen Gong <gong.chen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Chen Gong <gong.chen@linux.intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
cmci_discover() works out which machine check banks support CMCI, and
which of those are shared by multiple logical processors. It uses this
information to ensure that exactly one cpu is designated the owner of
each bank so that when interrupts are broadcast to multiple cpus, only one
of them will look in a shared bank to log the error and clear the bank.
At boot time cmci_discover() performs this task silently. But during
certain cpu hotplug operations it prints out a set of summary lines
like this:
CPU 35 MCA banks CMCI:0 CMCI:1 CMCI:3 CMCI:5 CMCI:6 CMCI:7 CMCI:8 CMCI:9 CMCI:10 CMCI:11
CPU 1 MCA banks CMCI:0 CMCI:1 CMCI:3
CPU 39 MCA banks CMCI:0 CMCI:1 CMCI:3
CPU 38 MCA banks CMCI:0 CMCI:1 CMCI:3
CPU 32 MCA banks CMCI:0 CMCI:1 CMCI:3
CPU 37 MCA banks CMCI:0 CMCI:1 CMCI:3
CPU 36 MCA banks CMCI:0 CMCI:1 CMCI:3
CPU 34 MCA banks CMCI:0 CMCI:1 CMCI:3
The value of these messages seems very low. A user might painstakingly
cross-check against the data sheet for a processor to ensure that all
CMCI supported banks are correctly reported, but this seems improbable.
If users really wanted to do this, we should print the information at
boot time too.
Remove the messages.
Signed-off-by: Tony Luck <tony.luck@intel.com>
For apic_set_spiv() to track APIC SW state correctly it needs to see
previous and next values of the spurious vector register, but currently
memset() overwrite the old value before apic_set_spiv() get a chance to
do tracking. Fix it by calling apic_set_spiv() before overwriting old
value.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Clear AVX, AVX2 features along with clearing XSAVE feature bits,
as part of the parsing "noxsave" parameter.
Fixes the kernel boot panic with "noxsave" boot parameter.
We could have checked cpu_has_osxsave along with cpu_has_avx etc, but Peter
mentioned clearing the feature bits will be better for uses like
static_cpu_has() etc.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1343755754.2041.2.camel@sbsiddha-desk.sc.intel.com
Cc: <stable@vger.kernel.org> # v3.5
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Compile events.c on ARM.
Parse, map and enable the IRQ to get event notifications from the device
tree (node "/xen").
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
All the original Xen headers have xen_ulong_t as unsigned long type, however
when they have been imported in Linux, xen_ulong_t has been replaced with
unsigned long. That might work for x86 and ia64 but it does not for arm.
Bring back xen_ulong_t and let each architecture define xen_ulong_t as they
see fit.
Also explicitly size pointers (__DEFINE_GUEST_HANDLE) to 64 bit.
Changes in v3:
- remove the incorrect changes to multicall_entry;
- remove the change to apic_physbase.
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Run the mprotect.c microbenchmark on all our families >= K8 and preset
the flushall shift variable accordingly.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/1344272439-29080-5-git-send-email-bp@amd64.org
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Push the max CPUID leaf check into the ->detect_tlb function and remove
general test case from the generic path.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/1344272439-29080-3-git-send-email-bp@amd64.org
Acked-by: Alex Shi <alex.shi@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
The TLB characteristics appeared like this in dmesg:
[ 0.065817] Last level iTLB entries: 4KB 512, 2MB 1024, 4MB 512
[ 0.065817] Last level dTLB entries: 4KB 1024, 2MB 1024, 4MB 512
[ 0.065817] tlb_flushall_shift is 0xffffffff
where tlb_flushall_shift is actually -1 but dumped as a hex number.
However, the Kconfig option CONFIG_DEBUG_TLBFLUSH and the rest of the
code treats this as a signed decimal and states "If you set it to -1,
the code flushes the whole TLB unconditionally."
So, fix its formatting in accordance with the other references to it.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/1344272439-29080-2-git-send-email-bp@amd64.org
Acked-by: Alex Shi <alex.shi@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Those functions are used during interrupt injection. When inlined they
become nops on the fast path.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Usually all vcpus have local apic pointer initialized, so the check may
be completely skipped.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Usually all APICs are SW enabled so the check can be optimized out.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Usually all APICs are HW enabled so the check can be optimized out.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Do not change apic_base directly. Use kvm_lapic_set_base() instead.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
kvm_get_apic_base() needlessly checks irqchip_in_kernel although it does
the same no matter what result of the check is. kvm_set_apic_base() also
checks for irqchip_in_kernel, but kvm_lapic_set_base() can handle this
case.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
After commit a2766325cf, the error page is replaced by the
error code, it need not be released anymore
[ The patch has been compiling tested for powerpc ]
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
After commit a2766325cf, the error pfn is replaced by the
error code, it need not be released anymore
[ The patch has been compiling tested for powerpc ]
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Then, get_hwpoison_pfn and is_hwpoison_pfn can be removed
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
After that, the exported and un-inline function, get_fault_pfn,
can be removed
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Two reasons:
- x86 can integrate rmap and rmap_pde and remove heuristics in
__gfn_to_rmap().
- Some architectures do not need rmap.
Since rmap is one of the most memory consuming stuff in KVM, ppc'd
better restrict the allocation to Book3S HV.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Avi Kivity <avi@redhat.com>
This helps to make rmap architecture specific in a later patch.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
Instead, check npages consistently. This helps to make rmap
architecture specific in a later patch.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
When MSR_KVM_PV_EOI_EN was added to msrs_to_save array
KVM_SAVE_MSRS_BEGIN was not updated accordingly.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
- bring back critical fixes (esp. aa67f6096c)
- provide an updated base for development
* upstream: (4334 commits)
missed mnt_drop_write() in do_dentry_open()
UBIFS: nuke pdflush from comments
gfs2: nuke pdflush from comments
drbd: nuke pdflush from comments
nilfs2: nuke write_super from comments
hfs: nuke write_super from comments
vfs: nuke pdflush from comments
jbd/jbd2: nuke write_super from comments
btrfs: nuke pdflush from comments
btrfs: nuke write_super from comments
ext4: nuke pdflush from comments
ext4: nuke write_super from comments
ext3: nuke write_super from comments
Documentation: fix the VM knobs descritpion WRT pdflush
Documentation: get rid of write_super
vfs: kill write_super and sync_supers
ACPI processor: Fix tick_broadcast_mask online/offline regression
ACPI: Only count valid srat memory structures
ACPI: Untangle a return statement for better readability
Linux 3.6-rc1
...
Signed-off-by: Avi Kivity <avi@redhat.com>
Pull ACPI and power management fixes from Len Brown:
"A 3.3 sleep regression fixed, numa bugfix, plus some minor cleanups"
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux:
ACPI processor: Fix tick_broadcast_mask online/offline regression
ACPI: Only count valid srat memory structures
ACPI: Untangle a return statement for better readability
ACPI / PCI: Do not try to acquire _OSC control if that is hopeless
ACPI: delete _GTS/_BFS support
ACPI/x86: revert 'x86, acpi: Call acpi_enter_sleep_state via an asmlinkage C function from assembler'
ACPI: replace strlen("string") with sizeof("string") -1
ACPI / PM: Fix build warning in sleep.c for CONFIG_ACPI_SLEEP unset
No point in having double cases if we can simply mask the FROZEN bit
out.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chen Gong <gong.chen@linux.intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Split timer init function into the init and the start part, so the
start part can replace the open coded version in CPU_DOWN_FAILED.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chen Gong <gong.chen@linux.intel.com>
Acked-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
raise_mce() fiddles with global state, but lacks any kind of
serialization.
Add a mutex around the raise_mce() call, so concurrent writers do not
stomp on each other toes.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chen Gong <gong.chen@linux.intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
raise_mce() has a code path which does not disable preemption when the
raise_local() is called. The per cpu variable access in raise_local()
depends on preemption being disabled to be functional. So that code
path was either never tested or never tested with CONFIG_DEBUG_PREEMPT
enabled.
Add the missing preempt_disable/enable() pair around the call.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chen Gong <gong.chen@linux.intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
When MSR_KVM_PV_EOI_EN was added to msrs_to_save array
KVM_SAVE_MSRS_BEGIN was not updated accordingly.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Pull x86 fixes from Ingo Molnar:
"Various fixes"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86-64, kcmp: The kcmp system call can be common
arch/x86/kernel/kdebugfs.c: Ensure a consistent return value in error case
x86/mce: Add quirk for instruction recovery on Sandy Bridge processors
x86/mce: Move MCACOD defines from mce-severity.c to <asm/mce.h>
x86/ioapic: Fix NULL pointer dereference on CPU hotplug after disabling irqs
x86, nops: Missing break resulting in incorrect selection on Intel
x86: CONFIG_CC_STACKPROTECTOR=y is no longer experimental
Pull perf fixes from Ingo Molnar:
"Fix merge window fallout and fix sleep profiling (this was always
broken, so it's not a fix for the merge window - we can skip this one
from the head of the tree)."
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/trace: Add ability to set a target task for events
perf/x86: Fix USER/KERNEL tagging of samples properly
perf/x86/intel/uncore: Make UNCORE_PMU_HRTIMER_INTERVAL 64-bit
Otherwise you could run into:
WARN_ON in numa_register_memblks(), because node_possible_map is zero
References: https://bugzilla.novell.com/show_bug.cgi?id=757888
On this machine (ProLiant ML570 G3) the SRAT table contains:
- No processor affinities
- One memory affinity structure (which is set disabled)
CC: Per Jessen <per@opensuse.org>
CC: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Len Brown <len.brown@intel.com>
Pull OLPC platform updates from Andres Salomon:
"These move the OLPC Embedded Controller driver out of
arch/x86/platform and into drivers/platform/olpc.
OLPC machines are now ARM-based (which means lots of x86 and ARM
changes), but are typically pretty self-contained.. so it makes more
sense to go through a separate OLPC tree after getting the appropriate
review/ACKs."
* 'for-linus-3.6' of git://dev.laptop.org/users/dilinger/linux-olpc:
x86: OLPC: move s/r-related EC cmds to EC driver
Platform: OLPC: move global variables into priv struct
Platform: OLPC: move debugfs support from x86 EC driver
x86: OLPC: switch over to using new EC driver on x86
Platform: OLPC: add a suspended flag to the EC driver
Platform: OLPC: turn EC driver into a platform_driver
Platform: OLPC: allow EC cmd to be overridden, and create a workqueue to call it
drivers: OLPC: update various drivers to include olpc-ec.h
Platform: OLPC: add a stub to drivers/platform/ for the OLPC EC driver
When we release pages back during bootup:
Freeing 9d-100 pfn range: 99 pages freed
Freeing 9cf36-9d0d2 pfn range: 412 pages freed
Freeing 9f6bd-9f6bf pfn range: 2 pages freed
Freeing 9f714-9f7bf pfn range: 171 pages freed
Freeing 9f7e0-9f7ff pfn range: 31 pages freed
Freeing 9f800-100000 pfn range: 395264 pages freed
Released 395979 pages of unused memory
We then try to populate those pages back. In the P2M tree however
the space for those leafs must be reserved - as such we use extend_brk.
We reserve 8MB of _brk space, which means we can fit over
1048576 PFNs - which is more than we should ever need.
Without this, on certain compilation of the kernel we would hit:
(XEN) domain_crash_sync called from entry.S
(XEN) CPU: 0
(XEN) RIP: e033:[<ffffffff818aad3b>]
(XEN) RFLAGS: 0000000000000206 EM: 1 CONTEXT: pv guest
(XEN) rax: ffffffff81a7c000 rbx: 000000000000003d rcx: 0000000000001000
(XEN) rdx: ffffffff81a7b000 rsi: 0000000000001000 rdi: 0000000000001000
(XEN) rbp: ffffffff81801cd8 rsp: ffffffff81801c98 r8: 0000000000100000
(XEN) r9: ffffffff81a7a000 r10: 0000000000000001 r11: 0000000000000003
(XEN) r12: 0000000000000004 r13: 0000000000000004 r14: 000000000000003d
(XEN) r15: 00000000000001e8 cr0: 000000008005003b cr4: 00000000000006f0
(XEN) cr3: 0000000125803000 cr2: 0000000000000000
(XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: e02b cs: e033
(XEN) Guest stack trace from rsp=ffffffff81801c98:
.. which is extend_brk hitting a BUG_ON.
Interestingly enough, most of the time we are not going to hit this
b/c the _brk space is quite large (v3.5):
ffffffff81a25000 B __brk_base
ffffffff81e43000 B __brk_limit
= ~4MB.
vs earlier kernels (with this back-ported), the space is smaller:
ffffffff81a25000 B __brk_base
ffffffff81a7b000 B __brk_limit
= 344 kBytes.
where we would certainly hit this and hit extend_brk.
Note that git commit c3d93f8801
(xen: populate correct number of pages when across mem boundary (v2))
exposed this bug).
[v1: Made it 8MB of _brk space instead of 4MB per Jan's suggestion]
CC: stable@vger.kernel.org #only for 3.5
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Pull UML fixes from Richard Weinberger:
"This patch set contains mostly fixes and cleanups. The UML tty driver
uses now tty_port and is no longer broken like hell :-)"
* 'for-linus-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml:
um: Add arch/x86/um to MAINTAINERS
um: pass siginfo to guest process
um: fix ubd_file_size for read-only files
um: pull interrupt_end() into userspace()
um: split syscall_trace(), pass pt_regs to it
um: switch UPT_SET_RETURN_VALUE and regs_return_value to pt_regs
um: set BLK_CGROUP=y in defconfig
um: remove count_lock
um: fully use tty_port
um: Remove dead code
um: remove line_ioctl()
TTY: um/line, use tty from tty_port
TTY: um/line, add tty_port
Commit b2da15ac26 ("KVM: VMX: Optimize %ds, %es reload") broke i386
in the following scenario:
vcpu_load
...
vmx_save_host_state
vmx_vcpu_run
(ds.rpl, es.rpl cleared by hardware)
interrupt
push ds, es # pushes bad ds, es
schedule
vmx_vcpu_put
vmx_load_host_state
reload ds, es (with __USER_DS)
pop ds, es # of other thread's stack
iret
# other thread runs
interrupt
push ds, es
schedule # back in vcpu thread
pop ds, es # now with rpl=0
iret
...
vcpu_put
resume_userspace
iret # clears ds, es due to mismatched rpl
(instead of resume_userspace, we might return with SYSEXIT and then
take an exception; when the exception IRETs we end up with cleared
ds, es)
Fix by avoiding the optimization on i386 and reloading ds, es on the
lightweight exit path.
Reported-by: Chris Clayron <chris2553@googlemail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
We already use the same system call handler for i386 and x86-64, there
is absolutely no reason x32 can't use the same system call, too.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: H.J. Lu <hjl.tools@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: <stable@vger.kernel.org> v3.5
Link: http://lkml.kernel.org/n/tip-vwzk3qbcr3yjyxjg2j38vgy9@git.kernel.org
When a guest migrates to a new host, the system time difference from the
previous host is used in the updates to the kvmclock system time visible
to the guest, resulting in a continuation of correct kvmclock based guest
timekeeping.
The wall clock component of the kvmclock provided time is currently not
updated with this same time offset. Since the Linux guest caches the
wall clock based time, this discrepency is not noticed until the guest is
rebooted. After reboot the guest's time calculations are off.
This patch adjusts the wall clock by the kvmclock_offset, resulting in
correct guest time after a reboot.
Cc: Zachary Amsden <zamsden@gmail.com>
Signed-off-by: Bruce Rogers <brogers@suse.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Initialization of cra_list is currently mixed, most ciphers initialize this
field and most shashes do not. Initialization however is not needed at all
since cra_list is initialized/overwritten in __crypto_register_alg() with
list_add(). Therefore perform cleanup to remove all unneeded initializations
of this field in 'arch/x86/crypto/'.
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The new EC driver calls platform-specific suspend and resume hooks; run
XO-1-specific EC commands from there, rather than deep in s/r code. If we
attempt to run EC commands after the new EC driver has suspended, it is
refused by the ec->suspended checks.
Signed-off-by: Andres Salomon <dilinger@queued.net>
Acked-by: Paul Fox <pgf@laptop.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
There's nothing about the debugfs interface for the EC driver that is
architecture-specific, so move it into the arch-independent driver.
The code is mostly unchanged with the exception of renamed variables, coding
style changes, and API updates.
Signed-off-by: Andres Salomon <dilinger@queued.net>
Acked-by: Paul Fox <pgf@laptop.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
This uses the new EC driver framework in drivers/platform/olpc. The
XO-1 and XO-1.5-specific code is still in arch/x86, but the generic stuff
(including a new workqueue; no more running EC commands with IRQs disabled!)
can be shared with other architectures.
Signed-off-by: Andres Salomon <dilinger@queued.net>
Acked-by: Paul Fox <pgf@laptop.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Switch over to using olpc-ec.h in multiple steps, so as not to break builds.
This covers every driver that calls olpc_ec_cmd().
Signed-off-by: Andres Salomon <dilinger@queued.net>
Acked-by: Paul Fox <pgf@laptop.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
The OLPC EC driver has outgrown arch/x86/platform/. It's time to both
share common code amongst different architectures, as well as move it out
of arch/x86/. The XO-1.75 is ARM-based, and the EC driver shares a lot of
code with the x86 code.
Signed-off-by: Andres Salomon <dilinger@queued.net>
Acked-by: Paul Fox <pgf@laptop.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
One structure nests inside the other, providing no value at all.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
'timer_mode_mask' is unused
'tscdeadline' is unused
't_ops' only adds needless indirection
'vcpu' is unused
Remove.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
'reinject' is never initialized
't_ops' only serves as indirection to lapic_is_periodic; call that directly
instead
'kvm' is never used
'vcpu' can be derived via container_of
Remove these fields.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
kvm_timer_fn(), the sole inhabitant of timer.c, is only used by lapic.c. Move
it there to make it easier to hack on it.
struct kvm_timer is a thin wrapper around hrtimer, and only adds obfuscation.
Move near its two users (with different names) to prepare for simplification.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Pull perf updates from Ingo Molnar:
"The biggest changes are Intel Nehalem-EX PMU uncore support, uprobes
updates/cleanups/fixes from Oleg and diverse tooling updates (mostly
fixes) now that Arnaldo is back from vacation."
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
uprobes: __replace_page() needs munlock_vma_page()
uprobes: Rename vma_address() and make it return "unsigned long"
uprobes: Fix register_for_each_vma()->vma_address() check
uprobes: Introduce vaddr_to_offset(vma, vaddr)
uprobes: Teach build_probe_list() to consider the range
uprobes: Remove insert_vm_struct()->uprobe_mmap()
uprobes: Remove copy_vma()->uprobe_mmap()
uprobes: Fix overflow in vma_address()/find_active_uprobe()
uprobes: Suppress uprobe_munmap() from mmput()
uprobes: Uprobe_mmap/munmap needs list_for_each_entry_safe()
uprobes: Clean up and document write_opcode()->lock_page(old_page)
uprobes: Kill write_opcode()->lock_page(new_page)
uprobes: __replace_page() should not use page_address_in_vma()
uprobes: Don't recheck vma/f_mapping in write_opcode()
perf/x86: Fix missing struct before structure name
perf/x86: Fix format definition of SNB-EP uncore QPI box
perf/x86: Make bitfield unsigned
perf/x86: Fix LLC-* and node-* events on Intel SandyBridge
perf/x86: Add Intel Nehalem-EX uncore support
perf/x86: Fix typo in format definition of uncore PCU filter
...
Some PMUs don't provide a full register set for their sample,
specifically 'advanced' PMUs like AMD IBS and Intel PEBS which provide
'better' than regular interrupt accuracy.
In this case we use the interrupt regs as basis and over-write some
fields (typically IP) with different information.
The perf core however uses user_mode() to distinguish user/kernel
samples, user_mode() relies on regs->cs. If the interrupt skid pushed
us over a boundary the new IP might not be in the same domain as the
interrupt.
Commit ce5c1fe9a9 ("perf/x86: Fix USER/KERNEL tagging of samples")
tried to fix this by making the perf core use kernel_ip(). This
however is wrong (TM), as pointed out by Linus, since it doesn't allow
for VM86 and non-zero based segments in IA32 mode.
Therefore, provide a new helper to set the regs->ip field,
set_linear_ip(), which massages the regs into a suitable state
assuming the provided IP is in fact a linear address.
Also modify perf_instruction_pointer() and perf_callchain_user() to
deal with segments base offsets.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1341910954.3462.102.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
i386 allmodconfig:
arch/x86/kernel/cpu/perf_event_intel_uncore.c: In function 'uncore_pmu_hrtimer':
arch/x86/kernel/cpu/perf_event_intel_uncore.c:728: warning: integer overflow in expression
arch/x86/kernel/cpu/perf_event_intel_uncore.c: In function 'uncore_pmu_start_hrtimer':
arch/x86/kernel/cpu/perf_event_intel_uncore.c:735: warning: integer overflow in expression
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Zheng Yan <zheng.z.yan@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-h84qlqj02zrojmxxybzmy9hi@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add function tracer based kprobe optimization support
handlers on x86. This allows kprobes to use function
tracer for probing on mcount call.
Link: http://lkml.kernel.org/r/20120605102838.27845.26317.stgit@localhost.localdomain
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: "Frank Ch. Eigler" <fche@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
[ Updated to new port of ftrace save regs functions ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The graph caller is called by the mcount callers, which already does
the check against the function_trace_stop variable. No reason to
check it again.
Link: http://lkml.kernel.org/r/20120711195745.588538769@goodmis.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The final position of the stack after saving regs and setting up
the parameters for ftrace_regs_call, is the position of the pt_regs
needed for the 4th parameter. Instead of saving it into a temporary
reg and pushing the reg, simply push the stack pointer.
Link: http://lkml.kernel.org/r/1342702344.12353.16.camel@gandalf.stny.rr.com
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
cd74257b97
patched up GTS/BFS -- a feature we want to remove.
So revert it (by hand, due to conflict in sleep.h)
to prepare for GTS/BFS removal.
Signed-off-by: Len Brown <len.brown@intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Rather than #define the options manually in the architecture code, add
Kconfig options for them and select them there instead. This also allows
us to select the compat IPC version parsing automatically for platforms
using the old compat IPC interface.
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are two ways to create /sys/firmware/memmap/X sysfs:
- firmware_map_add_early
When the system starts, it is calledd from e820_reserve_resources()
- firmware_map_add_hotplug
When the memory is hot plugged, it is called from add_memory()
But these functions are called without unifying value of end argument as
below:
- end argument of firmware_map_add_early() : start + size - 1
- end argument of firmware_map_add_hogplug() : start + size
The patch unifies them to "start + size". Even if applying the patch,
/sys/firmware/memmap/X/end file content does not change.
[akpm@linux-foundation.org: clarify comments]
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Reviewed-by: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce CONFIG_ARCH_HAS_ATOMIC64_DEC_IF_POSITIVE and use this instead
of the multitude of #if defined() checks in atomic64_test.c
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The current code will update the PPR on almost any APIC read; however
that's only required if we read the PPR.
kvm_update_ppr() shows up in some profiles, albeit with a low usage (~1%).
This should reduce it further (it will still be called during interrupt
processing).
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
This allows us in perf to have this:
99.67% [kernel] [k] xen_hypercall_sched_op
0.11% [kernel] [k] xen_hypercall_xen_version
instead of the borring ever-encompassing:
99.13% [kernel] [k] hypercall_page
[v2: Use a macro to define the name and skip]
[v3: Use balign per Jan's suggestion]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Pull x86/mm changes from Peter Anvin:
"The big change here is the patchset by Alex Shi to use INVLPG to flush
only the affected pages when we only need to flush a small page range.
It also removes the special INVALIDATE_TLB_VECTOR interrupts (32
vectors!) and replace it with an ordinary IPI function call."
Fix up trivial conflicts in arch/x86/include/asm/apic.h (added code next
to changed line)
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/tlb: Fix build warning and crash when building for !SMP
x86/tlb: do flush_tlb_kernel_range by 'invlpg'
x86/tlb: replace INVALIDATE_TLB_VECTOR by CALL_FUNCTION_VECTOR
x86/tlb: enable tlb flush range support for x86
mm/mmu_gather: enable tlb flush range in generic mmu_gather
x86/tlb: add tlb_flushall_shift knob into debugfs
x86/tlb: add tlb_flushall_shift for specific CPU
x86/tlb: fall back to flush all when meet a THP large page
x86/flush_tlb: try flush_tlb_single one by one in flush_tlb_range
x86/tlb_info: get last level TLB entry number of CPU
x86: Add read_mostly declaration/definition to variables from smp.h
x86: Define early read-mostly per-cpu macros
Pul x86/efi changes from Ingo Molnar:
"This tree adds an EFI bootloader handover protocol, which, once
supported on the bootloader side, will make bootup faster and might
result in simpler bootloaders.
The other change activates the EFI wall clock time accessors on x86-64
as well, instead of the legacy RTC readout."
* 'x86-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, efi: Handover Protocol
x86-64/efi: Use EFI to deal with platform wall clock
Pull x86 cleanup and cpufeature from Ingo Molnar:
"Just a single cleanup and and a commit that adds new CPU feature
names"
* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, boot: Remove ancient, unconditionally #ifdef'd out dead code
* 'x86-cpufeature-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, cpufeature: Add the RDSEED and ADX features
Pull x86/boot changes from Ingo Molnar:
"Kernel image size reduction and assorted fixes and other small
improvements."
* 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, doc: Assign a bootloader ID for "Minimal Linux Bootloader"
x86, boot: Exclude cmdline.c if you can't use it
x86, boot: Exclude early_serial_console.c if can't use it.
x86, boot: Removed unused debug flag and set code
x86, boot: Switch output functions from command-line flags to conditional compilation
x86, boot: Changed error putstr path to match new debug_putstr format
x86, boot: Wrap debug printing in a new debug_putstr function
x86, boot: Removed quiet flag and switched quiet output to debug flag
Pull scheduler changes from Ingo Molnar:
"The biggest change is a performance improvement on SMP systems:
| 4 socket 40 core + SMT Westmere box, single 30 sec tbench
| runs, higher is better:
|
| clients 1 2 4 8 16 32 64 128
|..........................................................................
| pre 30 41 118 645 3769 6214 12233 14312
| post 299 603 1211 2418 4697 6847 11606 14557
|
| A nice increase in performance.
which speedup is particularly noticeable on heavily interacting
few-tasks workloads, so the changes should help desktop-style Xorg
workloads and interactivity as well, on multi-core CPUs.
There are also cpuset suspend behavior fixes/restructuring and various
smaller tweaks."
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Fix race in task_group()
sched: Improve balance_cpu() to consider other cpus in its group as target of (pinned) task
sched: Reset loop counters if all tasks are pinned and we need to redo load balance
sched: Reorder 'struct lb_env' members to reduce its size
sched: Improve scalability via 'CPU buddies', which withstand random perturbations
cpusets: Remove/update outdated comments
cpusets, hotplug: Restructure functions that are invoked during hotplug
cpusets, hotplug: Implement cpuset tree traversal in a helper function
CPU hotplug, cpusets, suspend: Don't modify cpusets during suspend/resume
sched/x86: Remove broken power estimation
Pull crypto updates from Herbert Xu:
- Fixed algorithm construction hang when self-test fails.
- Added SHA variants to talitos AEAD list.
- New driver for Exynos random number generator.
- Performance enhancements for arc4.
- Added hwrng support to caam.
- Added ahash support to caam.
- Fixed bad kfree in aesni-intel.
- Allow aesni-intel in FIPS mode.
- Added atmel driver with support for AES/3DES/SHA.
- Bug fixes for mv_cesa.
- CRC hardware driver for BF60x family processors.
* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (66 commits)
crypto: twofish-avx - remove useless instruction
crypto: testmgr - add aead cbc aes hmac sha1,256,512 test vectors
crypto: talitos - add sha224, sha384 and sha512 to existing AEAD algorithms
crypto: talitos - export the talitos_submit function
crypto: talitos - move talitos structures to header file
crypto: atmel - add new tests to tcrypt
crypto: atmel - add Atmel SHA1/SHA256 driver
crypto: atmel - add Atmel DES/TDES driver
crypto: atmel - add Atmel AES driver
ARM: AT91SAM9G45: add crypto peripherals
crypto: testmgr - allow aesni-intel and ghash_clmulni-intel in fips mode
hwrng: exynos - Add support for Exynos random number generator
crypto: aesni-intel - fix wrong kfree pointer
crypto: caam - ERA retrieval and printing for SEC device
crypto: caam - Using alloc_coherent for caam job rings
crypto: algapi - Fix hang on crypto allocation
crypto: arc4 - now arc needs blockcipher support
crypto: caam - one tasklet per job ring
crypto: caam - consolidate memory barriers from job ring en/dequeue
crypto: caam - only query h/w in job ring dequeue path
...
Typically, the return value desired for the failure of a
function with an integer return value is a negative integer. In
these cases, the return value is sometimes a negative integer
and sometimes 0, due to a subsequent initialization of the
return variable within the loop.
A simplified version of the semantic match that finds this
problem is: (http://coccinelle.lip6.fr/)
//<smpl>
@r exists@
identifier ret;
position p;
constant C;
expression e1,e3,e4;
statement S;
@@
ret = -C
... when != ret = e3
when any
if@p (...) S
... when any
if (\(ret != 0\|ret < 0\|ret > 0\) || ...) { ... return ...; }
... when != ret = e3
when any
*if@p (...)
{
... when != ret = e4
return ret;
}
//</smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Link: http://lkml.kernel.org/r/1342284188-19176-7-git-send-email-Julia.Lawall@lip6.fr
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Sandy Bridge processors follow the SDM (Vol 3B, Table 15-20) and
set both the RIPV and EIPV bits in the MCG_STATUS register to
zero for machine checks during instruction fetch. This is more
than a little counter-intuitive and means that Linux cannot
recover from these errors. Rather than insert special case code
at several places in mce.c and mce-severity.c, we pretend the
EIPV bit was set for just this case early in processing the
machine check.
Acked-by: Borislav Petkov <bp@amd64.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: Chen Gong <gong.chen@linux.intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Link: http://lkml.kernel.org/r/180a06f3f357cf9f78259ae443a082b14a29535b.1343078495.git.tony.luck@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We will need some of these values in mce.c. Move them to the
appropriate header file so they are available.
Acked-by: Borislav Petkov <bp@amd64.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: Chen Gong <gong.chen@linux.intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Link: http://lkml.kernel.org/r/0ccfb1af5fe35e537b7cd8e4d448bf7d851dbfb9.1343078495.git.tony.luck@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When CONFIG_PERF_EVENTS disabled, there will have a compiliation
error, because missing struct before structure name.
Signed-off-by: Jovi Zhang <bookjovi@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jiri Kosina <trivial@kernel.org>
Link: http://lkml.kernel.org/r/CACV3sbKF%3DCX%2B2jWEWesfCA6rBoQ3wDM4-5ac9MuBtVbCtMRHdQ@mail.gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In the current kernel, percpu variable `vector_irq' is not always
cleared when a CPU is offlined. If the CPU that has the disabled
irqs in vector_irq is hotplugged again, __setup_vector_irq()
hits invalid irq vector and may crash.
This bug can be reproduced as following;
# echo 0 > /sys/devices/system/cpu/cpu7/online
# modprobe -r some_driver_using_interrupts # vector_irq@cpu7 uncleared
# echo 1 > /sys/devices/system/cpu/cpu7/online # kernel may crash
To fix this problem, this patch clears vector_irq in
__fixup_irqs() when the CPU is offlined.
This also reverts commit f6175f5bfb, which partially fixes
this bug by clearing vector in __clear_irq_vector(). But in
environments with IOMMU IRQ remapper, it could fail because
cfg->domain doesn't contain offlined CPUs. With this patch, the
fix in __clear_irq_vector() can be reverted because every
vector_irq is already cleared in __fixup_irqs() on offlined CPUs.
Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: yrl.pp-manager.tt@hitachi.com
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Alexander Gordeev <agordeev@redhat.com>
Link: http://lkml.kernel.org/r/20120726104732.2889.19144.stgit@kvmdev
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The event control register of SNB-EP uncore QPI box has a one bit
extension at bit position 21.
Reported-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1343097850-4348-1-git-send-email-zheng.z.yan@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
LLC-* and node-* events require using the OFFCORE_RESPONSE events
on SandyBridge, but the hw_cache_extra_regs is left uninitialized.
This patch adds the missing extra register configure table for
SandyBridge.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1342517275-2875-1-git-send-email-zheng.z.yan@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The uncore subsystem in Nehalem-EX consists of 7 components
(U-Box, C-Box, B-Box, S-Box, R-Box, M-Box and W-Box). This
patch is large because the way to program these boxes is
diverse.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/4FF534F1.3030307@intel.com
[ Improved the code. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The format definition of uncore PCU filter should be filter_band*
instead of filter_brand*.
Reported-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1343024611-4692-1-git-send-email-zheng.z.yan@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Handle KVM_IRQ_LINE and KVM_IRQ_LINE_STATUS in the generic
kvm_vm_ioctl() function and call into kvm_vm_ioctl_irq_line().
This is even more relevant when KVM/ARM also uses this ioctl.
Signed-off-by: Christoffer Dall <c.dall@virtualopensystems.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
setup_syscalls_segments() calls get_segment() and than overwrites all
but one of the structure fields and this one should also be overwritten
anyway, so we can drop call to get_segment() and avoid a couple of vmreads
on vmx. Also drop zeroing ss/cs structures since most of the fields are
set anyway. Just set those that were not set explicitly.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
After commit 242ec97c35 PIT interrupts are no longer delivered after
PIC reset. It happens because PIT injects interrupt only if previous one
was acked, but since on PIC reset it is dropped from irr it will never
be delivered and hence acknowledged. Fix that by calling ack notifier on
PIC reset.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
No need split mmio read region into 8-bits pieces since we do it in
emulator_read_write_onepage
Changelog:
Add a WARN_ON to check read-cache overflow
Acked-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
The current code depends on the fact that fault_page is the normal page,
however, we will use the error code instead of these dummy pages in the
later patch, so we use kvm_release_pfn_clean to release pfn which will
release the error code properly
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Merge patches queued during the run-up to the merge window.
* queue: (25 commits)
KVM: Choose better candidate for directed yield
KVM: Note down when cpu relax intercepted or pause loop exited
KVM: Add config to support ple or cpu relax optimzation
KVM: switch to symbolic name for irq_states size
KVM: x86: Fix typos in pmu.c
KVM: x86: Fix typos in lapic.c
KVM: x86: Fix typos in cpuid.c
KVM: x86: Fix typos in emulate.c
KVM: x86: Fix typos in x86.c
KVM: SVM: Fix typos
KVM: VMX: Fix typos
KVM: remove the unused parameter of gfn_to_pfn_memslot
KVM: remove is_error_hpa
KVM: make bad_pfn static to kvm_main.c
KVM: using get_fault_pfn to get the fault pfn
KVM: MMU: track the refcount when unmap the page
KVM: x86: remove unnecessary mark_page_dirty
KVM: MMU: Avoid handling same rmap_pde in kvm_handle_hva_range()
KVM: MMU: Push trace_kvm_age_page() into kvm_age_rmapp()
KVM: MMU: Add memslot parameter to hva handlers
...
Signed-off-by: Avi Kivity <avi@redhat.com>
The Intel case falls through into the generic case which then changes
the values. For cases like the P6 it doesn't do the right thing so
this seems to be a screwup.
Signed-off-by: Alan Cox <alan@linux.intel.com>
Link: http://lkml.kernel.org/n/tip-lww2uirad4skzjlmrm0vru8o@git.kernel.org
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: <stable@vger.kernel.org>
The most important part of these updates is the IOMMU groups code
enhancement written by Alex Williamson. It abstracts the problem that a
given hardware IOMMU can't isolate any given device from any other
device (e.g. 32 bit PCI devices can't usually be isolated). Devices that
can't be isolated are grouped together. This code is required for the
upcoming VFIO framework.
Another IOMMU-API change written by be is the introduction of domain
attributes. This makes it easier to handle GART-like IOMMUs with the
IOMMU-API because now the start-address and the size of the domain
address space can be queried.
Besides that there are a few cleanups and fixes for the NVidia Tegra
IOMMU drivers and the reworked init-code for the AMD IOMMU. The later is
from my patch-set to support interrupt remapping. The rest of this
patch-set requires x86 changes which are not mergabe yet. So full
support for interrupt remapping with AMD IOMMUs will come in a future
merge window.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJQDV/MAAoJECvwRC2XARrjSDcP+gJbtSHDMyZ71zyfQfAZcxJt
rTqLbdZRtIjrjgtKSEDp8u5Bo5TK9dAYoZVuJMOZewFzwI/fSfbRsWp1PU0I88Fr
ZzM+/o1N9MLvf1e3kRVOzNzUfku+jTQgUBD4txsbtQzc/IeGHe9qP1Bqzs/xg4Pk
SjWu7pLNYxaER10z76nRodNn6zGjsc7GFdOW8cJu2HOAHhisIAR291jSQgd6Rz9r
zWqSTsXIEzYt2CtU3G2/tFJ554Mp8v5F80gHo+0Ldw8aNxlD6nGtbqGNt+KI8qTv
MUL8KJ0TNms9CZdti1CSlSNp51VgJi2GaWKCaDAkYuuER2IbC/8Yp/p2DIIA0GNp
HpziIs+dauZPWfZHc6oU7lJAClGAG4MUx7CysVIOzl7ML/Bf4mjYv0faGf5YQfyE
weOR+OPPIWDUwgjzHKMAboA4ijkE/v+EKjOaN/S9rEqFEMKC99fwGkf9wUcpZTne
8lzdI2JrgYNDWMVNYlomeLD4lBAbxb/QsnRUa33igjr0MclvMDkp5HaO631Z1+Zx
be2z8Rl1CtMwS4qeaOXoeaoNWHU26+oJRZNtCGi/Fw4aKqYXP1dnE/m0GtqEP9Yi
+CU2rKbZn3j0+ZcQjCQop8FREPrZ2/Uaji70b6G7WZ2ApcqBxzBffpbMKOmd6T1D
HIzGh0fpdYNDuwn6Txit
=MbAC
-----END PGP SIGNATURE-----
Merge tag 'iommu-updates-v3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull IOMMU updates from Joerg Roedel:
"The most important part of these updates is the IOMMU groups code
enhancement written by Alex Williamson. It abstracts the problem that
a given hardware IOMMU can't isolate any given device from any other
device (e.g. 32 bit PCI devices can't usually be isolated). Devices
that can't be isolated are grouped together. This code is required
for the upcoming VFIO framework.
Another IOMMU-API change written by me is the introduction of domain
attributes. This makes it easier to handle GART-like IOMMUs with the
IOMMU-API because now the start-address and the size of the domain
address space can be queried.
Besides that there are a few cleanups and fixes for the NVidia Tegra
IOMMU drivers and the reworked init-code for the AMD IOMMU. The
latter is from my patch-set to support interrupt remapping. The rest
of this patch-set requires x86 changes which are not mergabe yet. So
full support for interrupt remapping with AMD IOMMUs will come in a
future merge window."
* tag 'iommu-updates-v3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (33 commits)
iommu/amd: Fix hotplug with iommu=pt
iommu/amd: Add missing spin_lock initialization
iommu/amd: Convert iommu initialization to state machine
iommu/amd: Introduce amd_iommu_init_dma routine
iommu/amd: Move unmap_flush message to amd_iommu_init_dma_ops()
iommu/amd: Split enable_iommus() routine
iommu/amd: Introduce early_amd_iommu_init routine
iommu/amd: Move informational prinks out of iommu_enable
iommu/amd: Split out PCI related parts of IOMMU initialization
iommu/amd: Use acpi_get_table instead of acpi_table_parse
iommu/amd: Fix sparse warnings
iommu/tegra: Don't call alloc_pdir with as->lock
iommu/tegra: smmu: Fix unsleepable memory allocation at alloc_pdir()
iommu/tegra: smmu: Remove unnecessary sanity check at alloc_pdir()
iommu/exynos: Implement DOMAIN_ATTR_GEOMETRY attribute
iommu/tegra: Implement DOMAIN_ATTR_GEOMETRY attribute
iommu/msm: Implement DOMAIN_ATTR_GEOMETRY attribute
iommu/omap: Implement DOMAIN_ATTR_GEOMETRY attribute
iommu/vt-d: Implement DOMAIN_ATTR_GEOMETRY attribute
iommu/amd: Implement DOMAIN_ATTR_GEOMETRY attribute
...
Host bridge hotplug
- Add MMCONFIG support for hot-added host bridges (Jiang Liu)
Device hotplug
- Move fixups from __init to __devinit (Sebastian Andrzej Siewior)
- Call FINAL fixups for hot-added devices, too (Myron Stowe)
- Factor out generic code for P2P bridge hot-add (Yinghai Lu)
- Remove all functions in a slot, not just those with _EJx (Amos Kong)
Dynamic resource management
- Track bus number allocation (struct resource tree per domain) (Yinghai Lu)
- Make P2P bridge 1K I/O windows work with resource reassignment (Bjorn Helgaas, Yinghai Lu)
- Disable decoding while updating 64-bit BARs (Bjorn Helgaas)
Power management
- Add PCIe runtime D3cold support (Huang Ying)
Virtualization
- Add VFIO infrastructure (ACS, DMA source ID quirks) (Alex Williamson)
- Add quirks for devices with broken INTx masking (Jan Kiszka)
Miscellaneous
- Fix some PCI Express capability version issues (Myron Stowe)
- Factor out some arch code with a weak, generic, pcibios_setup() (Myron Stowe)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
iQIcBAABAgAGBQJQBy+9AAoJEPGMOI97Hn6zOpQP+wVFvA7pcteFj6HPs5nTq2Hc
55oeRqCO0wBHoFMCKB0AjeTATjqxi9OhcjaiVrZejxNyWKC9MnrXuunpQ0l/hCbR
M/TK+BCelfX2FU4eXNf+TBCCcOhOVWqQft9Gm6nYKwX8Y0msRVCceI4WwhZgSwtI
vdtmnqlwolscdnq+8ThsnvUMtwkN0gExmn2FJRl6EoEgG0DTqhMkZ83uA+NPBhvv
I+g0XbA6haaZph2nnSYR0hIW4Q7JkT/LgA6uVAQxamctwxLol7xxsjCRnfqrulkf
kaRr2fAgBXfmaOIltro4UkXrCM52ZSyggCDfExHp6mWGPKMjE5ZcyK1YbGfmmumk
DS3t1S0eBdDJXrnf9l/Yb8e95dQxRCYKelKzr1rTD9QAXsInE8rC40hvhfFaTa4s
nZYRTz0SKv6coQihqaOR7shx1DNomLFk7jndaWEElfl9/cT/nQnZ8XLfVMzkJNNB
Y4SM6zkiIaCL0aiSEE16MqVjmODYRjbURLYzQIrqr2KJQg8X6XjIRojQLjL6xEgA
22ry2ZRPhqO68g7aLqvixiSDaTp0Z0Vw+JmgjtBqvkokwZcGQtm4umkpAdOi+Es8
3bJaMY7ZUpDX53FE8iyP6AnmR/1k19rC1gNnNq/syWyjtYOYJ9i3QCTafFgvE1VC
5coQ1L5tByHvpzK5PHwf
=oo/A
-----END PGP SIGNATURE-----
Merge tag 'for-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI changes from Bjorn Helgaas:
"Host bridge hotplug:
- Add MMCONFIG support for hot-added host bridges (Jiang Liu)
Device hotplug:
- Move fixups from __init to __devinit (Sebastian Andrzej Siewior)
- Call FINAL fixups for hot-added devices, too (Myron Stowe)
- Factor out generic code for P2P bridge hot-add (Yinghai Lu)
- Remove all functions in a slot, not just those with _EJx (Amos
Kong)
Dynamic resource management:
- Track bus number allocation (struct resource tree per domain)
(Yinghai Lu)
- Make P2P bridge 1K I/O windows work with resource reassignment
(Bjorn Helgaas, Yinghai Lu)
- Disable decoding while updating 64-bit BARs (Bjorn Helgaas)
Power management:
- Add PCIe runtime D3cold support (Huang Ying)
Virtualization:
- Add VFIO infrastructure (ACS, DMA source ID quirks) (Alex
Williamson)
- Add quirks for devices with broken INTx masking (Jan Kiszka)
Miscellaneous:
- Fix some PCI Express capability version issues (Myron Stowe)
- Factor out some arch code with a weak, generic, pcibios_setup()
(Myron Stowe)"
* tag 'for-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (122 commits)
PCI: hotplug: ensure a consistent return value in error case
PCI: fix undefined reference to 'pci_fixup_final_inited'
PCI: build resource code for M68K architecture
PCI: pciehp: remove unused pciehp_get_max_lnk_width(), pciehp_get_cur_lnk_width()
PCI: reorder __pci_assign_resource() (no change)
PCI: fix truncation of resource size to 32 bits
PCI: acpiphp: merge acpiphp_debug and debug
PCI: acpiphp: remove unused res_lock
sparc/PCI: replace pci_cfg_fake_ranges() with pci_read_bridge_bases()
PCI: call final fixups hot-added devices
PCI: move final fixups from __init to __devinit
x86/PCI: move final fixups from __init to __devinit
MIPS/PCI: move final fixups from __init to __devinit
PCI: support sizing P2P bridge I/O windows with 1K granularity
PCI: reimplement P2P bridge 1K I/O windows (Intel P64H2)
PCI: disable MEM decoding while updating 64-bit MEM BARs
PCI: leave MEM and IO decoding disabled during 64-bit BAR sizing, too
PCI: never discard enable/suspend/resume_early/resume fixups
PCI: release temporary reference in __nv_msi_ht_cap_quirk()
PCI: restructure 'pci_do_fixups()'
...
Pull trivial tree from Jiri Kosina:
"Trivial updates all over the place as usual."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (29 commits)
Fix typo in include/linux/clk.h .
pci: hotplug: Fix typo in pci
iommu: Fix typo in iommu
video: Fix typo in drivers/video
Documentation: Add newline at end-of-file to files lacking one
arm,unicore32: Remove obsolete "select MISC_DEVICES"
module.c: spelling s/postition/position/g
cpufreq: Fix typo in cpufreq driver
trivial: typo in comment in mksysmap
mach-omap2: Fix typo in debug message and comment
scsi: aha152x: Fix sparse warning and make printing pointer address more portable.
Change email address for Steve Glendinning
Btrfs: fix typo in convert_extent_bit
via: Remove bogus if check
netprio_cgroup.c: fix comment typo
backlight: fix memory leak on obscure error path
Documentation: asus-laptop.txt references an obsolete Kconfig item
Documentation: ManagementStyle: fixed typo
mm/vmscan: cleanup comment error in balance_pgdat
mm: cleanup on the comments of zone_reclaim_stat
...
* Performance improvement to lower the amount of traps the hypervisor
has to do 32-bit guests. Mainly for setting PTE entries and updating
TLS descriptors.
* MCE polling driver to collect hypervisor MCE buffer and present them to
/dev/mcelog.
* Physical CPU online/offline support. When an privileged guest is booted
it is present with virtual CPUs, which might have an 1:1 to physical
CPUs but usually don't. This provides mechanism to offline/online physical
CPUs.
Bug-fixes for:
* Coverity found fixes in the console and ACPI processor driver.
* PVonHVM kexec fixes along with some cleanups.
* Pages that fall within E820 gaps and non-RAM regions (and had been
released to hypervisor) would be populated back, but potentially in
non-RAM regions.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQEcBAABAgAGBQJQDWcvAAoJEFjIrFwIi8fJ6GAH/iFIkOC5wseD8qZ9nV4VI46t
0GYvBFC4F91NvC7CNfoAySr84v+ZORIZzMcdyDF8H/tLO9MaOY/Mwn0S5ZSqmYMi
rhskvK3InBaVkYtceOHugNGM7mB0c3STIm7OsjW6gbVzohmTN25rbQR+X5iWAtVA
cTUtDyH3AU15mwuVT3U+VC4IulHpnNJz4pHoq3Sn61/UK1LYmhLXYd5fveA0D0B8
lRZTAvNMsYDJDDmkWNrs8RczKkQ86DTSjfGawm0YG+Gf94GgD5yMHWbiHh2Gy93e
u7sHK0RrKbP5BY/MV6vVJxkoV5NoWgCc0tcjBcYwdyvwzxDS75UhV6uoVHC3Ao8=
=drt2
-----END PGP SIGNATURE-----
Merge tag 'stable/for-linus-3.6-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen
Pull Xen update from Konrad Rzeszutek Wilk:
"Features:
* Performance improvement to lower the amount of traps the hypervisor
has to do 32-bit guests. Mainly for setting PTE entries and
updating TLS descriptors.
* MCE polling driver to collect hypervisor MCE buffer and present
them to /dev/mcelog.
* Physical CPU online/offline support. When an privileged guest is
booted it is present with virtual CPUs, which might have an 1:1 to
physical CPUs but usually don't. This provides mechanism to
offline/online physical CPUs.
Bug-fixes for:
* Coverity found fixes in the console and ACPI processor driver.
* PVonHVM kexec fixes along with some cleanups.
* Pages that fall within E820 gaps and non-RAM regions (and had been
released to hypervisor) would be populated back, but potentially in
non-RAM regions."
* tag 'stable/for-linus-3.6-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
xen: populate correct number of pages when across mem boundary (v2)
xen PVonHVM: move shared_info to MMIO before kexec
xen: simplify init_hvm_pv_info
xen: remove cast from HYPERVISOR_shared_info assignment
xen: enable platform-pci only in a Xen guest
xen/pv-on-hvm kexec: shutdown watches from old kernel
xen/x86: avoid updating TLS descriptors if they haven't changed
xen/x86: add desc_equal() to compare GDT descriptors
xen/mm: zero PTEs for non-present MFNs in the initial page table
xen/mm: do direct hypercall in xen_set_pte() if batching is unavailable
xen/hvc: Fix up checks when the info is allocated.
xen/acpi: Fix potential memory leak.
xen/mce: add .poll method for mcelog device driver
xen/mce: schedule a workqueue to avoid sleep in atomic context
xen/pcpu: Xen physical cpus online/offline sys interface
xen/mce: Register native mce handler as vMCE bounce back point
x86, MCE, AMD: Adjust initcall sequence for xen
xen/mce: Add mcelog support for Xen platform
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJQDRDNAAoJEI7yEDeUysxlkl8P/3C2AHx2webOU8sVzhfU6ONZ
ZoGevwBjyZIeJEmiWVpFTTEew1l0PXtpyOocXGNUXIddVnhXTQOKr/Scj4uFbmx8
ROqgK8NSX9+xOGrBPCoN7SlJkmp+m6uYtwYkl2SGnsEVLWMKkc7J7oqmszCcTQvN
UXMf7G47/Ul2NUSBdv4Yvizhl4kpvWxluiweDw3E/hIQKN0uyP7CY58qcAztw8nG
csZBAnnuPFwIAWxHXW3eBBv4UP138HbNDqJ/dujjocM6GnOxmXJmcZ6b57gh+Y64
3+w9IR4qrRWnsErb/I8inKLJ1Jdcf7yV2FmxYqR4pIXay2Yzo1BsvFd6EB+JavUv
pJpixrFiDDFoQyXlh4tGpsjpqdXNMLqyG4YpqzSZ46C8naVv9gKE7SXqlXnjyDlb
Llx3hb9Fop8O5ykYEGHi+gIISAK5eETiQl4yw9RUBDpxydH4qJtqGIbLiDy8y9wi
Xyi8PBlNl+biJFsK805lxURqTp/SJTC3+Zb7A7CzYEQm5xZw3W/CKZx1ZYBfpaa/
pWaP6tB7JwgLIVXi4HQayLWqMVwH0soZIn9yazpOEFv6qO8d5QH5RAxAW2VXE3n5
JDlrajar/lGIdiBVWfwTJLb86gv3QDZtIWoR9mZuLKeKWE/6PRLe7HQpG1pJovsm
2AsN5bS0BWq+aqPpZHa5
=pECD
-----END PGP SIGNATURE-----
Merge tag 'kvm-3.6-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Avi Kivity:
"Highlights include
- full big real mode emulation on pre-Westmere Intel hosts (can be
disabled with emulate_invalid_guest_state=0)
- relatively small ppc and s390 updates
- PCID/INVPCID support in guests
- EOI avoidance; 3.6 guests should perform better on 3.6 hosts on
interrupt intensive workloads)
- Lockless write faults during live migration
- EPT accessed/dirty bits support for new Intel processors"
Fix up conflicts in:
- Documentation/virtual/kvm/api.txt:
Stupid subchapter numbering, added next to each other.
- arch/powerpc/kvm/booke_interrupts.S:
PPC asm changes clashing with the KVM fixes
- arch/s390/include/asm/sigp.h, arch/s390/kvm/sigp.c:
Duplicated commits through the kvm tree and the s390 tree, with
subsequent edits in the KVM tree.
* tag 'kvm-3.6-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (93 commits)
KVM: fix race with level interrupts
x86, hyper: fix build with !CONFIG_KVM_GUEST
Revert "apic: fix kvm build on UP without IOAPIC"
KVM guest: switch to apic_set_eoi_write, apic_write
apic: add apic_set_eoi_write for PV use
KVM: VMX: Implement PCID/INVPCID for guests with EPT
KVM: Add x86_hyper_kvm to complete detect_hypervisor_platform check
KVM: PPC: Critical interrupt emulation support
KVM: PPC: e500mc: Fix tlbilx emulation for 64-bit guests
KVM: PPC64: booke: Set interrupt computation mode for 64-bit host
KVM: PPC: bookehv: Add ESR flag to Data Storage Interrupt
KVM: PPC: bookehv64: Add support for std/ld emulation.
booke: Added crit/mc exception handler for e500v2
booke/bookehv: Add host crit-watchdog exception support
KVM: MMU: document mmu-lock and fast page fault
KVM: MMU: fix kvm_mmu_pagetable_walk tracepoint
KVM: MMU: trace fast page fault
KVM: MMU: fast path of handling guest page fault
KVM: MMU: introduce SPTE_MMU_WRITEABLE bit
KVM: MMU: fold tlb flush judgement into mmu_spte_update
...
Pull networking changes from David S Miller:
1) Remove the ipv4 routing cache. Now lookups go directly into the FIB
trie and use prebuilt routes cached there.
No more garbage collection, no more rDOS attacks on the routing
cache. Instead we now get predictable and consistent performance,
no matter what the pattern of traffic we service.
This has been almost 2 years in the making. Special thanks to
Julian Anastasov, Eric Dumazet, Steffen Klassert, and others who
have helped along the way.
I'm sure that with a change of this magnitude there will be some
kind of fallout, but such things ought the be simple to fix at this
point. Luckily I'm not European so I'll be around all of August to
fix things :-)
The major stages of this work here are each fronted by a forced
merge commit whose commit message contains a top-level description
of the motivations and implementation issues.
2) Pre-demux of established ipv4 TCP sockets, saves a route demux on
input.
3) TCP SYN/ACK performance tweaks from Eric Dumazet.
4) Add namespace support for netfilter L4 conntrack helpers, from Gao
Feng.
5) Add config mechanism for Energy Efficient Ethernet to ethtool, from
Yuval Mintz.
6) Remove quadratic behavior from /proc/net/unix, from Eric Dumazet.
7) Support for connection tracker helpers in userspace, from Pablo
Neira Ayuso.
8) Allow userspace driven TX load balancing functions in TEAM driver,
from Jiri Pirko.
9) Kill off NLMSG_PUT and RTA_PUT macros, more gross stuff with
embedded gotos.
10) TCP Small Queues, essentially minimize the amount of TCP data queued
up in the packet scheduler layer. Whereas the existing BQL (Byte
Queue Limits) limits the pkt_sched --> netdevice queuing levels,
this controls the TCP --> pkt_sched queueing levels.
From Eric Dumazet.
11) Reduce the number of get_page/put_page ops done on SKB fragments,
from Alexander Duyck.
12) Implement protection against blind resets in TCP (RFC 5961), from
Eric Dumazet.
13) Support the client side of TCP Fast Open, basically the ability to
send data in the SYN exchange, from Yuchung Cheng.
Basically, the sender queues up data with a sendmsg() call using
MSG_FASTOPEN, then they do the connect() which emits the queued up
fastopen data.
14) Avoid all the problems we get into in TCP when timers or PMTU events
hit a locked socket. The TCP Small Queues changes added a
tcp_release_cb() that allows us to queue work up to the
release_sock() caller, and that's what we use here too. From Eric
Dumazet.
15) Zero copy on TX support for TUN driver, from Michael S. Tsirkin.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1870 commits)
genetlink: define lockdep_genl_is_held() when CONFIG_LOCKDEP
r8169: revert "add byte queue limit support".
ipv4: Change rt->rt_iif encoding.
net: Make skb->skb_iif always track skb->dev
ipv4: Prepare for change of rt->rt_iif encoding.
ipv4: Remove all RTCF_DIRECTSRC handliing.
ipv4: Really ignore ICMP address requests/replies.
decnet: Don't set RTCF_DIRECTSRC.
net/ipv4/ip_vti.c: Fix __rcu warnings detected by sparse.
ipv4: Remove redundant assignment
rds: set correct msg_namelen
openvswitch: potential NULL deref in sample()
tcp: dont drop MTU reduction indications
bnx2x: Add new 57840 device IDs
tcp: avoid oops in tcp_metrics and reset tcpm_stamp
niu: Change niu_rbr_fill() to use unlikely() to check niu_rbr_add_page() return value
niu: Fix to check for dma mapping errors.
net: Fix references to out-of-scope variables in put_cmsg_compat()
net: ethernet: davinci_emac: add pm_runtime support
net: ethernet: davinci_emac: Remove unnecessary #include
...
The x86 sched power implementation has been broken forever and gets in
the way of other stuff, remove it.
[ For archaeological interest, fixing this code would require dealing
with the cross-cpu calling of these functions and more importantly, we
need to filter idle time out of the a/m-perf stuff because the ratio
will go down to 0 when idle, giving a 0 capacity which is not what
we'd want. ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Link: http://lkml.kernel.org/r/1339594110.8980.38.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Suggested-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com> # on s390x
Signed-off-by: Avi Kivity <avi@redhat.com>
Pull x86/mce changes from Ingo Molnar:
"This tree improves the AMD thresholding bank code and includes a
memory fault signal handling fixlet."
* 'x86-mce-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Fix siginfo_t->si_addr value for non-recoverable memory faults
x86, MCE, AMD: Update copyrights and boilerplate
x86, MCE, AMD: Give proper names to the thresholding banks
x86, MCE, AMD: Make error_count read only
x86, MCE, AMD: Cleanup reading of error_count
x86, MCE, AMD: Print decimal thresholding values
x86, MCE, AMD: Move shared bank to node descriptor
x86, MCE, AMD: Remove local_allocate_... wrapper
x86, MCE, AMD: Remove shared banks sysfs linking
x86, amd_nb: Export model 0x10 and later PCI id
* ACPI conversion to PM handling based on struct dev_pm_ops.
* Conversion of a number of platform drivers to PM handling based on struct
dev_pm_ops and removal of empty legacy PM callbacks from a couple of PCI
drivers.
* Suspend-to-both for in-kernel hibernation from Bojan Smojver.
* cpuidle fixes and cleanups from ShuoX Liu, Daniel Lezcano and Preeti U Murthy.
* cpufreq bug fixes from Jonghwa Lee and Stephen Boyd.
* Suspend and hibernate fixes from Srivatsa S. Bhat and Colin Cross.
* Generic PM domains framework updates.
* RTC CMOS wakeup signaling update from Paul Fox.
* sparse warnings fixes from Sachin Kamat.
* Build warnings fixes for the generic PM domains framework and PM sysfs code.
* sysfs switch for printing device suspend times from Sameer Nanda.
* Documentation fix from Oskar Schirmer.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
iQIcBAABAgAGBQJQDF5eAAoJEKhOf7ml8uNsEaAP/2wg4faoOGob5A0/7tLqG3Cw
xnTmGsfL7wG07Q8ykCL1BSlBb1VeJz8L6LTmUpaABI4M//oIBlcYQKyCE0Tat1AO
9bJXFzK7qcHMhkTz6d6LDqtVzR3NGM3ypjZqj8aEXBov07LMR1AXvgNwXXhv25zM
0unwrh1XNinBN3n+oaktpWk1YHUjsa5IMU+2tQJrocuHXcgK30vGXZVrZ4g9w1c2
eS+ED1oKUqOYtFzIUX+aCtaDDheGaPlugk/GOtIB7Sae0s0vMlxH/T5ncB4SxRC+
v3s4OykqQc5Dc8+0bNlBH7ykSVNB0PoQiyKDY67CxtH+q1xQSc9/f3XJqnGMaVDE
17eZUZsL4qSyzRuCbPCGAgwBHmx3qNCMu1i1BcmnSxU+ikPUeCR7mYOP0mRThwPH
OSfs+c/vZ+Ow6CwVE4UFrbm9Jve7ADnCrlZzT2m6XjhHGyjKP7SJlzP9TPsZ0LRk
oxgQDYHmxbo50t9tBCz5L4ZTMKkDp28e78x84/CteP85srcW3GqDxrPyp2uzJu5O
tvIEBvVlc4ucq8sG83RkugQwrG/2cQwG2HO9ERAwq01HHA1BYsuU3A961Jqf5CZo
nFRSnByvVj/imPf47OWpDPAbVEs7jxufJuLEbPwGj1MkttTGDBIRu3zldXt2S6kP
Q4qYU6fDaQQHFc90pqxQ
=vC4/
-----END PGP SIGNATURE-----
Merge tag 'pm-for-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
- ACPI conversion to PM handling based on struct dev_pm_ops.
- Conversion of a number of platform drivers to PM handling based on
struct dev_pm_ops and removal of empty legacy PM callbacks from a
couple of PCI drivers.
- Suspend-to-both for in-kernel hibernation from Bojan Smojver.
- cpuidle fixes and cleanups from ShuoX Liu, Daniel Lezcano and Preeti
Murthy.
- cpufreq bug fixes from Jonghwa Lee and Stephen Boyd.
- Suspend and hibernate fixes from Srivatsa Bhat and Colin Cross.
- Generic PM domains framework updates.
- RTC CMOS wakeup signaling update from Paul Fox.
- sparse warnings fixes from Sachin Kamat.
- Build warnings fixes for the generic PM domains framework and PM
sysfs code.
- sysfs switch for printing device suspend times from Sameer Nanda.
- Documentation fix from Oskar Schirmer.
* tag 'pm-for-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (70 commits)
cpufreq: Fix sysfs deadlock with concurrent hotplug/frequency switch
EXYNOS: bugfix on retrieving old_index from freqs.old
PM / Sleep: call early resume handlers when suspend_noirq fails
PM / QoS: Use NULL pointer instead of plain integer in qos.c
PM / QoS: Use NULL pointer instead of plain integer in pm_qos.h
PM / Sleep: Require CAP_BLOCK_SUSPEND to use wake_lock/wake_unlock
PM / Sleep: Add missing static storage class specifiers in main.c
cpuilde / ACPI: remove time from acpi_processor_cx structure
cpuidle / ACPI: remove usage from acpi_processor_cx structure
cpuidle / ACPI : remove latency_ticks from acpi_processor_cx structure
rtc-cmos: report wakeups from interrupt handler
PM / Sleep: Fix build warning in sysfs.c for CONFIG_PM_SLEEP unset
PM / Domains: Fix build warning for CONFIG_PM_RUNTIME unset
olpc-xo15-sci: Use struct dev_pm_ops for power management
PM / Domains: Replace plain integer with NULL pointer in domain.c file
PM / Domains: Add missing static storage class specifier in domain.c file
PM / crypto / ux500: Use struct dev_pm_ops for power management
PM / IPMI: Remove empty legacy PCI PM callbacks
tpm_nsc: Use struct dev_pm_ops for power management
tpm_tis: Use struct dev_pm_ops for power management
...
Pull a x86/build change from Ingo Molnar.
This makes the default stack alignment on x86-64 be just 8, allowing for
improved code generation (it can avoid some unnecessary extra alignment
logic and use just pure push/pop sequences) and smaller stack frames.
We can't generally do SSE with 16-byte alignment issues in the kernel anyway.
* 'x86-build-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86-64, gcc: Use -mpreferred-stack-boundary=3 if supported
Pull x86/uv changes from Ingo Molnar:
"UV2 BAU productization fixes.
The BAU (Broadcast Assist Unit) is SGI's fancy out of line way on UV
hardware to do TLB flushes, instead of the normal APIC IPI methods.
The commits here fix / work around hangs in their latest hardware
iteration (UV2).
My understanding is that the main purpose of the out of line
signalling channel is to improve scalability: the UV APIC hardware
glue does not handle broadcasting to many CPUs very well, and this
matters most for TLB shootdowns.
[ I don't agree with all aspects of the current approach: in hindsight
it would have been better to link the BAU at the IPI/APIC driver
level instead of the TLB shootdown level, where TLB flushes are
really just one of the uses of broadcast SMP messages. Doing that
would improve scalability in some other ways and it would also
remove a few uglies from the TLB path. It would also be nice to
push more is_uv_system() tests into proper x86_init or x86_platform
callbacks. Cliff? ]"
* 'x86-uv-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/uv: Work around UV2 BAU hangs
x86/uv: Implement UV BAU runtime enable and disable control via /proc/sgi_uv/
x86/uv: Fix the UV BAU destination timeout period
Pull x86/reboot changes from Ingo Molnar:
"Now that the revampted x86 real-mode trampoline code is upstream and
seems to be working well, we can extend the 64-bit reboot code to be
as capable as the 32-bit one."
* 'x86-reboot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86-64, reboot: Be more paranoid in 64-bit reboot=bios
x86, reboot: Drop redundant write of reboot_mode
x86-64, reboot: Allow reboot=bios and reboot-cpu override on x86-64
Pull x86 platform changes from Ingo Molnar:
"This tree mostly involves various APIC driver cleanups/robustization,
and vSMP motivated platform callback improvements/cleanups"
Fix up trivial conflict due to printk cleanup right next to return value
change.
* 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (29 commits)
Revert "x86/early_printk: Replace obsolete simple_strtoul() usage with kstrtoint()"
x86/apic/x2apic: Use multiple cluster members for the irq destination only with the explicit affinity
x86/apic/x2apic: Limit the vector reservation to the user specified mask
x86/apic: Optimize cpu traversal in __assign_irq_vector() using domain membership
x86/vsmp: Fix vector_allocation_domain's return value
irq/apic: Use config_enabled(CONFIG_SMP) checks to clean up irq_set_affinity() for UP
x86/vsmp: Fix linker error when CONFIG_PROC_FS is not set
x86/apic/es7000: Make apicid of a cluster (not CPU) from a cpumask
x86/apic/es7000+summit: Always make valid apicid from a cpumask
x86/apic/es7000+summit: Fix compile warning in cpu_mask_to_apicid()
x86/apic: Fix ugly casting and branching in cpu_mask_to_apicid_and()
x86/apic: Eliminate cpu_mask_to_apicid() operation
x86/x2apic/cluster: Vector_allocation_domain() should return a value
x86/apic/irq_remap: Silence a bogus pr_err()
x86/vsmp: Ignore IOAPIC IRQ affinity if possible
x86/apic: Make cpu_mask_to_apicid() operations check cpu_online_mask
x86/apic: Make cpu_mask_to_apicid() operations return error code
x86/apic: Avoid useless scanning thru a cpumask in assign_irq_vector()
x86/apic: Try to spread IRQ vectors to different priority levels
x86/apic: Factor out default vector_allocation_domain() operation
...
Pull debug-for-linus git tree from Ingo Molnar.
Fix up trivial conflict in arch/x86/kernel/cpu/perf_event_intel.c due to
a printk() having changed to a pr_info() differently in the two branches.
* 'x86-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Move call to print_modules() out of show_regs()
x86/mm: Mark free_initrd_mem() as __init
x86/microcode: Mark microcode_id[] as __initconst
x86/nmi: Clean up register_nmi_handler() usage
x86: Save cr2 in NMI in case NMIs take a page fault (for i386)
x86: Remove cmpxchg from i386 NMI nesting code
x86: Save cr2 in NMI in case NMIs take a page fault
x86/debug: Add KERN_<LEVEL> to bare printks, convert printks to pr_<level>
Pull x86/asm changes from Ingo Molnar:
"Assorted single-commit improvements, as usual"
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm/mtrr: Slightly simplify print_mtrr_state()
x86/mm/mtrr: Fix alignment determination in range_to_mtrr()
x86/copy_user_generic: Optimize copy_user_generic with CPU erms feature
x86/alternatives: Use atomic_xchg() instead atomic_dec_and_test() for stop_machine_text_poke()
This reverts commit fbd24153c4.
This commit is subtly buggy: kstrto*int() can return an error but
it's not checked in every path. simple_strtoul() on the other hand
could not fail, so this patch subtly intruduces new failure modes.
Signed-off-by: Shuah Khan <shuahkhan@gmail.com>
Link: http://lkml.kernel.org/r/1338424803.3569.5.camel@lorien2
Signed-off-by: Ingo Molnar <mingo@kernel.org>
CONFIG_EARLY_PRINTK is the only feature that might use command line
parsing in the decompression stage. If it is disabled then we can
exclude the related code to save space. This can result in an estimated
space savings of 2240 bytes from the compressed kernel image.
Signed-off-by: Joe Millenbach <jmillenbach@gmail.com>
Link: http://lkml.kernel.org/r/1342746282-28497-8-git-send-email-jmillenbach@gmail.com
Signed-off-by: Gokul Caushik <caushik1@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Removes early_serial_console.c code if we don't have the config option that
enables it (EARLY_PRINTK). When disabling this code, make early_serial_base a
constant 0 to allow the compiler to optimize away the code that checks for
early_serial_base.
Signed-off-by: Joe Millenbach <jmillenbach@gmail.com>
Link: http://lkml.kernel.org/r/1342746282-28497-7-git-send-email-jmillenbach@gmail.com
Signed-off-by: Gokul Caushik <caushik1@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
As we're no longer using the flag we don't need to extract the value from the
command line and store it. This is a step towards removing command line
parameter code.
Signed-off-by: Joe Millenbach <jmillenbach@gmail.com>
Link: http://lkml.kernel.org/r/1342746282-28497-6-git-send-email-jmillenbach@gmail.com
Signed-off-by: Gokul Caushik <caushik1@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Changed putstr flagging from parameter to conditional compilation for puts,
debug_putstr, and error_putstr. This allows for space savings since most
configurations won't use this feature.
Signed-off-by: Joe Millenbach <jmillenbach@gmail.com>
Link: http://lkml.kernel.org/r/1342746282-28497-5-git-send-email-jmillenbach@gmail.com
Signed-off-by: Gokul Caushik <caushik1@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
For consistency we changed the error output path to match the new debug path.
Signed-off-by: Joe Millenbach <jmillenbach@gmail.com>
Link: http://lkml.kernel.org/r/1342746282-28497-4-git-send-email-jmillenbach@gmail.com
Signed-off-by: Gokul Caushik <caushik1@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Change all instances of if (debug) putstr(...) to a new debug_putstr(...).
This allows a future change to conditionally stub out debug_putstr to save
space.
Signed-off-by: Joe Millenbach <jmillenbach@gmail.com>
Link: http://lkml.kernel.org/r/1342746282-28497-3-git-send-email-jmillenbach@gmail.com
Signed-off-by: Gokul Caushik <caushik1@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
There are only 3 uses of the quiet flag and they all protect output that
is only useful for debugging the stub, therefore we switched to using the
debug flag for all extra output.
Signed-off-by: Joe Millenbach <jmillenbach@gmail.com>
Link: http://lkml.kernel.org/r/1342746282-28497-2-git-send-email-jmillenbach@gmail.com
Signed-off-by: Gokul Caushik <caushik1@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
As things currently stand, traditional EFI boot loaders and the EFI
boot stub are carrying essentially the same initialisation code
required to setup an EFI machine for booting a kernel. There's really
no need to have this code in two places and the hope is that, with
this new protocol, initialisation and booting of the kernel can be
left solely to the kernel's EFI boot stub. The responsibilities of the
boot loader then become,
o Loading the kernel image from boot media
File system code still needs to be carried by boot loaders for the
scenario where the kernel and initrd files reside on a file system
that the EFI firmware doesn't natively understand, such as ext4, etc.
o Providing a user interface
Boot loaders still need to display any menus/interfaces, for example
to allow the user to select from a list of kernels.
Bump the boot protocol number because we added the 'handover_offset'
field to indicate the location of the handover protocol entry point.
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Peter Jones <pjones@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Acked-and-Tested-by: Matthew Garrett <mjg@redhat.com>
Link: http://lkml.kernel.org/r/1342689828-16815-1-git-send-email-matt@console-pimps.org
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
The incompatible parameter of flush_tlb_mm_range cause build warning.
Fix it by correct parameter.
Ingo Molnar found that this could also cause a user space crash.
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Alex Shi <alex.shi@intel.com>
Link: http://lkml.kernel.org/r/1342747103-19765-1-git-send-email-alex.shi@intel.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Add the RDSEED and ADX features documented in section 9.1 of the Intel
Architecture Instruction Set Extensions Programming Reference,
document 319433, version 013b, available from
http://software.intel.com/en-us/avx/
The PREFETCHW bit is already supported in Linux under the name
3DNOWPREFETCH.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Link: http://lkml.kernel.org/n/tip-lgr6482ufk1bvxzvc2hr8qbp@git.kernel.org
Use PIC_NUM_PINS instead of hard-coded 16 for pic pins.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
When more than 1 source id is in use for the same GSI, we have the
following race related to handling irq_states race:
CPU 0 clears bit 0. CPU 0 read irq_state as 0. CPU 1 sets level to 1.
CPU 1 calls kvm_ioapic_set_irq(1). CPU 0 calls kvm_ioapic_set_irq(0).
Now ioapic thinks the level is 0 but irq_state is not 0.
Fix by performing all irq_states bitmap handling under pic/ioapic lock.
This also removes the need for atomics with irq_states handling.
Reported-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
The parameter, 'kvm', is not used in gfn_to_pfn_memslot, we can happily remove
it
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Using get_fault_pfn to cleanup the code
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
It will trigger a WARN_ON if the page has been freed but it is still
used in mmu, it can help us to detect mm bug early
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Currently kexec in a PVonHVM guest fails with a triple fault because the
new kernel overwrites the shared info page. The exact failure depends on
the size of the kernel image. This patch moves the pfn from RAM into
MMIO space before the kexec boot.
The pfn containing the shared_info is located somewhere in RAM. This
will cause trouble if the current kernel is doing a kexec boot into a
new kernel. The new kernel (and its startup code) can not know where the
pfn is, so it can not reserve the page. The hypervisor will continue to
update the pfn, and as a result memory corruption occours in the new
kernel.
One way to work around this issue is to allocate a page in the
xen-platform pci device's BAR memory range. But pci init is done very
late and the shared_info page is already in use very early to read the
pvclock. So moving the pfn from RAM to MMIO is racy because some code
paths on other vcpus could access the pfn during the small window when
the old pfn is moved to the new pfn. There is even a small window were
the old pfn is not backed by a mfn, and during that time all reads
return -1.
Because it is not known upfront where the MMIO region is located it can
not be used right from the start in xen_hvm_init_shared_info.
To minimise trouble the move of the pfn is done shortly before kexec.
This does not eliminate the race because all vcpus are still online when
the syscore_ops will be called. But hopefully there is no work pending
at this point in time. Also the syscore_op is run last which reduces the
risk further.
Signed-off-by: Olaf Hering <olaf@aepfle.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
init_hvm_pv_info is called only in PVonHVM context, move it into ifdef.
init_hvm_pv_info does not fail, make it a void function.
remove arguments from init_hvm_pv_info because they are not used by the
caller.
Signed-off-by: Olaf Hering <olaf@aepfle.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Both have type struct shared_info so no cast is needed.
Signed-off-by: Olaf Hering <olaf@aepfle.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
When switching tasks in a Xen PV guest, avoid updating the TLS
descriptors if they haven't changed. This improves the speed of
context switches by almost 10% as much of the time the descriptors are
the same or only one is different.
The descriptors written into the GDT by Xen are modified from the
values passed in the update_descriptor hypercall so we keep shadow
copies of the three TLS descriptors to compare against.
lmbench3 test Before After Improvement
--------------------------------------------
lat_ctx -s 32 24 7.19 6.52 9%
lat_pipe 12.56 11.66 7%
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
When constructing the initial page tables, if the MFN for a usable PFN
is missing in the p2m then that frame is initially ballooned out. In
this case, zero the PTE (as in decrease_reservation() in
drivers/xen/balloon.c).
This is obviously safe instead of having an valid PTE with an MFN of
INVALID_P2M_ENTRY (~0).
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
In xen_set_pte() if batching is unavailable (because the caller is in
an interrupt context such as handling a page fault) it would fall back
to using native_set_pte() and trapping and emulating the PTE write.
On 32-bit guests this requires two traps for each PTE write (one for
each dword of the PTE). Instead, do one mmu_update hypercall
directly.
During construction of the initial page tables, continue to use
native_set_pte() because most of the PTEs being set are in writable
and unpinned pages (see phys_pmd_init() in arch/x86/mm/init_64.c) and
using a hypercall for this is very expensive.
This significantly improves page fault performance in 32-bit PV
guests.
lmbench3 test Before After Improvement
----------------------------------------------
lat_pagefault 3.18 us 2.32 us 27%
lat_proc fork 356 us 313.3 us 11%
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
When Xen hypervisor inject vMCE to guest, use native mce handler
to handle it
Signed-off-by: Ke, Liping <liping.ke@intel.com>
Signed-off-by: Jiang, Yunhong <yunhong.jiang@intel.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
there are 3 funcs which need to be _initcalled in a logic sequence:
1. xen_late_init_mcelog
2. mcheck_init_device
3. threshold_init_device
xen_late_init_mcelog must register xen_mce_chrdev_device before
native mce_chrdev_device registration if running under xen platform;
mcheck_init_device should be inited before threshold_init_device to
initialize mce_device, otherwise a a NULL ptr dereference will cause panic.
so we use following _initcalls
1. device_initcall(xen_late_init_mcelog);
2. device_initcall_sync(mcheck_init_device);
3. late_initcall(threshold_init_device);
when running under xen, the initcall order is 1,2,3;
on baremetal, we skip 1 and we do only 2 and 3.
Acked-and-tested-by: Borislav Petkov <bp@amd64.org>
Suggested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>