linux/arch/x86/Kconfig

2759 lines
89 KiB
Plaintext
Raw Normal View History

# Select 32 or 64 bit
config 64BIT
x86: simplify "make ARCH=x86" and fix kconfig all.config Simplify "make ARCH=x86" and fix kconfig so we again can set 64BIT in all.config. For a fix the diffstat is nice: 6 files changed, 3 insertions(+), 36 deletions(-) The patch reverts these commits: - 0f855aa64b3f63d35a891510cf7db932a435c116 ("kconfig: add helper to set config symbol from environment variable") - 2a113281f5cd2febbab21a93c8943f8d3eece4d3 ("kconfig: use $K64BIT to set 64BIT with all*config targets") Roman Zippel pointed out that kconfig supported string compares so the additional complexity introduced by the above two patches were not needed. With this patch we have following behaviour: # make {allno,allyes,allmod,rand}config [ARCH=...] option \ host arch | 32bit | 64bit ===================================================== ./. | 32bit | 64bit ARCH=x86 | 32bit | 32bit ARCH=i386 | 32bit | 32bit ARCH=x86_64 | 64bit | 64bit The general rule are that ARCH= and native architecture takes precedence over the configuration. So make ARCH=i386 [whatever] will always build a 32-bit kernel no matter what the configuration says. The configuration will be updated to 32-bit if it was configured to 64-bit and the other way around. This behaviour is consistent with previous behaviour so no suprises here. make ARCH=x86 will per default result in a 32-bit kernel but as the only ARCH= value x86 allow the user to select between 32-bit and 64-bit using menuconfig. Signed-off-by: Sam Ravnborg <sam@ravnborg.org> Cc: Roman Zippel <zippel@linux-m68k.org> Cc: Andreas Herrmann <aherrman@arcor.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-11-17 22:37:31 +08:00
bool "64-bit kernel" if ARCH = "x86"
x86: Default to ARCH=x86 to avoid overriding CONFIG_64BIT It is easy to waste a bunch of time when one takes a 32-bit .config from a test machine and try to build it on a faster 64-bit system, and its existing setting of CONFIG_64BIT=n gets *changed* to match the build host. Similarly, if one has an existing build tree it is easy to trash an entire build tree that way. This is because the default setting for $ARCH when discovered from 'uname' is one of the legacy pre-x86-merge values (i386 or x86_64), which effectively force the setting of CONFIG_64BIT to match. We should default to ARCH=x86 instead, finally completing the merge that we started so long ago. This patch preserves the behaviour of the legacy ARCH settings for commands such as: make ARCH=x86_64 randconfig make ARCH=i386 randconfig ... since making the value of CONFIG_64BIT actually random in that situation is not desirable. In time, perhaps we can retire this legacy use of the old ARCH= values. We already have a way to override values for *any* config option, using $KCONFIG_ALLCONFIG, so it could be argued that we don't necessarily need to keep ARCH={i386,x86_64} around as a special case just for overriding CONFIG_64BIT. We'd probably at least want to add a way to override config options from the command line ('make CONFIG_FOO=y oldconfig') before we talk about doing that though. Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Link: http://lkml.kernel.org/r/1356040315.3198.51.camel@shinybook.infradead.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-12-21 05:51:55 +08:00
default ARCH != "i386"
---help---
Say yes to build a 64-bit kernel - formerly known as x86_64
Say no to build a 32-bit kernel - formerly known as i386
config X86_32
def_bool y
depends on !64BIT
config X86_64
def_bool y
depends on 64BIT
### Arch settings
config X86
def_bool y
select ACPI_LEGACY_TABLES_LOOKUP if ACPI
select ACPI_SYSTEM_POWER_STATES_SUPPORT if ACPI
select ANON_INODES
select ARCH_CLOCKSOURCE_DATA
select ARCH_DISCARD_MEMBLOCK
select ARCH_HAS_ATOMIC64_DEC_IF_POSITIVE
Kconfig: consolidate CONFIG_DEBUG_STRICT_USER_COPY_CHECKS The help text for this config is duplicated across the x86, parisc, and s390 Kconfig.debug files. Arnd Bergman noted that the help text was slightly misleading and should be fixed to state that enabling this option isn't a problem when using pre 4.4 gcc. To simplify the rewording, consolidate the text into lib/Kconfig.debug and modify it there to be more explicit about when you should say N to this config. Also, make the text a bit more generic by stating that this option enables compile time checks so we can cover architectures which emit warnings vs. ones which emit errors. The details of how an architecture decided to implement the checks isn't as important as the concept of compile time checking of copy_from_user() calls. While we're doing this, remove all the copy_from_user_overflow() code that's duplicated many times and place it into lib/ so that any architecture supporting this option can get the function for free. Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Ingo Molnar <mingo@kernel.org> Acked-by: H. Peter Anvin <hpa@zytor.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Acked-by: Helge Deller <deller@gmx.de> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Chris Metcalf <cmetcalf@tilera.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-01 06:28:42 +08:00
select ARCH_HAS_DEBUG_STRICT_USER_COPY_CHECKS
select ARCH_HAS_DEVMEM_IS_ALLOWED
select ARCH_HAS_ELF_RANDOMIZE
select ARCH_HAS_FAST_MULTIPLIER
select ARCH_HAS_GCOV_PROFILE_ALL
kernel: add kcov code coverage kcov provides code coverage collection for coverage-guided fuzzing (randomized testing). Coverage-guided fuzzing is a testing technique that uses coverage feedback to determine new interesting inputs to a system. A notable user-space example is AFL (http://lcamtuf.coredump.cx/afl/). However, this technique is not widely used for kernel testing due to missing compiler and kernel support. kcov does not aim to collect as much coverage as possible. It aims to collect more or less stable coverage that is function of syscall inputs. To achieve this goal it does not collect coverage in soft/hard interrupts and instrumentation of some inherently non-deterministic or non-interesting parts of kernel is disbled (e.g. scheduler, locking). Currently there is a single coverage collection mode (tracing), but the API anticipates additional collection modes. Initially I also implemented a second mode which exposes coverage in a fixed-size hash table of counters (what Quentin used in his original patch). I've dropped the second mode for simplicity. This patch adds the necessary support on kernel side. The complimentary compiler support was added in gcc revision 231296. We've used this support to build syzkaller system call fuzzer, which has found 90 kernel bugs in just 2 months: https://github.com/google/syzkaller/wiki/Found-Bugs We've also found 30+ bugs in our internal systems with syzkaller. Another (yet unexplored) direction where kcov coverage would greatly help is more traditional "blob mutation". For example, mounting a random blob as a filesystem, or receiving a random blob over wire. Why not gcov. Typical fuzzing loop looks as follows: (1) reset coverage, (2) execute a bit of code, (3) collect coverage, repeat. A typical coverage can be just a dozen of basic blocks (e.g. an invalid input). In such context gcov becomes prohibitively expensive as reset/collect coverage steps depend on total number of basic blocks/edges in program (in case of kernel it is about 2M). Cost of kcov depends only on number of executed basic blocks/edges. On top of that, kernel requires per-thread coverage because there are always background threads and unrelated processes that also produce coverage. With inlined gcov instrumentation per-thread coverage is not possible. kcov exposes kernel PCs and control flow to user-space which is insecure. But debugfs should not be mapped as user accessible. Based on a patch by Quentin Casasnovas. [akpm@linux-foundation.org: make task_struct.kcov_mode have type `enum kcov_mode'] [akpm@linux-foundation.org: unbreak allmodconfig] [akpm@linux-foundation.org: follow x86 Makefile layout standards] Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: syzkaller <syzkaller@googlegroups.com> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Tavis Ormandy <taviso@google.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com> Cc: Kostya Serebryany <kcc@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Kees Cook <keescook@google.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: David Drysdale <drysdale@google.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-23 05:27:30 +08:00
select ARCH_HAS_KCOV if X86_64
x86, pmem: clarify that ARCH_HAS_PMEM_API implies PMEM mapped WB Given that a write-back (WB) mapping plus non-temporal stores is expected to be the most efficient way to access PMEM, update the definition of ARCH_HAS_PMEM_API to imply arch support for WB-mapped-PMEM. This is needed as a pre-requisite for adding PMEM to the direct map and mapping it with struct page. The above clarification for X86_64 means that memcpy_to_pmem() is permitted to use the non-temporal arch_memcpy_to_pmem() rather than needlessly fall back to default_memcpy_to_pmem() when the pcommit instruction is not available. When arch_memcpy_to_pmem() is not guaranteed to flush writes out of cache, i.e. on older X86_32 implementations where non-temporal stores may just dirty cache, ARCH_HAS_PMEM_API is simply disabled. The default fall back for persistent memory handling remains. Namely, map it with the WT (write-through) cache-type and hope for the best. arch_has_pmem_api() is updated to only indicate whether the arch provides the proper helpers to meet the minimum "writes are visible outside the cache hierarchy after memcpy_to_pmem() + wmb_pmem()". Code that cares whether wmb_pmem() actually flushes writes to pmem must now call arch_has_wmb_pmem() directly. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> [hch: set ARCH_HAS_PMEM_API=n on x86_32] Reviewed-by: Christoph Hellwig <hch@lst.de> [toshi: x86_32 compile fixes] Signed-off-by: Toshi Kani <toshi.kani@hp.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-08-25 06:29:38 +08:00
select ARCH_HAS_PMEM_API if X86_64
nd_blk: change aperture mapping from WC to WB This should result in a pretty sizeable performance gain for reads. For rough comparison I did some simple read testing using PMEM to compare reads of write combining (WC) mappings vs write-back (WB). This was done on a random lab machine. PMEM reads from a write combining mapping: # dd of=/dev/null if=/dev/pmem0 bs=4096 count=100000 100000+0 records in 100000+0 records out 409600000 bytes (410 MB) copied, 9.2855 s, 44.1 MB/s PMEM reads from a write-back mapping: # dd of=/dev/null if=/dev/pmem0 bs=4096 count=1000000 1000000+0 records in 1000000+0 records out 4096000000 bytes (4.1 GB) copied, 3.44034 s, 1.2 GB/s To be able to safely support a write-back aperture I needed to add support for the "read flush" _DSM flag, as outlined in the DSM spec: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf This flag tells the ND BLK driver that it needs to flush the cache lines associated with the aperture after the aperture is moved but before any new data is read. This ensures that any stale cache lines from the previous contents of the aperture will be discarded from the processor cache, and the new data will be read properly from the DIMM. We know that the cache lines are clean and will be discarded without any writeback because either a) the previous aperture operation was a read, and we never modified the contents of the aperture, or b) the previous aperture operation was a write and we must have written back the dirtied contents of the aperture to the DIMM before the I/O was completed. In order to add support for the "read flush" flag I needed to add a generic routine to invalidate cache lines, mmio_flush_range(). This is protected by the ARCH_HAS_MMIO_FLUSH Kconfig variable, and is currently only supported on x86. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-08-28 03:14:20 +08:00
select ARCH_HAS_MMIO_FLUSH
select ARCH_HAS_SG_CHAIN
UBSAN: run-time undefined behavior sanity checker UBSAN uses compile-time instrumentation to catch undefined behavior (UB). Compiler inserts code that perform certain kinds of checks before operations that could cause UB. If check fails (i.e. UB detected) __ubsan_handle_* function called to print error message. So the most of the work is done by compiler. This patch just implements ubsan handlers printing errors. GCC has this capability since 4.9.x [1] (see -fsanitize=undefined option and its suboptions). However GCC 5.x has more checkers implemented [2]. Article [3] has a bit more details about UBSAN in the GCC. [1] - https://gcc.gnu.org/onlinedocs/gcc-4.9.0/gcc/Debugging-Options.html [2] - https://gcc.gnu.org/onlinedocs/gcc/Debugging-Options.html [3] - http://developerblog.redhat.com/2014/10/16/gcc-undefined-behavior-sanitizer-ubsan/ Issues which UBSAN has found thus far are: Found bugs: * out-of-bounds access - 97840cb67ff5 ("netfilter: nfnetlink: fix insufficient validation in nfnetlink_bind") undefined shifts: * d48458d4a768 ("jbd2: use a better hash function for the revoke table") * 10632008b9e1 ("clockevents: Prevent shift out of bounds") * 'x << -1' shift in ext4 - http://lkml.kernel.org/r/<5444EF21.8020501@samsung.com> * undefined rol32(0) - http://lkml.kernel.org/r/<1449198241-20654-1-git-send-email-sasha.levin@oracle.com> * undefined dirty_ratelimit calculation - http://lkml.kernel.org/r/<566594E2.3050306@odin.com> * undefined roundown_pow_of_two(0) - http://lkml.kernel.org/r/<1449156616-11474-1-git-send-email-sasha.levin@oracle.com> * [WONTFIX] undefined shift in __bpf_prog_run - http://lkml.kernel.org/r/<CACT4Y+ZxoR3UjLgcNdUm4fECLMx2VdtfrENMtRRCdgHB2n0bJA@mail.gmail.com> WONTFIX here because it should be fixed in bpf program, not in kernel. signed overflows: * 32a8df4e0b33f ("sched: Fix odd values in effective_load() calculations") * mul overflow in ntp - http://lkml.kernel.org/r/<1449175608-1146-1-git-send-email-sasha.levin@oracle.com> * incorrect conversion into rtc_time in rtc_time64_to_tm() - http://lkml.kernel.org/r/<1449187944-11730-1-git-send-email-sasha.levin@oracle.com> * unvalidated timespec in io_getevents() - http://lkml.kernel.org/r/<CACT4Y+bBxVYLQ6LtOKrKtnLthqLHcw-BMp3aqP3mjdAvr9FULQ@mail.gmail.com> * [NOTABUG] signed overflow in ktime_add_safe() - http://lkml.kernel.org/r/<CACT4Y+aJ4muRnWxsUe1CMnA6P8nooO33kwG-c8YZg=0Xc8rJqw@mail.gmail.com> [akpm@linux-foundation.org: fix unused local warning] [akpm@linux-foundation.org: fix __int128 build woes] Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michal Marek <mmarek@suse.cz> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Yury Gribov <y.gribov@samsung.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Kostya Serebryany <kcc@google.com> Cc: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-21 07:00:55 +08:00
select ARCH_HAS_UBSAN_SANITIZE_ALL
select ARCH_HAVE_NMI_SAFE_CMPXCHG
select ARCH_MIGHT_HAVE_ACPI_PDC if ACPI
select ARCH_MIGHT_HAVE_PC_PARPORT
select ARCH_MIGHT_HAVE_PC_SERIO
select ARCH_SUPPORTS_ATOMIC_RMW
select ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT
select ARCH_SUPPORTS_INT128 if X86_64
select ARCH_SUPPORTS_NUMA_BALANCING if X86_64
select ARCH_USE_BUILTIN_BSWAP
select ARCH_USE_CMPXCHG_LOCKREF if X86_64
select ARCH_USE_QUEUED_RWLOCKS
select ARCH_USE_QUEUED_SPINLOCKS
mm: send one IPI per CPU to TLB flush all entries after unmapping pages An IPI is sent to flush remote TLBs when a page is unmapped that was potentially accesssed by other CPUs. There are many circumstances where this happens but the obvious one is kswapd reclaiming pages belonging to a running process as kswapd and the task are likely running on separate CPUs. On small machines, this is not a significant problem but as machine gets larger with more cores and more memory, the cost of these IPIs can be high. This patch uses a simple structure that tracks CPUs that potentially have TLB entries for pages being unmapped. When the unmapping is complete, the full TLB is flushed on the assumption that a refill cost is lower than flushing individual entries. Architectures wishing to do this must give the following guarantee. If a clean page is unmapped and not immediately flushed, the architecture must guarantee that a write to that linear address from a CPU with a cached TLB entry will trap a page fault. This is essentially what the kernel already depends on but the window is much larger with this patch applied and is worth highlighting. The architecture should consider whether the cost of the full TLB flush is higher than sending an IPI to flush each individual entry. An additional architecture helper called flush_tlb_local is required. It's a trivial wrapper with some accounting in the x86 case. The impact of this patch depends on the workload as measuring any benefit requires both mapped pages co-located on the LRU and memory pressure. The case with the biggest impact is multiple processes reading mapped pages taken from the vm-scalability test suite. The test case uses NR_CPU readers of mapped files that consume 10*RAM. Linear mapped reader on a 4-node machine with 64G RAM and 48 CPUs 4.2.0-rc1 4.2.0-rc1 vanilla flushfull-v7 Ops lru-file-mmap-read-elapsed 159.62 ( 0.00%) 120.68 ( 24.40%) Ops lru-file-mmap-read-time_range 30.59 ( 0.00%) 2.80 ( 90.85%) Ops lru-file-mmap-read-time_stddv 6.70 ( 0.00%) 0.64 ( 90.38%) 4.2.0-rc1 4.2.0-rc1 vanilla flushfull-v7 User 581.00 611.43 System 5804.93 4111.76 Elapsed 161.03 122.12 This is showing that the readers completed 24.40% faster with 29% less system CPU time. From vmstats, it is known that the vanilla kernel was interrupted roughly 900K times per second during the steady phase of the test and the patched kernel was interrupts 180K times per second. The impact is lower on a single socket machine. 4.2.0-rc1 4.2.0-rc1 vanilla flushfull-v7 Ops lru-file-mmap-read-elapsed 25.33 ( 0.00%) 20.38 ( 19.54%) Ops lru-file-mmap-read-time_range 0.91 ( 0.00%) 1.44 (-58.24%) Ops lru-file-mmap-read-time_stddv 0.28 ( 0.00%) 0.47 (-65.34%) 4.2.0-rc1 4.2.0-rc1 vanilla flushfull-v7 User 58.09 57.64 System 111.82 76.56 Elapsed 27.29 22.55 It's still a noticeable improvement with vmstat showing interrupts went from roughly 500K per second to 45K per second. The patch will have no impact on workloads with no memory pressure or have relatively few mapped pages. It will have an unpredictable impact on the workload running on the CPU being flushed as it'll depend on how many TLB entries need to be refilled and how long that takes. Worst case, the TLB will be completely cleared of active entries when the target PFNs were not resident at all. [sasha.levin@oracle.com: trace tlb flush after disabling preemption in try_to_unmap_flush] Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Dave Hansen <dave.hansen@intel.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-05 06:47:32 +08:00
select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH if SMP
select ARCH_WANTS_DYNAMIC_TASK_STRUCT
select ARCH_WANT_FRAME_POINTERS
select ARCH_WANT_IPC_PARSE_VERSION if X86_32
select ARCH_WANT_OPTIONAL_GPIOLIB
select BUILDTIME_EXTABLE_SORT
select CLKEVT_I8253
select CLKSRC_I8253 if X86_32
select CLOCKSOURCE_VALIDATE_LAST_CYCLE
select CLOCKSOURCE_WATCHDOG
select CLONE_BACKWARDS if X86_32
select COMPAT_OLD_SIGACTION if IA32_EMULATION
select DCACHE_WORD_ACCESS
EDAC changes, v2: * New APM X-Gene SoC EDAC driver (Loc Ho) * AMD error injection module improvements (Aravind Gopalakrishnan) * Altera Arria 10 support (Thor Thayer) * misc fixes and cleanups all over the place -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJViuInAAoJEBLB8Bhh3lVKHT8QAKkHIMreO8obo09haxNJlfdF BaG7SNEDhvcgQ1B76RsjnjkUpsivvUt+mCYMP+BxcAqFrTA33UZCCOK5tEhGb1wr matRdR6+aezqAl2e/0/Ti25bWOkDxcOeazh2TyezuyIXtaJjOq1oZC7OaYGmxPun NlZY+/uY1eiHlewKsK04y8G8J5i4wGoKnuxBvOyELT90+a+fLfAOshAp0D4r0piB Znv0ydsHlu+Wx57slg1DktlsyswmcGS9WfWwwTlELOLulKgN8wEAVYzUB5pJzNbz ehq0J4wYz95juXADC4M4tEjErHVJNl6PbyMqwt0+XUUJ1NSgOj7Q6iqwxDoZX8km oxiLVydQBtoIzF1LojFKAVZDFnrMKHKwK3RaDaUJjTI90+tVzEU8xsBlUf6+EgD2 Ss2RH8Gfuf52RdtwHh9++T1ur5rM9YNCAm31msq06mcOf0bEtmDbhZ+fVC5mjhqB fIb3hxnk0r2BVg+ZCN/boxGS6RzUtYVcCXaBPDMeHcg9BEEds70KCFEcsX7TvJIg 5/SHI+033MylqkX2zrgDQLj7CQk3R0jaotHVbdhLupyOldcM7r5uF+VO84drNWGN GfM2lpyE/swZWnzKuotgYIGR1XvFjtJAVAyNGIvwP+ajjTsqXzEnLSLClY5LWfYd nSSSMpCCqsEmhoWftOix =Id4f -----END PGP SIGNATURE----- Merge tag 'edac_for_4.2_2' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp Pull EDAC updates from Borislav Petkov: - New APM X-Gene SoC EDAC driver (Loc Ho) - AMD error injection module improvements (Aravind Gopalakrishnan) - Altera Arria 10 support (Thor Thayer) - misc fixes and cleanups all over the place * tag 'edac_for_4.2_2' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp: (28 commits) EDAC: Update Documentation/edac.txt EDAC: Fix typos in Documentation/edac.txt EDAC, mce_amd_inj: Set MISCV on injection EDAC, mce_amd_inj: Move bit preparations before the injection EDAC, mce_amd_inj: Cleanup and simplify README EDAC, altera: Do not allow suspend when EDAC is enabled EDAC, mce_amd_inj: Make inj_type static arm: socfpga: dts: Add Arria10 SDRAM EDAC DTS support EDAC, altera: Add Arria10 EDAC support EDAC, altera: Refactor for Altera CycloneV SoC EDAC, altera: Generalize driver to use DT Memory size EDAC, mce_amd_inj: Add README file EDAC, mce_amd_inj: Add individual permissions field to dfs_node EDAC, mce_amd_inj: Modify flags attribute to use string arguments EDAC, mce_amd_inj: Read out number of MCE banks from the hardware EDAC, mce_amd_inj: Use MCE_INJECT_GET macro for bank node too EDAC, xgene: Fix cpuid abuse EDAC, mpc85xx: Extend error address to 64 bit EDAC, mpc8xxx: Adapt for FSL SoC EDAC, edac_stub: Drop arch-specific include ...
2015-06-25 10:52:06 +08:00
select EDAC_ATOMIC_SCRUB
select EDAC_SUPPORT
select GENERIC_CLOCKEVENTS
select GENERIC_CLOCKEVENTS_BROADCAST if X86_64 || (X86_32 && X86_LOCAL_APIC)
select GENERIC_CLOCKEVENTS_MIN_ADJUST
select GENERIC_CMOS_UPDATE
select GENERIC_CPU_AUTOPROBE
select GENERIC_EARLY_IOREMAP
select GENERIC_FIND_FIRST_BIT
select GENERIC_IOMAP
select GENERIC_IRQ_PROBE
select GENERIC_IRQ_SHOW
select GENERIC_PENDING_IRQ if SMP
select GENERIC_SMP_IDLE_THREAD
select GENERIC_STRNCPY_FROM_USER
select GENERIC_STRNLEN_USER
select GENERIC_TIME_VSYSCALL
select HAVE_ACPI_APEI if ACPI
select HAVE_ACPI_APEI_NMI if ACPI
select HAVE_ALIGNED_STRUCT_PAGE if SLUB
select HAVE_AOUT if X86_32
select HAVE_ARCH_AUDITSYSCALL
select HAVE_ARCH_HUGE_VMAP if X86_64 || X86_PAE
select HAVE_ARCH_JUMP_LABEL
select HAVE_ARCH_KASAN if X86_64 && SPARSEMEM_VMEMMAP
select HAVE_ARCH_KGDB
select HAVE_ARCH_KMEMCHECK
x86: mm: support ARCH_MMAP_RND_BITS x86: arch_mmap_rnd() uses hard-coded values, 8 for 32-bit and 28 for 64-bit, to generate the random offset for the mmap base address. This value represents a compromise between increased ASLR effectiveness and avoiding address-space fragmentation. Replace it with a Kconfig option, which is sensibly bounded, so that platform developers may choose where to place this compromise. Keep default values as new minimums. Signed-off-by: Daniel Cashman <dcashman@google.com> Cc: Russell King <linux@arm.linux.org.uk> Acked-by: Kees Cook <keescook@chromium.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Don Zickus <dzickus@redhat.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Heinrich Schuchardt <xypron.glpk@gmx.de> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: David Rientjes <rientjes@google.com> Cc: Mark Salyzyn <salyzyn@android.com> Cc: Jeff Vander Stoep <jeffv@google.com> Cc: Nick Kralevich <nnk@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Hector Marco-Gisbert <hecmargi@upv.es> Cc: Borislav Petkov <bp@suse.de> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 07:20:06 +08:00
select HAVE_ARCH_MMAP_RND_BITS if MMU
select HAVE_ARCH_MMAP_RND_COMPAT_BITS if MMU && COMPAT
select HAVE_ARCH_SECCOMP_FILTER
select HAVE_ARCH_SOFT_DIRTY if X86_64
select HAVE_ARCH_TRACEHOOK
select HAVE_ARCH_TRANSPARENT_HUGEPAGE
select HAVE_BPF_JIT if X86_64
select HAVE_CC_STACKPROTECTOR
select HAVE_CMPXCHG_DOUBLE
select HAVE_CMPXCHG_LOCAL
select HAVE_CONTEXT_TRACKING if X86_64
select HAVE_COPY_THREAD_TLS
select HAVE_C_RECORDMCOUNT
select HAVE_DEBUG_KMEMLEAK
select HAVE_DEBUG_STACKOVERFLOW
select HAVE_DMA_API_DEBUG
select HAVE_DMA_CONTIGUOUS
select HAVE_DYNAMIC_FTRACE
select HAVE_DYNAMIC_FTRACE_WITH_REGS
select HAVE_EFFICIENT_UNALIGNED_ACCESS
select HAVE_FENTRY if X86_64
select HAVE_FTRACE_MCOUNT_RECORD
select HAVE_FUNCTION_GRAPH_FP_TEST
select HAVE_FUNCTION_GRAPH_TRACER
select HAVE_FUNCTION_TRACER
select HAVE_GENERIC_DMA_COHERENT if X86_32
select HAVE_HW_BREAKPOINT
select HAVE_IDE
select HAVE_IOREMAP_PROT
select HAVE_IRQ_EXIT_ON_IRQ_STACK if X86_64
select HAVE_IRQ_TIME_ACCOUNTING
select HAVE_KERNEL_BZIP2
select HAVE_KERNEL_GZIP
select HAVE_KERNEL_LZ4
select HAVE_KERNEL_LZMA
select HAVE_KERNEL_LZO
select HAVE_KERNEL_XZ
select HAVE_KPROBES
select HAVE_KPROBES_ON_FTRACE
select HAVE_KRETPROBES
select HAVE_KVM
select HAVE_LIVEPATCH if X86_64
select HAVE_MEMBLOCK
select HAVE_MEMBLOCK_NODE_MAP
select HAVE_MIXED_BREAKPOINTS_REGS
select HAVE_OPROFILE
select HAVE_OPTPROBES
select HAVE_PCSPKR_PLATFORM
select HAVE_PERF_EVENTS
select HAVE_PERF_EVENTS_NMI
perf: Unified API to record selective sets of arch registers This brings a new API to help the selective dump of registers on event sampling, and its implementation for x86 arch. Added HAVE_PERF_REGS config option to determine if the architecture provides perf registers ABI. The information about desired registers will be passed in u64 mask. It's up to the architecture to map the registers into the mask bits. For the x86 arch implementation, both 32 and 64 bit registers bits are defined within single enum to ensure 64 bit system can provide register dump for compat task if needed in the future. Original-patch-by: Frederic Weisbecker <fweisbec@gmail.com> [ Added missing linux/errno.h include ] Signed-off-by: Jiri Olsa <jolsa@redhat.com> Cc: "Frank Ch. Eigler" <fche@redhat.com> Cc: Arun Sharma <asharma@fb.com> Cc: Benjamin Redelings <benjamin.redelings@nescent.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Frank Ch. Eigler <fche@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Robert Richter <robert.richter@amd.com> Cc: Stephane Eranian <eranian@google.com> Cc: Tom Zanussi <tzanussi@gmail.com> Cc: Ulrich Drepper <drepper@gmail.com> Link: http://lkml.kernel.org/r/1344345647-11536-2-git-send-email-jolsa@redhat.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2012-08-07 21:20:36 +08:00
select HAVE_PERF_REGS
2012-08-07 21:20:40 +08:00
select HAVE_PERF_USER_STACK_DUMP
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_UID16 if X86_32 || IA32_EMULATION
select HAVE_UNSTABLE_SCHED_CLOCK
select HAVE_USER_RETURN_NOTIFIER
select IRQ_FORCED_THREADING
select MODULES_USE_ELF_RELA if X86_64
select MODULES_USE_ELF_REL if X86_32
select OLD_SIGACTION if X86_32
select OLD_SIGSUSPEND3 if X86_32 || IA32_EMULATION
select PERF_EVENTS
select RTC_LIB
select SPARSE_IRQ
select SRCU
select SYSCTL_EXCEPTION_TRACE
select USER_STACKTRACE_SUPPORT
select VIRT_TO_BUS
select X86_DEV_DMA_OPS if X86_64
select X86_FEATURE_NAMES if PROC_FS
select HAVE_STACK_VALIDATION if X86_64
mm/core, x86/mm/pkeys: Store protection bits in high VMA flags vma->vm_flags is an 'unsigned long', so has space for 32 flags on 32-bit architectures. The high 32 bits are unused on 64-bit platforms. We've steered away from using the unused high VMA bits for things because we would have difficulty supporting it on 32-bit. Protection Keys are not available in 32-bit mode, so there is no concern about supporting this feature in 32-bit mode or on 32-bit CPUs. This patch carves out 4 bits from the high half of vma->vm_flags and allows architectures to set config option to make them available. Sparse complains about these constants unless we explicitly call them "UL". Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave@sr71.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Valentin Rothberg <valentinrothberg@gmail.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xie XiuQi <xiexiuqi@huawei.com> Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20160212210208.81AF00D5@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-13 05:02:08 +08:00
select ARCH_USES_HIGH_VMA_FLAGS if X86_INTEL_MEMORY_PROTECTION_KEYS
select ARCH_HAS_PKEYS if X86_INTEL_MEMORY_PROTECTION_KEYS
config INSTRUCTION_DECODER
def_bool y
depends on KPROBES || PERF_EVENTS || UPROBES
config PERF_EVENTS_INTEL_UNCORE
def_bool y
depends on PERF_EVENTS && CPU_SUP_INTEL && PCI
config OUTPUT_FORMAT
string
default "elf32-i386" if X86_32
default "elf64-x86-64" if X86_64
config ARCH_DEFCONFIG
string
default "arch/x86/configs/i386_defconfig" if X86_32
default "arch/x86/configs/x86_64_defconfig" if X86_64
config LOCKDEP_SUPPORT
def_bool y
config STACKTRACE_SUPPORT
def_bool y
config MMU
def_bool y
x86: mm: support ARCH_MMAP_RND_BITS x86: arch_mmap_rnd() uses hard-coded values, 8 for 32-bit and 28 for 64-bit, to generate the random offset for the mmap base address. This value represents a compromise between increased ASLR effectiveness and avoiding address-space fragmentation. Replace it with a Kconfig option, which is sensibly bounded, so that platform developers may choose where to place this compromise. Keep default values as new minimums. Signed-off-by: Daniel Cashman <dcashman@google.com> Cc: Russell King <linux@arm.linux.org.uk> Acked-by: Kees Cook <keescook@chromium.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Don Zickus <dzickus@redhat.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Heinrich Schuchardt <xypron.glpk@gmx.de> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: David Rientjes <rientjes@google.com> Cc: Mark Salyzyn <salyzyn@android.com> Cc: Jeff Vander Stoep <jeffv@google.com> Cc: Nick Kralevich <nnk@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Hector Marco-Gisbert <hecmargi@upv.es> Cc: Borislav Petkov <bp@suse.de> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 07:20:06 +08:00
config ARCH_MMAP_RND_BITS_MIN
default 28 if 64BIT
default 8
config ARCH_MMAP_RND_BITS_MAX
default 32 if 64BIT
default 16
config ARCH_MMAP_RND_COMPAT_BITS_MIN
default 8
config ARCH_MMAP_RND_COMPAT_BITS_MAX
default 16
config SBUS
bool
config NEED_DMA_MAP_STATE
def_bool y
2015-04-18 03:04:48 +08:00
depends on X86_64 || INTEL_IOMMU || DMA_API_DEBUG || SWIOTLB
config NEED_SG_DMA_LENGTH
def_bool y
config GENERIC_ISA_DMA
def_bool y
depends on ISA_DMA_API
config GENERIC_BUG
def_bool y
depends on BUG
select GENERIC_BUG_RELATIVE_POINTERS if X86_64
config GENERIC_BUG_RELATIVE_POINTERS
bool
config GENERIC_HWEIGHT
def_bool y
config ARCH_MAY_HAVE_PC_FDC
def_bool y
depends on ISA_DMA_API
config RWSEM_XCHGADD_ALGORITHM
def_bool y
config GENERIC_CALIBRATE_DELAY
def_bool y
config ARCH_HAS_CPU_RELAX
def_bool y
config ARCH_HAS_CACHE_LINE_SIZE
def_bool y
config HAVE_SETUP_PER_CPU_AREA
def_bool y
config NEED_PER_CPU_EMBED_FIRST_CHUNK
def_bool y
config NEED_PER_CPU_PAGE_FIRST_CHUNK
def_bool y
config ARCH_HIBERNATION_POSSIBLE
def_bool y
config ARCH_SUSPEND_POSSIBLE
def_bool y
config ARCH_WANT_HUGE_PMD_SHARE
def_bool y
config ARCH_WANT_GENERAL_HUGETLB
def_bool y
config ZONE_DMA32
def_bool y if X86_64
config AUDIT_ARCH
def_bool y if X86_64
config ARCH_SUPPORTS_OPTIMIZED_INLINING
def_bool y
config ARCH_SUPPORTS_DEBUG_PAGEALLOC
def_bool y
config KASAN_SHADOW_OFFSET
hex
depends on KASAN
default 0xdffffc0000000000
config HAVE_INTEL_TXT
def_bool y
depends on INTEL_IOMMU && ACPI
config X86_32_SMP
def_bool y
depends on X86_32 && SMP
config X86_64_SMP
def_bool y
depends on X86_64 && SMP
config X86_32_LAZY_GS
def_bool y
depends on X86_32 && !CC_STACKPROTECTOR
config ARCH_HWEIGHT_CFLAGS
string
default "-fcall-saved-ecx -fcall-saved-edx" if X86_32
default "-fcall-saved-rdi -fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11" if X86_64
uprobes, mm, x86: Add the ability to install and remove uprobes breakpoints Add uprobes support to the core kernel, with x86 support. This commit adds the kernel facilities, the actual uprobes user-space ABI and perf probe support comes in later commits. General design: Uprobes are maintained in an rb-tree indexed by inode and offset (the offset here is from the start of the mapping). For a unique (inode, offset) tuple, there can be at most one uprobe in the rb-tree. Since the (inode, offset) tuple identifies a unique uprobe, more than one user may be interested in the same uprobe. This provides the ability to connect multiple 'consumers' to the same uprobe. Each consumer defines a handler and a filter (optional). The 'handler' is run every time the uprobe is hit, if it matches the 'filter' criteria. The first consumer of a uprobe causes the breakpoint to be inserted at the specified address and subsequent consumers are appended to this list. On subsequent probes, the consumer gets appended to the existing list of consumers. The breakpoint is removed when the last consumer unregisters. For all other unregisterations, the consumer is removed from the list of consumers. Given a inode, we get a list of the mms that have mapped the inode. Do the actual registration if mm maps the page where a probe needs to be inserted/removed. We use a temporary list to walk through the vmas that map the inode. - The number of maps that map the inode, is not known before we walk the rmap and keeps changing. - extending vm_area_struct wasn't recommended, it's a size-critical data structure. - There can be more than one maps of the inode in the same mm. We add callbacks to the mmap methods to keep an eye on text vmas that are of interest to uprobes. When a vma of interest is mapped, we insert the breakpoint at the right address. Uprobe works by replacing the instruction at the address defined by (inode, offset) with the arch specific breakpoint instruction. We save a copy of the original instruction at the uprobed address. This is needed for: a. executing the instruction out-of-line (xol). b. instruction analysis for any subsequent fixups. c. restoring the instruction back when the uprobe is unregistered. We insert or delete a breakpoint instruction, and this breakpoint instruction is assumed to be the smallest instruction available on the platform. For fixed size instruction platforms this is trivially true, for variable size instruction platforms the breakpoint instruction is typically the smallest (often a single byte). Writing the instruction is done by COWing the page and changing the instruction during the copy, this even though most platforms allow atomic writes of the breakpoint instruction. This also mirrors the behaviour of a ptrace() memory write to a PRIVATE file map. The core worker is derived from KSM's replace_page() logic. In essence, similar to KSM: a. allocate a new page and copy over contents of the page that has the uprobed vaddr b. modify the copy and insert the breakpoint at the required address c. switch the original page with the copy containing the breakpoint d. flush page tables. replace_page() is being replicated here because of some minor changes in the type of pages and also because Hugh Dickins had plans to improve replace_page() for KSM specific work. Instruction analysis on x86 is based on instruction decoder and determines if an instruction can be probed and determines the necessary fixups after singlestep. Instruction analysis is done at probe insertion time so that we avoid having to repeat the same analysis every time a probe is hit. A lot of code here is due to the improvement/suggestions/inputs from Peter Zijlstra. Changelog: (v10): - Add code to clear REX.B prefix as suggested by Denys Vlasenko and Masami Hiramatsu. (v9): - Use insn_offset_modrm as suggested by Masami Hiramatsu. (v7): Handle comments from Peter Zijlstra: - Dont take reference to inode. (expect inode to uprobe_register to be sane). - Use PTR_ERR to set the return value. - No need to take reference to inode. - use PTR_ERR to return error value. - register and uprobe_unregister share code. (v5): - Modified del_consumer as per comments from Peter. - Drop reference to inode before dropping reference to uprobe. - Use i_size_read(inode) instead of inode->i_size. - Ensure uprobe->consumers is NULL, before __uprobe_unregister() is called. - Includes errno.h as recommended by Stephen Rothwell to fix a build issue on sparc defconfig - Remove restrictions while unregistering. - Earlier code leaked inode references under some conditions while registering/unregistering. - Continue the vma-rmap walk even if the intermediate vma doesnt meet the requirements. - Validate the vma found by find_vma before inserting/removing the breakpoint - Call del_consumer under mutex_lock. - Use hash locks. - Handle mremap. - Introduce find_least_offset_node() instead of close match logic in find_uprobe - Uprobes no more depends on MM_OWNER; No reference to task_structs while inserting/removing a probe. - Uses read_mapping_page instead of grab_cache_page so that the pages have valid content. - pass NULL to get_user_pages for the task parameter. - call SetPageUptodate on the new page allocated in write_opcode. - fix leaking a reference to the new page under certain conditions. - Include Instruction Decoder if Uprobes gets defined. - Remove const attributes for instruction prefix arrays. - Uses mm_context to know if the application is 32 bit. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Also-written-by: Jim Keniston <jkenisto@us.ibm.com> Reviewed-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Roland McGrath <roland@hack.frob.com> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Anton Arapov <anton@redhat.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Denys Vlasenko <vda.linux@googlemail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linux-mm <linux-mm@kvack.org> Link: http://lkml.kernel.org/r/20120209092642.GE16600@linux.vnet.ibm.com [ Made various small edits to the commit log ] Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-09 17:26:42 +08:00
config ARCH_SUPPORTS_UPROBES
def_bool y
config FIX_EARLYCON_MEM
def_bool y
config DEBUG_RODATA
def_bool y
config PGTABLE_LEVELS
int
default 4 if X86_64
default 3 if X86_PAE
default 2
source "init/Kconfig"
container freezer: implement freezer cgroup subsystem This patch implements a new freezer subsystem in the control groups framework. It provides a way to stop and resume execution of all tasks in a cgroup by writing in the cgroup filesystem. The freezer subsystem in the container filesystem defines a file named freezer.state. Writing "FROZEN" to the state file will freeze all tasks in the cgroup. Subsequently writing "RUNNING" will unfreeze the tasks in the cgroup. Reading will return the current state. * Examples of usage : # mkdir /containers/freezer # mount -t cgroup -ofreezer freezer /containers # mkdir /containers/0 # echo $some_pid > /containers/0/tasks to get status of the freezer subsystem : # cat /containers/0/freezer.state RUNNING to freeze all tasks in the container : # echo FROZEN > /containers/0/freezer.state # cat /containers/0/freezer.state FREEZING # cat /containers/0/freezer.state FROZEN to unfreeze all tasks in the container : # echo RUNNING > /containers/0/freezer.state # cat /containers/0/freezer.state RUNNING This is the basic mechanism which should do the right thing for user space task in a simple scenario. It's important to note that freezing can be incomplete. In that case we return EBUSY. This means that some tasks in the cgroup are busy doing something that prevents us from completely freezing the cgroup at this time. After EBUSY, the cgroup will remain partially frozen -- reflected by freezer.state reporting "FREEZING" when read. The state will remain "FREEZING" until one of these things happens: 1) Userspace cancels the freezing operation by writing "RUNNING" to the freezer.state file 2) Userspace retries the freezing operation by writing "FROZEN" to the freezer.state file (writing "FREEZING" is not legal and returns EIO) 3) The tasks that blocked the cgroup from entering the "FROZEN" state disappear from the cgroup's set of tasks. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: export thaw_process] Signed-off-by: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Matt Helsley <matthltc@us.ibm.com> Acked-by: Serge E. Hallyn <serue@us.ibm.com> Tested-by: Matt Helsley <matthltc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-19 11:27:21 +08:00
source "kernel/Kconfig.freezer"
menu "Processor type and features"
config ZONE_DMA
bool "DMA memory allocation support" if EXPERT
default y
help
DMA memory allocation support allows devices with less than 32-bit
addressing to allocate within the first 16MB of address space.
Disable if no such devices will be used.
If unsure, say Y.
config SMP
bool "Symmetric multi-processing support"
---help---
This enables support for systems with more than one CPU. If you have
a system with only one CPU, say N. If you have a system with more
than one CPU, say Y.
If you say N here, the kernel will run on uni- and multiprocessor
machines, but will use only one CPU of a multiprocessor machine. If
you say Y here, the kernel will run on many, but not all,
uniprocessor machines. On a uniprocessor machine, the kernel
will run faster if you say N here.
Note that if you say Y here and choose architecture "586" or
"Pentium" under "Processor family", the kernel will not work on 486
architectures. Similarly, multiprocessor kernels for the "PPro"
architecture may not work on all Pentium based boards.
People using multiprocessor machines who say Y here should also say
Y to "Enhanced Real Time Clock Support", below. The "Advanced Power
Management" code will be disabled if you say Y here.
See also <file:Documentation/x86/i386/IO-APIC.txt>,
<file:Documentation/nmi_watchdog.txt> and the SMP-HOWTO available at
<http://www.tldp.org/docs.html#howto>.
If you don't know what to do here, say N.
config X86_FEATURE_NAMES
bool "Processor feature human-readable names" if EMBEDDED
default y
---help---
This option compiles in a table of x86 feature bits and corresponding
names. This is required to support /proc/cpuinfo and a few kernel
messages. You can disable this to save space, at the expense of
making those few kernel messages show numeric feature bits instead.
If in doubt, say Y.
config X86_FAST_FEATURE_TESTS
bool "Fast CPU feature tests" if EMBEDDED
default y
---help---
Some fast-paths in the kernel depend on the capabilities of the CPU.
Say Y here for the kernel to patch in the appropriate code at runtime
based on the capabilities of the CPU. The infrastructure for patching
code at runtime takes up some additional space; space-constrained
embedded systems may wish to say N here to produce smaller, slightly
slower code.
config X86_X2APIC
bool "Support x2apic"
depends on X86_LOCAL_APIC && X86_64 && (IRQ_REMAP || HYPERVISOR_GUEST)
---help---
This enables x2apic support on CPUs that have this feature.
This allows 32-bit apic IDs (so it can support very large systems),
and accesses the local apic via MSRs not via mmio.
If you don't know what to do here, say N.
config X86_MPPARSE
bool "Enable MPS table" if ACPI || SFI
default y
depends on X86_LOCAL_APIC
---help---
For old smp systems that do not have proper acpi support. Newer systems
(esp with 64bit cpus) with acpi support, MADT and DSDT will override it
config X86_BIGSMP
bool "Support for big SMP systems with more than 8 CPUs"
depends on X86_32 && SMP
---help---
This option is needed for the systems that have more than 8 CPUs
config GOLDFISH
def_bool y
depends on X86_GOLDFISH
if X86_32
config X86_EXTENDED_PLATFORM
bool "Support for extended (non-PC) x86 platforms"
default y
---help---
If you disable this option then the kernel will only support
standard PC platforms. (which covers the vast majority of
systems out there.)
If you enable this option then you'll be able to select support
for the following (non-PC) 32 bit x86 platforms:
Goldfish (Android emulator)
AMD Elan
RDC R-321x SoC
SGI 320/540 (Visual Workstation)
STA2X11-based (e.g. Northville)
Moorestown MID devices
If you have one of these systems, or if you want to build a
generic distribution kernel, say Y here - otherwise say N.
endif
if X86_64
config X86_EXTENDED_PLATFORM
bool "Support for extended (non-PC) x86 platforms"
default y
---help---
If you disable this option then the kernel will only support
standard PC platforms. (which covers the vast majority of
systems out there.)
If you enable this option then you'll be able to select support
for the following (non-PC) 64 bit x86 platforms:
Numascale NumaChip
ScaleMP vSMP
SGI Ultraviolet
If you have one of these systems, or if you want to build a
generic distribution kernel, say Y here - otherwise say N.
endif
# This is an alphabetically sorted list of 64 bit extended platforms
# Please maintain the alphabetic order if and when there are additions
config X86_NUMACHIP
bool "Numascale NumaChip"
depends on X86_64
depends on X86_EXTENDED_PLATFORM
depends on NUMA
depends on SMP
depends on X86_X2APIC
depends on PCI_MMCONFIG
---help---
Adds support for Numascale NumaChip large-SMP systems. Needed to
enable more than ~168 cores.
If you don't have one of these, you should say N here.
config X86_VSMP
bool "ScaleMP vSMP"
select HYPERVISOR_GUEST
select PARAVIRT
depends on X86_64 && PCI
depends on X86_EXTENDED_PLATFORM
depends on SMP
---help---
Support for ScaleMP vSMP systems. Say 'Y' here if this kernel is
supposed to run on these EM64T-based machines. Only choose this option
if you have one of these machines.
config X86_UV
bool "SGI Ultraviolet"
depends on X86_64
depends on X86_EXTENDED_PLATFORM
depends on NUMA
depends on EFI
depends on X86_X2APIC
depends on PCI
---help---
This option is needed in order to support SGI Ultraviolet systems.
If you don't have one of these, you should say N here.
# Following is an alphabetically sorted list of 32 bit extended platforms
# Please maintain the alphabetic order if and when there are additions
config X86_GOLDFISH
bool "Goldfish (Virtual Platform)"
depends on X86_EXTENDED_PLATFORM
---help---
Enable support for the Goldfish virtual platform used primarily
for Android development. Unless you are building for the Android
Goldfish emulator say N here.
config X86_INTEL_CE
bool "CE4100 TV platform"
depends on PCI
depends on PCI_GODIRECT
depends on X86_IO_APIC
depends on X86_32
depends on X86_EXTENDED_PLATFORM
select X86_REBOOTFIXUPS
select OF
select OF_EARLY_FLATTREE
---help---
Select for the Intel CE media processor (CE4100) SOC.
This option compiles in support for the CE4100 SOC for settop
boxes and media devices.
config X86_INTEL_MID
bool "Intel MID platform support"
depends on X86_EXTENDED_PLATFORM
depends on X86_PLATFORM_DEVICES
depends on PCI
depends on X86_64 || (PCI_GOANY && X86_32)
depends on X86_IO_APIC
select SFI
select I2C
select DW_APB_TIMER
select APB_TIMER
select INTEL_SCU_IPC
select MFD_INTEL_MSIC
---help---
Select to build a kernel capable of supporting Intel MID (Mobile
Internet Device) platform systems which do not have the PCI legacy
interfaces. If you are building for a PC class system say N here.
Intel MID platforms are based on an Intel processor and chipset which
consume less power than most of the x86 derivatives.
config X86_INTEL_QUARK
bool "Intel Quark platform support"
depends on X86_32
depends on X86_EXTENDED_PLATFORM
depends on X86_PLATFORM_DEVICES
depends on X86_TSC
depends on PCI
depends on PCI_GOANY
depends on X86_IO_APIC
select IOSF_MBI
select INTEL_IMR
select COMMON_CLK
---help---
Select to include support for Quark X1000 SoC.
Say Y here if you have a Quark based system such as the Arduino
compatible Intel Galileo.
config X86_INTEL_LPSS
bool "Intel Low Power Subsystem Support"
depends on X86 && ACPI
select COMMON_CLK
select PINCTRL
select IOSF_MBI
---help---
Select to build support for Intel Low Power Subsystem such as
found on Intel Lynxpoint PCH. Selecting this option enables
things like clock tree (common clock framework) and pincontrol
which are needed by the LPSS peripheral drivers.
config X86_AMD_PLATFORM_DEVICE
bool "AMD ACPI2Platform devices support"
depends on ACPI
select COMMON_CLK
select PINCTRL
---help---
Select to interpret AMD specific ACPI device to platform device
such as I2C, UART, GPIO found on AMD Carrizo and later chipsets.
I2C and UART depend on COMMON_CLK to set clock. GPIO driver is
implemented under PINCTRL subsystem.
config IOSF_MBI
tristate "Intel SoC IOSF Sideband support for SoC platforms"
depends on PCI
---help---
This option enables sideband register access support for Intel SoC
platforms. On these platforms the IOSF sideband is used in lieu of
MSR's for some register accesses, mostly but not limited to thermal
and power. Drivers may query the availability of this device to
determine if they need the sideband in order to work on these
platforms. The sideband is available on the following SoC products.
This list is not meant to be exclusive.
- BayTrail
- Braswell
- Quark
You should say Y if you are running a kernel on one of these SoC's.
config IOSF_MBI_DEBUG
bool "Enable IOSF sideband access through debugfs"
depends on IOSF_MBI && DEBUG_FS
---help---
Select this option to expose the IOSF sideband access registers (MCR,
MDR, MCRX) through debugfs to write and read register information from
different units on the SoC. This is most useful for obtaining device
state information for debug and analysis. As this is a general access
mechanism, users of this option would have specific knowledge of the
device they want to access.
If you don't require the option or are in doubt, say N.
config X86_RDC321X
bool "RDC R-321x SoC"
depends on X86_32
depends on X86_EXTENDED_PLATFORM
select M486
select X86_REBOOTFIXUPS
---help---
This option is needed for RDC R-321x system-on-chip, also known
as R-8610-(G).
If you don't have one of these chips, you should say N here.
config X86_32_NON_STANDARD
bool "Support non-standard 32-bit SMP architectures"
depends on X86_32 && SMP
depends on X86_EXTENDED_PLATFORM
---help---
This option compiles in the bigsmp and STA2X11 default
subarchitectures. It is intended for a generic binary
kernel. If you select them all, kernel will probe it one by
one and will fallback to default.
# Alphabetically sorted list of Non standard 32 bit platforms
config X86_SUPPORTS_MEMORY_FAILURE
def_bool y
# MCE code calls memory_failure():
depends on X86_MCE
# On 32-bit this adds too big of NODES_SHIFT and we run out of page flags:
# On 32-bit SPARSEMEM adds too big of SECTIONS_WIDTH:
depends on X86_64 || !SPARSEMEM
select ARCH_SUPPORTS_MEMORY_FAILURE
config STA2X11
bool "STA2X11 Companion Chip Support"
depends on X86_32_NON_STANDARD && PCI
select X86_DEV_DMA_OPS
select X86_DMA_REMAP
select SWIOTLB
select MFD_STA2X11
select ARCH_REQUIRE_GPIOLIB
default n
---help---
This adds support for boards based on the STA2X11 IO-Hub,
a.k.a. "ConneXt". The chip is used in place of the standard
PC chipset, so all "standard" peripherals are missing. If this
option is selected the kernel will still be able to boot on
standard PC machines.
config X86_32_IRIS
tristate "Eurobraille/Iris poweroff module"
depends on X86_32
---help---
The Iris machines from EuroBraille do not have APM or ACPI support
to shut themselves down properly. A special I/O sequence is
needed to do so, which is what this module does at
kernel shutdown.
This is only for Iris machines from EuroBraille.
If unused, say N.
config SCHED_OMIT_FRAME_POINTER
def_bool y
prompt "Single-depth WCHAN output"
depends on X86
---help---
Calculate simpler /proc/<PID>/wchan values. If this option
is disabled then wchan values will recurse back to the
caller function. This provides more accurate wchan values,
at the expense of slightly more scheduling overhead.
If in doubt, say "Y".
menuconfig HYPERVISOR_GUEST
bool "Linux guest support"
---help---
Say Y here to enable options for running Linux under various hyper-
visors. This option enables basic hypervisor detection and platform
setup.
If you say N, all options in this submenu will be skipped and
disabled, and Linux guest support won't be built in.
if HYPERVISOR_GUEST
config PARAVIRT
bool "Enable paravirtualization code"
---help---
This changes the kernel so it can modify itself when it is run
under a hypervisor, potentially improving performance significantly
over full virtualization. However, when run without a hypervisor
the kernel is theoretically slower and slightly larger.
config PARAVIRT_DEBUG
bool "paravirt-ops debugging"
depends on PARAVIRT && DEBUG_KERNEL
---help---
Enable to debug paravirt_ops internals. Specifically, BUG if
a paravirt_op is missing when it is called.
x86: Fix performance regression caused by paravirt_ops on native kernels Xiaohui Xin and some other folks at Intel have been looking into what's behind the performance hit of paravirt_ops when running native. It appears that the hit is entirely due to the paravirtualized spinlocks introduced by: | commit 8efcbab674de2bee45a2e4cdf97de16b8e609ac8 | Date: Mon Jul 7 12:07:51 2008 -0700 | | paravirt: introduce a "lock-byte" spinlock implementation The extra call/return in the spinlock path is somehow causing an increase in the cycles/instruction of somewhere around 2-7% (seems to vary quite a lot from test to test). The working theory is that the CPU's pipeline is getting upset about the call->call->locked-op->return->return, and seems to be failing to speculate (though I haven't seen anything definitive about the precise reasons). This doesn't entirely make sense, because the performance hit is also visible on unlock and other operations which don't involve locked instructions. But spinlock operations clearly swamp all the other pvops operations, even though I can't imagine that they're nearly as common (there's only a .05% increase in instructions executed). If I disable just the pv-spinlock calls, my tests show that pvops is identical to non-pvops performance on native (my measurements show that it is actually about .1% faster, but Xiaohui shows a .05% slowdown). Summary of results, averaging 10 runs of the "mmperf" test, using a no-pvops build as baseline: nopv Pv-nospin Pv-spin CPU cycles 100.00% 99.89% 102.18% instructions 100.00% 100.10% 100.15% CPI 100.00% 99.79% 102.03% cache ref 100.00% 100.84% 100.28% cache miss 100.00% 90.47% 88.56% cache miss rate 100.00% 89.72% 88.31% branches 100.00% 99.93% 100.04% branch miss 100.00% 103.66% 107.72% branch miss rt 100.00% 103.73% 107.67% wallclock 100.00% 99.90% 102.20% The clear effect here is that the 2% increase in CPI is directly reflected in the final wallclock time. (The other interesting effect is that the more ops are out of line calls via pvops, the lower the cache access and miss rates. Not too surprising, but it suggests that the non-pvops kernel is over-inlined. On the flipside, the branch misses go up correspondingly...) So, what's the fix? Paravirt patching turns all the pvops calls into direct calls, so _spin_lock etc do end up having direct calls. For example, the compiler generated code for paravirtualized _spin_lock is: <_spin_lock+0>: mov %gs:0xb4c8,%rax <_spin_lock+9>: incl 0xffffffffffffe044(%rax) <_spin_lock+15>: callq *0xffffffff805a5b30 <_spin_lock+22>: retq The indirect call will get patched to: <_spin_lock+0>: mov %gs:0xb4c8,%rax <_spin_lock+9>: incl 0xffffffffffffe044(%rax) <_spin_lock+15>: callq <__ticket_spin_lock> <_spin_lock+20>: nop; nop /* or whatever 2-byte nop */ <_spin_lock+22>: retq One possibility is to inline _spin_lock, etc, when building an optimised kernel (ie, when there's no spinlock/preempt instrumentation/debugging enabled). That will remove the outer call/return pair, returning the instruction stream to a single call/return, which will presumably execute the same as the non-pvops case. The downsides arel 1) it will replicate the preempt_disable/enable code at eack lock/unlock callsite; this code is fairly small, but not nothing; and 2) the spinlock definitions are already a very heavily tangled mass of #ifdefs and other preprocessor magic, and making any changes will be non-trivial. The other obvious answer is to disable pv-spinlocks. Making them a separate config option is fairly easy, and it would be trivial to enable them only when Xen is enabled (as the only non-default user). But it doesn't really address the common case of a distro build which is going to have Xen support enabled, and leaves the open question of whether the native performance cost of pv-spinlocks is worth the performance improvement on a loaded Xen system (10% saving of overall system CPU when guests block rather than spin). Still it is a reasonable short-term workaround. [ Impact: fix pvops performance regression when running native ] Analysed-by: "Xin Xiaohui" <xiaohui.xin@intel.com> Analysed-by: "Li Xin" <xin.li@intel.com> Analysed-by: "Nakajima Jun" <jun.nakajima@intel.com> Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Acked-by: H. Peter Anvin <hpa@zytor.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Xen-devel <xen-devel@lists.xensource.com> LKML-Reference: <4A0B62F7.5030802@goop.org> [ fixed the help text ] Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-14 08:16:55 +08:00
config PARAVIRT_SPINLOCKS
bool "Paravirtualization layer for spinlocks"
depends on PARAVIRT && SMP
select UNINLINE_SPIN_UNLOCK if !QUEUED_SPINLOCKS
x86: Fix performance regression caused by paravirt_ops on native kernels Xiaohui Xin and some other folks at Intel have been looking into what's behind the performance hit of paravirt_ops when running native. It appears that the hit is entirely due to the paravirtualized spinlocks introduced by: | commit 8efcbab674de2bee45a2e4cdf97de16b8e609ac8 | Date: Mon Jul 7 12:07:51 2008 -0700 | | paravirt: introduce a "lock-byte" spinlock implementation The extra call/return in the spinlock path is somehow causing an increase in the cycles/instruction of somewhere around 2-7% (seems to vary quite a lot from test to test). The working theory is that the CPU's pipeline is getting upset about the call->call->locked-op->return->return, and seems to be failing to speculate (though I haven't seen anything definitive about the precise reasons). This doesn't entirely make sense, because the performance hit is also visible on unlock and other operations which don't involve locked instructions. But spinlock operations clearly swamp all the other pvops operations, even though I can't imagine that they're nearly as common (there's only a .05% increase in instructions executed). If I disable just the pv-spinlock calls, my tests show that pvops is identical to non-pvops performance on native (my measurements show that it is actually about .1% faster, but Xiaohui shows a .05% slowdown). Summary of results, averaging 10 runs of the "mmperf" test, using a no-pvops build as baseline: nopv Pv-nospin Pv-spin CPU cycles 100.00% 99.89% 102.18% instructions 100.00% 100.10% 100.15% CPI 100.00% 99.79% 102.03% cache ref 100.00% 100.84% 100.28% cache miss 100.00% 90.47% 88.56% cache miss rate 100.00% 89.72% 88.31% branches 100.00% 99.93% 100.04% branch miss 100.00% 103.66% 107.72% branch miss rt 100.00% 103.73% 107.67% wallclock 100.00% 99.90% 102.20% The clear effect here is that the 2% increase in CPI is directly reflected in the final wallclock time. (The other interesting effect is that the more ops are out of line calls via pvops, the lower the cache access and miss rates. Not too surprising, but it suggests that the non-pvops kernel is over-inlined. On the flipside, the branch misses go up correspondingly...) So, what's the fix? Paravirt patching turns all the pvops calls into direct calls, so _spin_lock etc do end up having direct calls. For example, the compiler generated code for paravirtualized _spin_lock is: <_spin_lock+0>: mov %gs:0xb4c8,%rax <_spin_lock+9>: incl 0xffffffffffffe044(%rax) <_spin_lock+15>: callq *0xffffffff805a5b30 <_spin_lock+22>: retq The indirect call will get patched to: <_spin_lock+0>: mov %gs:0xb4c8,%rax <_spin_lock+9>: incl 0xffffffffffffe044(%rax) <_spin_lock+15>: callq <__ticket_spin_lock> <_spin_lock+20>: nop; nop /* or whatever 2-byte nop */ <_spin_lock+22>: retq One possibility is to inline _spin_lock, etc, when building an optimised kernel (ie, when there's no spinlock/preempt instrumentation/debugging enabled). That will remove the outer call/return pair, returning the instruction stream to a single call/return, which will presumably execute the same as the non-pvops case. The downsides arel 1) it will replicate the preempt_disable/enable code at eack lock/unlock callsite; this code is fairly small, but not nothing; and 2) the spinlock definitions are already a very heavily tangled mass of #ifdefs and other preprocessor magic, and making any changes will be non-trivial. The other obvious answer is to disable pv-spinlocks. Making them a separate config option is fairly easy, and it would be trivial to enable them only when Xen is enabled (as the only non-default user). But it doesn't really address the common case of a distro build which is going to have Xen support enabled, and leaves the open question of whether the native performance cost of pv-spinlocks is worth the performance improvement on a loaded Xen system (10% saving of overall system CPU when guests block rather than spin). Still it is a reasonable short-term workaround. [ Impact: fix pvops performance regression when running native ] Analysed-by: "Xin Xiaohui" <xiaohui.xin@intel.com> Analysed-by: "Li Xin" <xin.li@intel.com> Analysed-by: "Nakajima Jun" <jun.nakajima@intel.com> Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Acked-by: H. Peter Anvin <hpa@zytor.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Xen-devel <xen-devel@lists.xensource.com> LKML-Reference: <4A0B62F7.5030802@goop.org> [ fixed the help text ] Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-14 08:16:55 +08:00
---help---
Paravirtualized spinlocks allow a pvops backend to replace the
spinlock implementation with something virtualization-friendly
(for example, block the virtual CPU rather than spinning).
It has a minimal impact on native kernels and gives a nice performance
benefit on paravirtualized KVM / Xen kernels.
x86: Fix performance regression caused by paravirt_ops on native kernels Xiaohui Xin and some other folks at Intel have been looking into what's behind the performance hit of paravirt_ops when running native. It appears that the hit is entirely due to the paravirtualized spinlocks introduced by: | commit 8efcbab674de2bee45a2e4cdf97de16b8e609ac8 | Date: Mon Jul 7 12:07:51 2008 -0700 | | paravirt: introduce a "lock-byte" spinlock implementation The extra call/return in the spinlock path is somehow causing an increase in the cycles/instruction of somewhere around 2-7% (seems to vary quite a lot from test to test). The working theory is that the CPU's pipeline is getting upset about the call->call->locked-op->return->return, and seems to be failing to speculate (though I haven't seen anything definitive about the precise reasons). This doesn't entirely make sense, because the performance hit is also visible on unlock and other operations which don't involve locked instructions. But spinlock operations clearly swamp all the other pvops operations, even though I can't imagine that they're nearly as common (there's only a .05% increase in instructions executed). If I disable just the pv-spinlock calls, my tests show that pvops is identical to non-pvops performance on native (my measurements show that it is actually about .1% faster, but Xiaohui shows a .05% slowdown). Summary of results, averaging 10 runs of the "mmperf" test, using a no-pvops build as baseline: nopv Pv-nospin Pv-spin CPU cycles 100.00% 99.89% 102.18% instructions 100.00% 100.10% 100.15% CPI 100.00% 99.79% 102.03% cache ref 100.00% 100.84% 100.28% cache miss 100.00% 90.47% 88.56% cache miss rate 100.00% 89.72% 88.31% branches 100.00% 99.93% 100.04% branch miss 100.00% 103.66% 107.72% branch miss rt 100.00% 103.73% 107.67% wallclock 100.00% 99.90% 102.20% The clear effect here is that the 2% increase in CPI is directly reflected in the final wallclock time. (The other interesting effect is that the more ops are out of line calls via pvops, the lower the cache access and miss rates. Not too surprising, but it suggests that the non-pvops kernel is over-inlined. On the flipside, the branch misses go up correspondingly...) So, what's the fix? Paravirt patching turns all the pvops calls into direct calls, so _spin_lock etc do end up having direct calls. For example, the compiler generated code for paravirtualized _spin_lock is: <_spin_lock+0>: mov %gs:0xb4c8,%rax <_spin_lock+9>: incl 0xffffffffffffe044(%rax) <_spin_lock+15>: callq *0xffffffff805a5b30 <_spin_lock+22>: retq The indirect call will get patched to: <_spin_lock+0>: mov %gs:0xb4c8,%rax <_spin_lock+9>: incl 0xffffffffffffe044(%rax) <_spin_lock+15>: callq <__ticket_spin_lock> <_spin_lock+20>: nop; nop /* or whatever 2-byte nop */ <_spin_lock+22>: retq One possibility is to inline _spin_lock, etc, when building an optimised kernel (ie, when there's no spinlock/preempt instrumentation/debugging enabled). That will remove the outer call/return pair, returning the instruction stream to a single call/return, which will presumably execute the same as the non-pvops case. The downsides arel 1) it will replicate the preempt_disable/enable code at eack lock/unlock callsite; this code is fairly small, but not nothing; and 2) the spinlock definitions are already a very heavily tangled mass of #ifdefs and other preprocessor magic, and making any changes will be non-trivial. The other obvious answer is to disable pv-spinlocks. Making them a separate config option is fairly easy, and it would be trivial to enable them only when Xen is enabled (as the only non-default user). But it doesn't really address the common case of a distro build which is going to have Xen support enabled, and leaves the open question of whether the native performance cost of pv-spinlocks is worth the performance improvement on a loaded Xen system (10% saving of overall system CPU when guests block rather than spin). Still it is a reasonable short-term workaround. [ Impact: fix pvops performance regression when running native ] Analysed-by: "Xin Xiaohui" <xiaohui.xin@intel.com> Analysed-by: "Li Xin" <xin.li@intel.com> Analysed-by: "Nakajima Jun" <jun.nakajima@intel.com> Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Acked-by: H. Peter Anvin <hpa@zytor.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Xen-devel <xen-devel@lists.xensource.com> LKML-Reference: <4A0B62F7.5030802@goop.org> [ fixed the help text ] Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-14 08:16:55 +08:00
If you are unsure how to answer this question, answer Y.
x86: Fix performance regression caused by paravirt_ops on native kernels Xiaohui Xin and some other folks at Intel have been looking into what's behind the performance hit of paravirt_ops when running native. It appears that the hit is entirely due to the paravirtualized spinlocks introduced by: | commit 8efcbab674de2bee45a2e4cdf97de16b8e609ac8 | Date: Mon Jul 7 12:07:51 2008 -0700 | | paravirt: introduce a "lock-byte" spinlock implementation The extra call/return in the spinlock path is somehow causing an increase in the cycles/instruction of somewhere around 2-7% (seems to vary quite a lot from test to test). The working theory is that the CPU's pipeline is getting upset about the call->call->locked-op->return->return, and seems to be failing to speculate (though I haven't seen anything definitive about the precise reasons). This doesn't entirely make sense, because the performance hit is also visible on unlock and other operations which don't involve locked instructions. But spinlock operations clearly swamp all the other pvops operations, even though I can't imagine that they're nearly as common (there's only a .05% increase in instructions executed). If I disable just the pv-spinlock calls, my tests show that pvops is identical to non-pvops performance on native (my measurements show that it is actually about .1% faster, but Xiaohui shows a .05% slowdown). Summary of results, averaging 10 runs of the "mmperf" test, using a no-pvops build as baseline: nopv Pv-nospin Pv-spin CPU cycles 100.00% 99.89% 102.18% instructions 100.00% 100.10% 100.15% CPI 100.00% 99.79% 102.03% cache ref 100.00% 100.84% 100.28% cache miss 100.00% 90.47% 88.56% cache miss rate 100.00% 89.72% 88.31% branches 100.00% 99.93% 100.04% branch miss 100.00% 103.66% 107.72% branch miss rt 100.00% 103.73% 107.67% wallclock 100.00% 99.90% 102.20% The clear effect here is that the 2% increase in CPI is directly reflected in the final wallclock time. (The other interesting effect is that the more ops are out of line calls via pvops, the lower the cache access and miss rates. Not too surprising, but it suggests that the non-pvops kernel is over-inlined. On the flipside, the branch misses go up correspondingly...) So, what's the fix? Paravirt patching turns all the pvops calls into direct calls, so _spin_lock etc do end up having direct calls. For example, the compiler generated code for paravirtualized _spin_lock is: <_spin_lock+0>: mov %gs:0xb4c8,%rax <_spin_lock+9>: incl 0xffffffffffffe044(%rax) <_spin_lock+15>: callq *0xffffffff805a5b30 <_spin_lock+22>: retq The indirect call will get patched to: <_spin_lock+0>: mov %gs:0xb4c8,%rax <_spin_lock+9>: incl 0xffffffffffffe044(%rax) <_spin_lock+15>: callq <__ticket_spin_lock> <_spin_lock+20>: nop; nop /* or whatever 2-byte nop */ <_spin_lock+22>: retq One possibility is to inline _spin_lock, etc, when building an optimised kernel (ie, when there's no spinlock/preempt instrumentation/debugging enabled). That will remove the outer call/return pair, returning the instruction stream to a single call/return, which will presumably execute the same as the non-pvops case. The downsides arel 1) it will replicate the preempt_disable/enable code at eack lock/unlock callsite; this code is fairly small, but not nothing; and 2) the spinlock definitions are already a very heavily tangled mass of #ifdefs and other preprocessor magic, and making any changes will be non-trivial. The other obvious answer is to disable pv-spinlocks. Making them a separate config option is fairly easy, and it would be trivial to enable them only when Xen is enabled (as the only non-default user). But it doesn't really address the common case of a distro build which is going to have Xen support enabled, and leaves the open question of whether the native performance cost of pv-spinlocks is worth the performance improvement on a loaded Xen system (10% saving of overall system CPU when guests block rather than spin). Still it is a reasonable short-term workaround. [ Impact: fix pvops performance regression when running native ] Analysed-by: "Xin Xiaohui" <xiaohui.xin@intel.com> Analysed-by: "Li Xin" <xin.li@intel.com> Analysed-by: "Nakajima Jun" <jun.nakajima@intel.com> Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Acked-by: H. Peter Anvin <hpa@zytor.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Xen-devel <xen-devel@lists.xensource.com> LKML-Reference: <4A0B62F7.5030802@goop.org> [ fixed the help text ] Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-14 08:16:55 +08:00
locking/pvqspinlock: Collect slowpath lock statistics This patch enables the accumulation of kicking and waiting related PV qspinlock statistics when the new QUEUED_LOCK_STAT configuration option is selected. It also enables the collection of data which enable us to calculate the kicking and wakeup latencies which have a heavy dependency on the CPUs being used. The statistical counters are per-cpu variables to minimize the performance overhead in their updates. These counters are exported via the debugfs filesystem under the qlockstat directory. When the corresponding debugfs files are read, summation and computing of the required data are then performed. The measured latencies for different CPUs are: CPU Wakeup Kicking --- ------ ------- Haswell-EX 63.6us 7.4us Westmere-EX 67.6us 9.3us The measured latencies varied a bit from run-to-run. The wakeup latency is much higher than the kicking latency. A sample of statistical counters after system bootup (with vCPU overcommit) was: pv_hash_hops=1.00 pv_kick_unlock=1148 pv_kick_wake=1146 pv_latency_kick=11040 pv_latency_wake=194840 pv_spurious_wakeup=7 pv_wait_again=4 pv_wait_head=23 pv_wait_node=1129 Signed-off-by: Waiman Long <Waiman.Long@hpe.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Douglas Hatch <doug.hatch@hpe.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Scott J Norton <scott.norton@hpe.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1447114167-47185-6-git-send-email-Waiman.Long@hpe.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-10 08:09:25 +08:00
config QUEUED_LOCK_STAT
bool "Paravirt queued spinlock statistics"
depends on PARAVIRT_SPINLOCKS && DEBUG_FS && QUEUED_SPINLOCKS
---help---
Enable the collection of statistical data on the slowpath
behavior of paravirtualized queued spinlocks and report
them on debugfs.
source "arch/x86/xen/Kconfig"
config KVM_GUEST
bool "KVM Guest support (including kvmclock)"
depends on PARAVIRT
select PARAVIRT_CLOCK
default y
---help---
This option enables various optimizations for running under the KVM
hypervisor. It includes a paravirtualized clock, so that instead
of relying on a PIT (or probably other) emulation by the
underlying device model, the host provides the guest with
timing infrastructure such as time of day, and system time
config KVM_DEBUG_FS
bool "Enable debug information for KVM Guests in debugfs"
depends on KVM_GUEST && DEBUG_FS
default n
---help---
This option enables collection of various statistics for KVM guest.
Statistics are displayed in debugfs filesystem. Enabling this option
may incur significant overhead.
source "arch/x86/lguest/Kconfig"
config PARAVIRT_TIME_ACCOUNTING
bool "Paravirtual steal time accounting"
depends on PARAVIRT
default n
---help---
Select this option to enable fine granularity task steal time
accounting. Time spent executing other tasks in parallel with
the current vCPU is discounted from the vCPU power. To account for
that, there can be a small performance impact.
If in doubt, say N here.
config PARAVIRT_CLOCK
bool
endif #HYPERVISOR_GUEST
config NO_BOOTMEM
def_bool y
source "arch/x86/Kconfig.cpu"
config HPET_TIMER
def_bool X86_64
prompt "HPET Timer Support" if X86_32
---help---
Use the IA-PC HPET (High Precision Event Timer) to manage
time in preference to the PIT and RTC, if a HPET is
present.
HPET is the next generation timer replacing legacy 8254s.
The HPET provides a stable time base on SMP
systems, unlike the TSC, but it is more expensive to access,
as it is off-chip. The interface used is documented
in the HPET spec, revision 1.
You can safely choose Y here. However, HPET will only be
activated if the platform and the BIOS support this feature.
Otherwise the 8254 will be used for timing services.
Choose N to continue using the legacy 8254 timer.
config HPET_EMULATE_RTC
def_bool y
depends on HPET_TIMER && (RTC=y || RTC=m || RTC_DRV_CMOS=m || RTC_DRV_CMOS=y)
config APB_TIMER
def_bool y if X86_INTEL_MID
prompt "Intel MID APB Timer Support" if X86_INTEL_MID
select DW_APB_TIMER
depends on X86_INTEL_MID && SFI
help
APB timer is the replacement for 8254, HPET on X86 MID platforms.
The APBT provides a stable time base on SMP
systems, unlike the TSC, but it is more expensive to access,
as it is off-chip. APB timers are always running regardless of CPU
C states, they are used as per CPU clockevent device when possible.
# Mark as expert because too many people got it wrong.
# The code disables itself when not needed.
config DMI
default y
select DMI_SCAN_MACHINE_NON_EFI_FALLBACK
bool "Enable DMI scanning" if EXPERT
---help---
Enabled scanning of DMI to identify machine quirks. Say Y
here unless you have verified that your setup is not
affected by entries in the DMI blacklist. Required by PNP
BIOS code.
config GART_IOMMU
bool "Old AMD GART IOMMU support"
select SWIOTLB
depends on X86_64 && PCI && AMD_NB
---help---
Provides a driver for older AMD Athlon64/Opteron/Turion/Sempron
GART based hardware IOMMUs.
The GART supports full DMA access for devices with 32-bit access
limitations, on systems with more than 3 GB. This is usually needed
for USB, sound, many IDE/SATA chipsets and some other devices.
Newer systems typically have a modern AMD IOMMU, supported via
the CONFIG_AMD_IOMMU=y config option.
In normal configurations this driver is only active when needed:
there's more than 3 GB of memory and the system contains a
32-bit limited device.
If unsure, say Y.
config CALGARY_IOMMU
bool "IBM Calgary IOMMU support"
select SWIOTLB
depends on X86_64 && PCI
---help---
Support for hardware IOMMUs in IBM's xSeries x366 and x460
systems. Needed to run systems with more than 3GB of memory
properly with 32-bit PCI devices that do not support DAC
(Double Address Cycle). Calgary also supports bus level
isolation, where all DMAs pass through the IOMMU. This
prevents them from going anywhere except their intended
destination. This catches hard-to-find kernel bugs and
mis-behaving drivers and devices that do not use the DMA-API
properly to set up their DMA buffers. The IOMMU can be
turned off at boot time with the iommu=off parameter.
Normally the kernel will make the right choice by itself.
If unsure, say Y.
config CALGARY_IOMMU_ENABLED_BY_DEFAULT
def_bool y
prompt "Should Calgary be enabled by default?"
depends on CALGARY_IOMMU
---help---
Should Calgary be enabled by default? if you choose 'y', Calgary
will be used (if it exists). If you choose 'n', Calgary will not be
used even if it exists. If you choose 'n' and would like to use
Calgary anyway, pass 'iommu=calgary' on the kernel command line.
If unsure, say Y.
# need this always selected by IOMMU for the VIA workaround
config SWIOTLB
def_bool y if X86_64
---help---
Support for software bounce buffers used on x86-64 systems
which don't have a hardware IOMMU. Using this PCI devices
which can only access 32-bits of memory can be used on systems
with more than 3 GB of memory.
If unsure, say Y.
config IOMMU_HELPER
def_bool y
depends on CALGARY_IOMMU || GART_IOMMU || SWIOTLB || AMD_IOMMU
config MAXSMP
bool "Enable Maximum number of SMP Processors and NUMA Nodes"
depends on X86_64 && SMP && DEBUG_KERNEL
select CPUMASK_OFFSTACK
---help---
Enable maximum number of CPUS and NUMA Nodes for this architecture.
If unsure, say N.
config NR_CPUS
int "Maximum number of CPUs" if SMP && !MAXSMP
range 2 8 if SMP && X86_32 && !X86_BIGSMP
range 2 512 if SMP && !MAXSMP && !CPUMASK_OFFSTACK
range 2 8192 if SMP && !MAXSMP && CPUMASK_OFFSTACK && X86_64
default "1" if !SMP
default "8192" if MAXSMP
default "32" if SMP && X86_BIGSMP
default "8" if SMP && X86_32
default "64" if SMP
---help---
This allows you to specify the maximum number of CPUs which this
kernel will support. If CPUMASK_OFFSTACK is enabled, the maximum
supported value is 8192, otherwise the maximum value is 512. The
minimum value which makes sense is 2.
This is purely to save memory - each supported CPU adds
approximately eight kilobytes to the kernel image.
config SCHED_SMT
bool "SMT (Hyperthreading) scheduler support"
depends on SMP
---help---
SMT scheduler support improves the CPU scheduler's decision making
when dealing with Intel Pentium 4 chips with HyperThreading at a
cost of slightly increased overhead in some places. If unsure say
N here.
config SCHED_MC
def_bool y
prompt "Multi-core scheduler support"
depends on SMP
---help---
Multi-core scheduler support improves the CPU scheduler's decision
making when dealing with multi-core CPU chips at a cost of slightly
increased overhead in some places. If unsure say N here.
source "kernel/Kconfig.preempt"
config UP_LATE_INIT
def_bool y
depends on !SMP && X86_LOCAL_APIC
config X86_UP_APIC
bool "Local APIC support on uniprocessors" if !PCI_MSI
default PCI_MSI
depends on X86_32 && !SMP && !X86_32_NON_STANDARD
---help---
A local APIC (Advanced Programmable Interrupt Controller) is an
integrated interrupt controller in the CPU. If you have a single-CPU
system which has a processor with a local APIC, you can say Y here to
enable and use it. If you say Y here even though your machine doesn't
have a local APIC, then the kernel will still run with no slowdown at
all. The local APIC supports CPU-generated self-interrupts (timer,
performance counters), and the NMI watchdog which detects hard
lockups.
config X86_UP_IOAPIC
bool "IO-APIC support on uniprocessors"
depends on X86_UP_APIC
---help---
An IO-APIC (I/O Advanced Programmable Interrupt Controller) is an
SMP-capable replacement for PC-style interrupt controllers. Most
SMP systems and many recent uniprocessor systems have one.
If you have a single-CPU system with an IO-APIC, you can say Y here
to use it. If you say Y here even though your machine doesn't have
an IO-APIC, then the kernel will still run with no slowdown at all.
config X86_LOCAL_APIC
def_bool y
x86, build, pci: Fix PCI_MSI build on !SMP Commit ebd97be635 ('PCI: remove ARCH_SUPPORTS_MSI kconfig option') removed the ARCH_SUPPORTS_MSI option which architectures could select to indicate that they support MSI. Now, all architectures are supposed to build fine when MSI support is enabled: instead of having the architecture tell *when* MSI support can be used, it's up to the architecture code to ensure that MSI support can be enabled. On x86, commit ebd97be635 removed the following line: select ARCH_SUPPORTS_MSI if (X86_LOCAL_APIC && X86_IO_APIC) Which meant that MSI support was only available when the local APIC and I/O APIC were enabled. While this is always true on SMP or x86-64, it is not necessarily the case on i386 !SMP. The below patch makes sure that the local APIC and I/O APIC support is always enabled when MSI support is enabled. To do so, it: * Ensures the X86_UP_APIC option is not visible when PCI_MSI is enabled. This is the option that allows, on UP machines, to enable or not the APIC support. It is already not visible on SMP systems, or x86-64 systems, for example. We're simply also making it invisible on i386 MSI systems. * Ensures that the X86_LOCAL_APIC and X86_IO_APIC options are 'y' when PCI_MSI is enabled. Notice that this change requires a change in drivers/iommu/Kconfig to avoid a recursive Kconfig dependencey. The AMD_IOMMU option selects PCI_MSI, but was depending on X86_IO_APIC. This dependency is no longer needed: as soon as PCI_MSI is selected, the presence of X86_IO_APIC is guaranteed. Moreover, the AMD_IOMMU already depended on X86_64, which already guaranteed that X86_IO_APIC was enabled, so this dependency was anyway redundant. Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Link: http://lkml.kernel.org/r/1380794354-9079-1-git-send-email-thomas.petazzoni@free-electrons.com Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-10-03 17:59:14 +08:00
depends on X86_64 || SMP || X86_32_NON_STANDARD || X86_UP_APIC || PCI_MSI
select IRQ_DOMAIN_HIERARCHY
select PCI_MSI_IRQ_DOMAIN if PCI_MSI
config X86_IO_APIC
def_bool y
depends on X86_LOCAL_APIC || X86_UP_IOAPIC
config X86_REROUTE_FOR_BROKEN_BOOT_IRQS
bool "Reroute for broken boot IRQs"
depends on X86_IO_APIC
---help---
This option enables a workaround that fixes a source of
spurious interrupts. This is recommended when threaded
interrupt handling is used on systems where the generation of
superfluous "boot interrupts" cannot be disabled.
Some chipsets generate a legacy INTx "boot IRQ" when the IRQ
entry in the chipset's IO-APIC is masked (as, e.g. the RT
kernel does during interrupt handling). On chipsets where this
boot IRQ generation cannot be disabled, this workaround keeps
the original IRQ line masked so that only the equivalent "boot
IRQ" is delivered to the CPUs. The workaround also tells the
kernel to set up the IRQ handler on the boot IRQ line. In this
way only one interrupt is delivered to the kernel. Otherwise
the spurious second interrupt may cause the kernel to bring
down (vital) interrupt lines.
Only affects "broken" chipsets. Interrupt sharing may be
increased on these systems.
config X86_MCE
bool "Machine Check / overheating reporting"
select GENERIC_ALLOCATOR
default y
---help---
Machine Check support allows the processor to notify the
kernel if it detects a problem (e.g. overheating, data corruption).
The action the kernel takes depends on the severity of the problem,
ranging from warning messages to halting the machine.
x86, mce: use 64bit machine check code on 32bit The 64bit machine check code is in many ways much better than the 32bit machine check code: it is more specification compliant, is cleaner, only has a single code base versus one per CPU, has better infrastructure for recovery, has a cleaner way to communicate with user space etc. etc. Use the 64bit code for 32bit too. This is the second attempt to do this. There was one a couple of years ago to unify this code for 32bit and 64bit. Back then this ran into some trouble with K7s and was reverted. I believe this time the K7 problems (and some others) are addressed. I went over the old handlers and was very careful to retain all quirks. But of course this needs a lot of testing on old systems. On newer 64bit capable systems I don't expect much problems because they have been already tested with the 64bit kernel. I made this a CONFIG for now that still allows to select the old machine check code. This is mostly to make testing easier, if someone runs into a problem we can ask them to try with the CONFIG switched. The new code is default y for more coverage. Once there is confidence the 64bit code works well on older hardware too the CONFIG_X86_OLD_MCE and the associated code can be easily removed. This causes a behaviour change for 32bit installations. They now have to install the mcelog package to be able to log corrected machine checks. The 64bit machine check code only handles CPUs which support the standard Intel machine check architecture described in the IA32 SDM. The 32bit code has special support for some older CPUs which have non standard machine check architectures, in particular WinChip C3 and Intel P5. I made those a separate CONFIG option and kept them for now. The WinChip variant could be probably removed without too much pain, it doesn't really do anything interesting. P5 is also disabled by default (like it was before) because many motherboards have it miswired, but according to Alan Cox a few embedded setups use that one. Forward ported/heavily changed version of old patch, original patch included review/fixes from Thomas Gleixner, Bert Wesarg. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-04-29 01:07:31 +08:00
config X86_MCE_INTEL
def_bool y
prompt "Intel MCE features"
depends on X86_MCE && X86_LOCAL_APIC
---help---
Additional support for intel specific MCE features such as
the thermal monitor.
config X86_MCE_AMD
def_bool y
prompt "AMD MCE features"
depends on X86_MCE && X86_LOCAL_APIC
---help---
Additional support for AMD specific MCE features such as
the DRAM Error Threshold.
x86, mce: use 64bit machine check code on 32bit The 64bit machine check code is in many ways much better than the 32bit machine check code: it is more specification compliant, is cleaner, only has a single code base versus one per CPU, has better infrastructure for recovery, has a cleaner way to communicate with user space etc. etc. Use the 64bit code for 32bit too. This is the second attempt to do this. There was one a couple of years ago to unify this code for 32bit and 64bit. Back then this ran into some trouble with K7s and was reverted. I believe this time the K7 problems (and some others) are addressed. I went over the old handlers and was very careful to retain all quirks. But of course this needs a lot of testing on old systems. On newer 64bit capable systems I don't expect much problems because they have been already tested with the 64bit kernel. I made this a CONFIG for now that still allows to select the old machine check code. This is mostly to make testing easier, if someone runs into a problem we can ask them to try with the CONFIG switched. The new code is default y for more coverage. Once there is confidence the 64bit code works well on older hardware too the CONFIG_X86_OLD_MCE and the associated code can be easily removed. This causes a behaviour change for 32bit installations. They now have to install the mcelog package to be able to log corrected machine checks. The 64bit machine check code only handles CPUs which support the standard Intel machine check architecture described in the IA32 SDM. The 32bit code has special support for some older CPUs which have non standard machine check architectures, in particular WinChip C3 and Intel P5. I made those a separate CONFIG option and kept them for now. The WinChip variant could be probably removed without too much pain, it doesn't really do anything interesting. P5 is also disabled by default (like it was before) because many motherboards have it miswired, but according to Alan Cox a few embedded setups use that one. Forward ported/heavily changed version of old patch, original patch included review/fixes from Thomas Gleixner, Bert Wesarg. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-04-29 01:07:31 +08:00
config X86_ANCIENT_MCE
bool "Support for old Pentium 5 / WinChip machine checks"
depends on X86_32 && X86_MCE
---help---
Include support for machine check handling on old Pentium 5 or WinChip
systems. These typically need to be enabled explicitly on the command
line.
x86, mce: use 64bit machine check code on 32bit The 64bit machine check code is in many ways much better than the 32bit machine check code: it is more specification compliant, is cleaner, only has a single code base versus one per CPU, has better infrastructure for recovery, has a cleaner way to communicate with user space etc. etc. Use the 64bit code for 32bit too. This is the second attempt to do this. There was one a couple of years ago to unify this code for 32bit and 64bit. Back then this ran into some trouble with K7s and was reverted. I believe this time the K7 problems (and some others) are addressed. I went over the old handlers and was very careful to retain all quirks. But of course this needs a lot of testing on old systems. On newer 64bit capable systems I don't expect much problems because they have been already tested with the 64bit kernel. I made this a CONFIG for now that still allows to select the old machine check code. This is mostly to make testing easier, if someone runs into a problem we can ask them to try with the CONFIG switched. The new code is default y for more coverage. Once there is confidence the 64bit code works well on older hardware too the CONFIG_X86_OLD_MCE and the associated code can be easily removed. This causes a behaviour change for 32bit installations. They now have to install the mcelog package to be able to log corrected machine checks. The 64bit machine check code only handles CPUs which support the standard Intel machine check architecture described in the IA32 SDM. The 32bit code has special support for some older CPUs which have non standard machine check architectures, in particular WinChip C3 and Intel P5. I made those a separate CONFIG option and kept them for now. The WinChip variant could be probably removed without too much pain, it doesn't really do anything interesting. P5 is also disabled by default (like it was before) because many motherboards have it miswired, but according to Alan Cox a few embedded setups use that one. Forward ported/heavily changed version of old patch, original patch included review/fixes from Thomas Gleixner, Bert Wesarg. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-04-29 01:07:31 +08:00
config X86_MCE_THRESHOLD
depends on X86_MCE_AMD || X86_MCE_INTEL
def_bool y
config X86_MCE_INJECT
depends on X86_MCE
tristate "Machine check injector support"
---help---
Provide support for injecting machine checks for testing purposes.
If you don't know what a machine check is and you don't do kernel
QA it is safe to say n.
x86, mce: use 64bit machine check code on 32bit The 64bit machine check code is in many ways much better than the 32bit machine check code: it is more specification compliant, is cleaner, only has a single code base versus one per CPU, has better infrastructure for recovery, has a cleaner way to communicate with user space etc. etc. Use the 64bit code for 32bit too. This is the second attempt to do this. There was one a couple of years ago to unify this code for 32bit and 64bit. Back then this ran into some trouble with K7s and was reverted. I believe this time the K7 problems (and some others) are addressed. I went over the old handlers and was very careful to retain all quirks. But of course this needs a lot of testing on old systems. On newer 64bit capable systems I don't expect much problems because they have been already tested with the 64bit kernel. I made this a CONFIG for now that still allows to select the old machine check code. This is mostly to make testing easier, if someone runs into a problem we can ask them to try with the CONFIG switched. The new code is default y for more coverage. Once there is confidence the 64bit code works well on older hardware too the CONFIG_X86_OLD_MCE and the associated code can be easily removed. This causes a behaviour change for 32bit installations. They now have to install the mcelog package to be able to log corrected machine checks. The 64bit machine check code only handles CPUs which support the standard Intel machine check architecture described in the IA32 SDM. The 32bit code has special support for some older CPUs which have non standard machine check architectures, in particular WinChip C3 and Intel P5. I made those a separate CONFIG option and kept them for now. The WinChip variant could be probably removed without too much pain, it doesn't really do anything interesting. P5 is also disabled by default (like it was before) because many motherboards have it miswired, but according to Alan Cox a few embedded setups use that one. Forward ported/heavily changed version of old patch, original patch included review/fixes from Thomas Gleixner, Bert Wesarg. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-04-29 01:07:31 +08:00
config X86_THERMAL_VECTOR
def_bool y
depends on X86_MCE_INTEL
x86, mce: use 64bit machine check code on 32bit The 64bit machine check code is in many ways much better than the 32bit machine check code: it is more specification compliant, is cleaner, only has a single code base versus one per CPU, has better infrastructure for recovery, has a cleaner way to communicate with user space etc. etc. Use the 64bit code for 32bit too. This is the second attempt to do this. There was one a couple of years ago to unify this code for 32bit and 64bit. Back then this ran into some trouble with K7s and was reverted. I believe this time the K7 problems (and some others) are addressed. I went over the old handlers and was very careful to retain all quirks. But of course this needs a lot of testing on old systems. On newer 64bit capable systems I don't expect much problems because they have been already tested with the 64bit kernel. I made this a CONFIG for now that still allows to select the old machine check code. This is mostly to make testing easier, if someone runs into a problem we can ask them to try with the CONFIG switched. The new code is default y for more coverage. Once there is confidence the 64bit code works well on older hardware too the CONFIG_X86_OLD_MCE and the associated code can be easily removed. This causes a behaviour change for 32bit installations. They now have to install the mcelog package to be able to log corrected machine checks. The 64bit machine check code only handles CPUs which support the standard Intel machine check architecture described in the IA32 SDM. The 32bit code has special support for some older CPUs which have non standard machine check architectures, in particular WinChip C3 and Intel P5. I made those a separate CONFIG option and kept them for now. The WinChip variant could be probably removed without too much pain, it doesn't really do anything interesting. P5 is also disabled by default (like it was before) because many motherboards have it miswired, but according to Alan Cox a few embedded setups use that one. Forward ported/heavily changed version of old patch, original patch included review/fixes from Thomas Gleixner, Bert Wesarg. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-04-29 01:07:31 +08:00
config X86_LEGACY_VM86
bool "Legacy VM86 support"
default n
depends on X86_32
---help---
This option allows user programs to put the CPU into V8086
mode, which is an 80286-era approximation of 16-bit real mode.
Some very old versions of X and/or vbetool require this option
for user mode setting. Similarly, DOSEMU will use it if
available to accelerate real mode DOS programs. However, any
recent version of DOSEMU, X, or vbetool should be fully
functional even without kernel VM86 support, as they will all
fall back to software emulation. Nevertheless, if you are using
a 16-bit DOS program where 16-bit performance matters, vm86
mode might be faster than emulation and you might want to
enable this option.
Note that any app that works on a 64-bit kernel is unlikely to
need this option, as 64-bit kernels don't, and can't, support
V8086 mode. This option is also unrelated to 16-bit protected
mode and is not needed to run most 16-bit programs under Wine.
Enabling this option increases the complexity of the kernel
and slows down exception handling a tiny bit.
If unsure, say N here.
config VM86
bool
default X86_LEGACY_VM86
config X86_16BIT
bool "Enable support for 16-bit segments" if EXPERT
default y
depends on MODIFY_LDT_SYSCALL
---help---
This option is required by programs like Wine to run 16-bit
protected mode legacy code on x86 processors. Disabling
this option saves about 300 bytes on i386, or around 6K text
plus 16K runtime memory on x86-64,
config X86_ESPFIX32
def_bool y
depends on X86_16BIT && X86_32
config X86_ESPFIX64
def_bool y
depends on X86_16BIT && X86_64
config X86_VSYSCALL_EMULATION
bool "Enable vsyscall emulation" if EXPERT
default y
depends on X86_64
---help---
This enables emulation of the legacy vsyscall page. Disabling
it is roughly equivalent to booting with vsyscall=none, except
that it will also disable the helpful warning if a program
tries to use a vsyscall. With this option set to N, offending
programs will just segfault, citing addresses of the form
0xffffffffff600?00.
This option is required by many programs built before 2013, and
care should be used even with newer programs if set to N.
Disabling this option saves about 7K of kernel size and
possibly 4K of additional runtime pagetable memory.
config TOSHIBA
tristate "Toshiba Laptop support"
depends on X86_32
---help---
This adds a driver to safely access the System Management Mode of
the CPU on Toshiba portables with a genuine Toshiba BIOS. It does
not work on models with a Phoenix BIOS. The System Management Mode
is used to set the BIOS and power saving options on Toshiba portables.
For information on utilities to make use of this driver see the
Toshiba Linux utilities web site at:
<http://www.buzzard.org.uk/toshiba/>.
Say Y if you intend to run this kernel on a Toshiba portable.
Say N otherwise.
config I8K
tristate "Dell i8k legacy laptop support"
select HWMON
select SENSORS_DELL_SMM
---help---
This option enables legacy /proc/i8k userspace interface in hwmon
dell-smm-hwmon driver. Character file /proc/i8k reports bios version,
temperature and allows controlling fan speeds of Dell laptops via
System Management Mode. For old Dell laptops (like Dell Inspiron 8000)
it reports also power and hotkey status. For fan speed control is
needed userspace package i8kutils.
Say Y if you intend to run this kernel on old Dell laptops or want to
use userspace package i8kutils.
Say N otherwise.
config X86_REBOOTFIXUPS
bool "Enable X86 board specific fixups for reboot"
depends on X86_32
---help---
This enables chipset and/or board specific fixups to be done
in order to get reboot to work correctly. This is only needed on
some combinations of hardware and BIOS. The symptom, for which
this config is intended, is when reboot ends with a stalled/hung
system.
Currently, the only fixup is for the Geode machines using
CS5530A and CS5536 chipsets and the RDC R-321x SoC.
Say Y if you want to enable the fixup. Currently, it's safe to
enable this option even if you don't need it.
Say N otherwise.
config MICROCODE
bool "CPU microcode loading support"
default y
depends on CPU_SUP_AMD || CPU_SUP_INTEL
select FW_LOADER
---help---
If you say Y here, you will be able to update the microcode on
Intel and AMD processors. The Intel support is for the IA32 family,
e.g. Pentium Pro, Pentium II, Pentium III, Pentium 4, Xeon etc. The
AMD support is for families 0x10 and later. You will obviously need
the actual microcode binary data itself which is not shipped with
the Linux kernel.
The preferred method to load microcode from a detached initrd is described
in Documentation/x86/early-microcode.txt. For that you need to enable
CONFIG_BLK_DEV_INITRD in order for the loader to be able to scan the
initrd for microcode blobs.
In addition, you can build-in the microcode into the kernel. For that you
need to enable FIRMWARE_IN_KERNEL and add the vendor-supplied microcode
to the CONFIG_EXTRA_FIRMWARE config option.
config MICROCODE_INTEL
bool "Intel microcode loading support"
depends on MICROCODE
default MICROCODE
select FW_LOADER
---help---
This options enables microcode patch loading support for Intel
processors.
For the current Intel microcode data package go to
<https://downloadcenter.intel.com> and search for
'Linux Processor Microcode Data File'.
config MICROCODE_AMD
bool "AMD microcode loading support"
depends on MICROCODE
select FW_LOADER
---help---
If you select this option, microcode patch loading support for AMD
processors will be enabled.
config MICROCODE_OLD_INTERFACE
def_bool y
depends on MICROCODE
perf/x86/amd/power: Add AMD accumulated power reporting mechanism Introduce an AMD accumlated power reporting mechanism for the Family 15h, Model 60h processor that can be used to calculate the average power consumed by a processor during a measurement interval. The feature support is indicated by CPUID Fn8000_0007_EDX[12]. This feature will be implemented both in hwmon and perf. The current design provides one event to report per package/processor power consumption by counting each compute unit power value. Here the gory details of how the computation is done: * Tsample: compute unit power accumulator sample period * Tref: the PTSC counter period (PTSC: performance timestamp counter) * N: the ratio of compute unit power accumulator sample period to the PTSC period * Jmax: max compute unit accumulated power which is indicated by MSR_C001007b[MaxCpuSwPwrAcc] * Jx/Jy: compute unit accumulated power which is indicated by MSR_C001007a[CpuSwPwrAcc] * Tx/Ty: the value of performance timestamp counter which is indicated by CU_PTSC MSR_C0010280[PTSC] * PwrCPUave: CPU average power i. Determine the ratio of Tsample to Tref by executing CPUID Fn8000_0007. N = value of CPUID Fn8000_0007_ECX[CpuPwrSampleTimeRatio[15:0]]. ii. Read the full range of the cumulative energy value from the new MSR MaxCpuSwPwrAcc. Jmax = value returned. iii. At time x, software reads CpuSwPwrAcc and samples the PTSC. Jx = value read from CpuSwPwrAcc and Tx = value read from PTSC. iv. At time y, software reads CpuSwPwrAcc and samples the PTSC. Jy = value read from CpuSwPwrAcc and Ty = value read from PTSC. v. Calculate the average power consumption for a compute unit over time period (y-x). Unit of result is uWatt: if (Jy < Jx) // Rollover has occurred Jdelta = (Jy + Jmax) - Jx else Jdelta = Jy - Jx PwrCPUave = N * Jdelta * 1000 / (Ty - Tx) Simple example: root@hr-zp:/home/ray/tip# ./tools/perf/perf stat -a -e 'power/power-pkg/' make -j4 CHK include/config/kernel.release CHK include/generated/uapi/linux/version.h CHK include/generated/utsrelease.h CHK include/generated/timeconst.h CHK include/generated/bounds.h CHK include/generated/asm-offsets.h CALL scripts/checksyscalls.sh CHK include/generated/compile.h SKIPPED include/generated/compile.h Building modules, stage 2. Kernel: arch/x86/boot/bzImage is ready (#40) MODPOST 4225 modules Performance counter stats for 'system wide': 183.44 mWatts power/power-pkg/ 341.837270111 seconds time elapsed root@hr-zp:/home/ray/tip# ./tools/perf/perf stat -a -e 'power/power-pkg/' sleep 10 Performance counter stats for 'system wide': 0.18 mWatts power/power-pkg/ 10.012551815 seconds time elapsed Suggested-by: Peter Zijlstra <peterz@infradead.org> Suggested-by: Ingo Molnar <mingo@kernel.org> Suggested-by: Borislav Petkov <bp@suse.de> Signed-off-by: Huang Rui <ray.huang@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: David Ahern <dsahern@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: jacob.w.shin@gmail.com Link: http://lkml.kernel.org/r/1457502306-2559-1-git-send-email-ray.huang@amd.com [ Fixed the modular build. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-09 13:45:06 +08:00
config PERF_EVENTS_AMD_POWER
depends on PERF_EVENTS && CPU_SUP_AMD
tristate "AMD Processor Power Reporting Mechanism"
---help---
Provide power reporting mechanism support for AMD processors.
Currently, it leverages X86_FEATURE_ACC_POWER
(CPUID Fn8000_0007_EDX[12]) interface to calculate the
average power consumption on Family 15h processors.
config X86_MSR
tristate "/dev/cpu/*/msr - Model-specific register support"
---help---
This device gives privileged processes access to the x86
Model-Specific Registers (MSRs). It is a character device with
major 202 and minors 0 to 31 for /dev/cpu/0/msr to /dev/cpu/31/msr.
MSR accesses are directed to a specific CPU on multi-processor
systems.
config X86_CPUID
tristate "/dev/cpu/*/cpuid - CPU information support"
---help---
This device gives processes access to the x86 CPUID instruction to
be executed on a specific processor. It is a character device
with major 203 and minors 0 to 31 for /dev/cpu/0/cpuid to
/dev/cpu/31/cpuid.
choice
prompt "High Memory Support"
default HIGHMEM4G
depends on X86_32
config NOHIGHMEM
bool "off"
---help---
Linux can use up to 64 Gigabytes of physical memory on x86 systems.
However, the address space of 32-bit x86 processors is only 4
Gigabytes large. That means that, if you have a large amount of
physical memory, not all of it can be "permanently mapped" by the
kernel. The physical memory that's not permanently mapped is called
"high memory".
If you are compiling a kernel which will never run on a machine with
more than 1 Gigabyte total physical RAM, answer "off" here (default
choice and suitable for most users). This will result in a "3GB/1GB"
split: 3GB are mapped so that each process sees a 3GB virtual memory
space and the remaining part of the 4GB virtual memory space is used
by the kernel to permanently map as much physical memory as
possible.
If the machine has between 1 and 4 Gigabytes physical RAM, then
answer "4GB" here.
If more than 4 Gigabytes is used then answer "64GB" here. This
selection turns Intel PAE (Physical Address Extension) mode on.
PAE implements 3-level paging on IA32 processors. PAE is fully
supported by Linux, PAE mode is implemented on all recent Intel
processors (Pentium Pro and better). NOTE: If you say "64GB" here,
then the kernel will not boot on CPUs that don't support PAE!
The actual amount of total physical memory will either be
auto detected or can be forced by using a kernel command line option
such as "mem=256M". (Try "man bootparam" or see the documentation of
your boot loader (lilo or loadlin) about how to pass options to the
kernel at boot time.)
If unsure, say "off".
config HIGHMEM4G
bool "4GB"
---help---
Select this if you have a 32-bit processor and between 1 and 4
gigabytes of physical RAM.
config HIGHMEM64G
bool "64GB"
depends on !M486
select X86_PAE
---help---
Select this if you have a 32-bit processor and more than 4
gigabytes of physical RAM.
endchoice
choice
prompt "Memory split" if EXPERT
default VMSPLIT_3G
depends on X86_32
---help---
Select the desired split between kernel and user memory.
If the address range available to the kernel is less than the
physical memory installed, the remaining memory will be available
as "high memory". Accessing high memory is a little more costly
than low memory, as it needs to be mapped into the kernel first.
Note that increasing the kernel address space limits the range
available to user programs, making the address space there
tighter. Selecting anything other than the default 3G/1G split
will also likely make your kernel incompatible with binary-only
kernel modules.
If you are not absolutely sure what you are doing, leave this
option alone!
config VMSPLIT_3G
bool "3G/1G user/kernel split"
config VMSPLIT_3G_OPT
depends on !X86_PAE
bool "3G/1G user/kernel split (for full 1G low memory)"
config VMSPLIT_2G
bool "2G/2G user/kernel split"
config VMSPLIT_2G_OPT
depends on !X86_PAE
bool "2G/2G user/kernel split (for full 2G low memory)"
config VMSPLIT_1G
bool "1G/3G user/kernel split"
endchoice
config PAGE_OFFSET
hex
default 0xB0000000 if VMSPLIT_3G_OPT
default 0x80000000 if VMSPLIT_2G
default 0x78000000 if VMSPLIT_2G_OPT
default 0x40000000 if VMSPLIT_1G
default 0xC0000000
depends on X86_32
config HIGHMEM
def_bool y
depends on X86_32 && (HIGHMEM64G || HIGHMEM4G)
config X86_PAE
bool "PAE (Physical Address Extension) Support"
depends on X86_32 && !HIGHMEM4G
select SWIOTLB
---help---
PAE is required for NX support, and furthermore enables
larger swapspace support for non-overcommit purposes. It
has the cost of more pagetable lookup overhead, and also
consumes more pagetable space per process.
config ARCH_PHYS_ADDR_T_64BIT
def_bool y
depends on X86_64 || X86_PAE
config ARCH_DMA_ADDR_T_64BIT
def_bool y
depends on X86_64 || HIGHMEM64G
config X86_DIRECT_GBPAGES
x86/mm: Simplify enabling direct_gbpages direct_gbpages can be force enabled as an early parameter but not really have taken effect when DEBUG_PAGEALLOC or KMEMCHECK is enabled. You can also enable direct_gbpages right now if you have an x86_64 architecture but your CPU doesn't really have support for this feature. In both cases PG_LEVEL_1G won't actually be enabled but direct_gbpages is used in other areas under the assumptions that PG_LEVEL_1G was set. Fix this by putting together all requirements which make this feature sensible to enable under, and only enable both finally flipping on PG_LEVEL_1G and leaving PG_LEVEL_1G set when this is true. We only enable this feature then to be possible on sensible builds defined by the new ENABLE_DIRECT_GBPAGES. If the CPU has support for it you can either enable this by using the DIRECT_GBPAGES option or using the early kernel parameter. If a platform had support for this you can always force disable it as well. Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Dexuan Cui <decui@microsoft.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: JBeulich@suse.com Cc: Jan Beulich <JBeulich@suse.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Lindgren <tony@atomide.com> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: julia.lawall@lip6.fr Link: http://lkml.kernel.org/r/1425518654-3403-3-git-send-email-mcgrof@do-not-panic.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-05 09:24:12 +08:00
def_bool y
depends on X86_64 && !DEBUG_PAGEALLOC && !KMEMCHECK
---help---
Certain kernel features effectively disable kernel
linear 1 GB mappings (even if the CPU otherwise
supports them), so don't confuse the user by printing
that we have them enabled.
# Common NUMA Features
config NUMA
bool "Numa Memory Allocation and Scheduler Support"
depends on SMP
depends on X86_64 || (X86_32 && HIGHMEM64G && X86_BIGSMP)
default y if X86_BIGSMP
---help---
Enable NUMA (Non Uniform Memory Access) support.
The kernel will try to allocate memory used by a CPU on the
local memory controller of the CPU and add some more
NUMA awareness to the kernel.
For 64-bit this is recommended if the system is Intel Core i7
(or later), AMD Opteron, or EM64T NUMA.
For 32-bit this is only needed if you boot a 32-bit
kernel on a 64-bit NUMA platform.
Otherwise, you should say N.
config AMD_NUMA
def_bool y
prompt "Old style AMD Opteron NUMA detection"
depends on X86_64 && NUMA && PCI
---help---
Enable AMD NUMA node topology detection. You should say Y here if
you have a multi processor AMD system. This uses an old method to
read the NUMA configuration directly from the builtin Northbridge
of Opteron. It is recommended to use X86_64_ACPI_NUMA instead,
which also takes priority if both are compiled in.
config X86_64_ACPI_NUMA
def_bool y
prompt "ACPI NUMA detection"
depends on X86_64 && NUMA && ACPI && PCI
select ACPI_NUMA
---help---
Enable ACPI SRAT based node topology detection.
# Some NUMA nodes have memory ranges that span
# other nodes. Even though a pfn is valid and
# between a node's start and end pfns, it may not
# reside on that node. See memmap_init_zone()
# for details.
config NODES_SPAN_OTHER_NODES
def_bool y
depends on X86_64_ACPI_NUMA
config NUMA_EMU
bool "NUMA emulation"
depends on NUMA
---help---
Enable NUMA emulation. A flat machine will be split
into virtual nodes when booted with "numa=fake=N", where N is the
number of nodes. This is only useful for debugging.
config NODES_SHIFT
int "Maximum NUMA Nodes (as a power of 2)" if !MAXSMP
range 1 10
default "10" if MAXSMP
default "6" if X86_64
default "3"
depends on NEED_MULTIPLE_NODES
---help---
Specify the maximum number of NUMA Nodes available on the target
system. Increases memory reserved to accommodate various tables.
config ARCH_HAVE_MEMORY_PRESENT
def_bool y
depends on X86_32 && DISCONTIGMEM
config NEED_NODE_MEMMAP_SIZE
def_bool y
depends on X86_32 && (DISCONTIGMEM || SPARSEMEM)
config ARCH_FLATMEM_ENABLE
def_bool y
depends on X86_32 && !NUMA
config ARCH_DISCONTIGMEM_ENABLE
def_bool y
x86: 64-bit, make sparsemem vmemmap the only memory model Use sparsemem as the only memory model for UP, SMP and NUMA. Measurements indicate that DISCONTIGMEM has a higher overhead than sparsemem. And FLATMEMs benefits are minimal. So I think its best to simply standardize on sparsemem. Results of page allocator tests (test can be had via git from slab git tree branch tests) Measurements in cycle counts. 1000 allocations were performed and then the average cycle count was calculated. Order FlatMem Discontig SparseMem 0 639 665 641 1 567 647 593 2 679 774 692 3 763 967 781 4 961 1501 962 5 1356 2344 1392 6 2224 3982 2336 7 4869 7225 5074 8 12500 14048 12732 9 27926 28223 28165 10 58578 58714 58682 (Note that FlatMem is an SMP config and the rest NUMA configurations) Memory use: SMP Sparsemem ------------- Kernel size: text data bss dec hex filename 3849268 397739 1264856 5511863 541ab7 vmlinux total used free shared buffers cached Mem: 8242252 41164 8201088 0 352 11512 -/+ buffers/cache: 29300 8212952 Swap: 9775512 0 9775512 SMP Flatmem ----------- Kernel size: text data bss dec hex filename 3844612 397739 1264536 5506887 540747 vmlinux So 4.5k growth in text size vs. FLATMEM. total used free shared buffers cached Mem: 8244052 40544 8203508 0 352 11484 -/+ buffers/cache: 28708 8215344 2k growth in overall memory use after boot. NUMA discontig: text data bss dec hex filename 3888124 470659 1276504 5635287 55fcd7 vmlinux total used free shared buffers cached Mem: 8256256 56908 8199348 0 352 11496 -/+ buffers/cache: 45060 8211196 Swap: 9775512 0 9775512 NUMA sparse: text data bss dec hex filename 3896428 470659 1276824 5643911 561e87 vmlinux 8k text growth. Given that we fully inline virt_to_page and friends now that is rather good. total used free shared buffers cached Mem: 8264720 57240 8207480 0 352 11516 -/+ buffers/cache: 45372 8219348 Swap: 9775512 0 9775512 The total available memory is increased by 8k. This patch makes sparsemem the default and removes discontig and flatmem support from x86. [ akpm@linux-foundation.org: allnoconfig build fix ] Acked-by: Andi Kleen <ak@suse.de> Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 20:30:47 +08:00
depends on NUMA && X86_32
config ARCH_DISCONTIGMEM_DEFAULT
def_bool y
x86: 64-bit, make sparsemem vmemmap the only memory model Use sparsemem as the only memory model for UP, SMP and NUMA. Measurements indicate that DISCONTIGMEM has a higher overhead than sparsemem. And FLATMEMs benefits are minimal. So I think its best to simply standardize on sparsemem. Results of page allocator tests (test can be had via git from slab git tree branch tests) Measurements in cycle counts. 1000 allocations were performed and then the average cycle count was calculated. Order FlatMem Discontig SparseMem 0 639 665 641 1 567 647 593 2 679 774 692 3 763 967 781 4 961 1501 962 5 1356 2344 1392 6 2224 3982 2336 7 4869 7225 5074 8 12500 14048 12732 9 27926 28223 28165 10 58578 58714 58682 (Note that FlatMem is an SMP config and the rest NUMA configurations) Memory use: SMP Sparsemem ------------- Kernel size: text data bss dec hex filename 3849268 397739 1264856 5511863 541ab7 vmlinux total used free shared buffers cached Mem: 8242252 41164 8201088 0 352 11512 -/+ buffers/cache: 29300 8212952 Swap: 9775512 0 9775512 SMP Flatmem ----------- Kernel size: text data bss dec hex filename 3844612 397739 1264536 5506887 540747 vmlinux So 4.5k growth in text size vs. FLATMEM. total used free shared buffers cached Mem: 8244052 40544 8203508 0 352 11484 -/+ buffers/cache: 28708 8215344 2k growth in overall memory use after boot. NUMA discontig: text data bss dec hex filename 3888124 470659 1276504 5635287 55fcd7 vmlinux total used free shared buffers cached Mem: 8256256 56908 8199348 0 352 11496 -/+ buffers/cache: 45060 8211196 Swap: 9775512 0 9775512 NUMA sparse: text data bss dec hex filename 3896428 470659 1276824 5643911 561e87 vmlinux 8k text growth. Given that we fully inline virt_to_page and friends now that is rather good. total used free shared buffers cached Mem: 8264720 57240 8207480 0 352 11516 -/+ buffers/cache: 45372 8219348 Swap: 9775512 0 9775512 The total available memory is increased by 8k. This patch makes sparsemem the default and removes discontig and flatmem support from x86. [ akpm@linux-foundation.org: allnoconfig build fix ] Acked-by: Andi Kleen <ak@suse.de> Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 20:30:47 +08:00
depends on NUMA && X86_32
config ARCH_SPARSEMEM_ENABLE
def_bool y
depends on X86_64 || NUMA || X86_32 || X86_32_NON_STANDARD
select SPARSEMEM_STATIC if X86_32
select SPARSEMEM_VMEMMAP_ENABLE if X86_64
config ARCH_SPARSEMEM_DEFAULT
def_bool y
depends on X86_64
config ARCH_SELECT_MEMORY_MODEL
def_bool y
x86: 64-bit, make sparsemem vmemmap the only memory model Use sparsemem as the only memory model for UP, SMP and NUMA. Measurements indicate that DISCONTIGMEM has a higher overhead than sparsemem. And FLATMEMs benefits are minimal. So I think its best to simply standardize on sparsemem. Results of page allocator tests (test can be had via git from slab git tree branch tests) Measurements in cycle counts. 1000 allocations were performed and then the average cycle count was calculated. Order FlatMem Discontig SparseMem 0 639 665 641 1 567 647 593 2 679 774 692 3 763 967 781 4 961 1501 962 5 1356 2344 1392 6 2224 3982 2336 7 4869 7225 5074 8 12500 14048 12732 9 27926 28223 28165 10 58578 58714 58682 (Note that FlatMem is an SMP config and the rest NUMA configurations) Memory use: SMP Sparsemem ------------- Kernel size: text data bss dec hex filename 3849268 397739 1264856 5511863 541ab7 vmlinux total used free shared buffers cached Mem: 8242252 41164 8201088 0 352 11512 -/+ buffers/cache: 29300 8212952 Swap: 9775512 0 9775512 SMP Flatmem ----------- Kernel size: text data bss dec hex filename 3844612 397739 1264536 5506887 540747 vmlinux So 4.5k growth in text size vs. FLATMEM. total used free shared buffers cached Mem: 8244052 40544 8203508 0 352 11484 -/+ buffers/cache: 28708 8215344 2k growth in overall memory use after boot. NUMA discontig: text data bss dec hex filename 3888124 470659 1276504 5635287 55fcd7 vmlinux total used free shared buffers cached Mem: 8256256 56908 8199348 0 352 11496 -/+ buffers/cache: 45060 8211196 Swap: 9775512 0 9775512 NUMA sparse: text data bss dec hex filename 3896428 470659 1276824 5643911 561e87 vmlinux 8k text growth. Given that we fully inline virt_to_page and friends now that is rather good. total used free shared buffers cached Mem: 8264720 57240 8207480 0 352 11516 -/+ buffers/cache: 45372 8219348 Swap: 9775512 0 9775512 The total available memory is increased by 8k. This patch makes sparsemem the default and removes discontig and flatmem support from x86. [ akpm@linux-foundation.org: allnoconfig build fix ] Acked-by: Andi Kleen <ak@suse.de> Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 20:30:47 +08:00
depends on ARCH_SPARSEMEM_ENABLE
config ARCH_MEMORY_PROBE
2013-07-20 01:47:48 +08:00
bool "Enable sysfs memory/probe interface"
depends on X86_64 && MEMORY_HOTPLUG
2013-07-20 01:47:48 +08:00
help
This option enables a sysfs memory/probe interface for testing.
See Documentation/memory-hotplug.txt for more information.
If you are unsure how to answer this question, answer N.
config ARCH_PROC_KCORE_TEXT
def_bool y
depends on X86_64 && PROC_KCORE
config ILLEGAL_POINTER_VALUE
hex
default 0 if X86_32
default 0xdead000000000000 if X86_64
source "mm/Kconfig"
config X86_PMEM_LEGACY_DEVICE
bool
config X86_PMEM_LEGACY
tristate "Support non-standard NVDIMMs and ADR protected memory"
depends on PHYS_ADDR_T_64BIT
depends on BLK_DEV
select X86_PMEM_LEGACY_DEVICE
select LIBNVDIMM
help
Treat memory marked using the non-standard e820 type of 12 as used
by the Intel Sandy Bridge-EP reference BIOS as protected memory.
The kernel will offer these regions to the 'pmem' driver so
they can be used for persistent storage.
Say Y if unsure.
config HIGHPTE
bool "Allocate 3rd-level pagetables from highmem"
depends on HIGHMEM
---help---
The VM uses one page table entry for each page of physical memory.
For systems with a lot of RAM, this can be wasteful of precious
low memory. Setting this option will put user-space page table
entries in high memory.
config X86_CHECK_BIOS_CORRUPTION
bool "Check for low memory corruption"
---help---
Periodically check for memory corruption in low memory, which
is suspected to be caused by BIOS. Even when enabled in the
configuration, it is disabled at runtime. Enable it by
setting "memory_corruption_check=1" on the kernel command
line. By default it scans the low 64k of memory every 60
seconds; see the memory_corruption_check_size and
memory_corruption_check_period parameters in
Documentation/kernel-parameters.txt to adjust this.
When enabled with the default parameters, this option has
almost no overhead, as it reserves a relatively small amount
of memory and scans it infrequently. It both detects corruption
and prevents it from affecting the running system.
It is, however, intended as a diagnostic tool; if repeatable
BIOS-originated corruption always affects the same memory,
you can use memmap= to prevent the kernel from using that
memory.
config X86_BOOTPARAM_MEMORY_CORRUPTION_CHECK
bool "Set the default setting of memory_corruption_check"
depends on X86_CHECK_BIOS_CORRUPTION
default y
---help---
Set whether the default state of memory_corruption_check is
on or off.
config X86_RESERVE_LOW
int "Amount of low memory, in kilobytes, to reserve for the BIOS"
default 64
range 4 640
---help---
Specify the amount of low memory to reserve for the BIOS.
The first page contains BIOS data structures that the kernel
must not use, so that page must always be reserved.
By default we reserve the first 64K of physical RAM, as a
number of BIOSes are known to corrupt that memory range
during events such as suspend/resume or monitor cable
insertion, so it must not be used by the kernel.
You can set this to 4 if you are absolutely sure that you
trust the BIOS to get all its memory reservations and usages
right. If you know your BIOS have problems beyond the
default 64K area, you can set this to 640 to avoid using the
entire low memory range.
If you have doubts about the BIOS (e.g. suspend/resume does
not work or there's kernel crashes after certain hardware
hotplug events) then you might want to enable
X86_CHECK_BIOS_CORRUPTION=y to allow the kernel to check
typical corruption patterns.
Leave this to the default value of 64 if you are unsure.
config MATH_EMULATION
bool
depends on MODIFY_LDT_SYSCALL
prompt "Math emulation" if X86_32
---help---
Linux can emulate a math coprocessor (used for floating point
operations) if you don't have one. 486DX and Pentium processors have
a math coprocessor built in, 486SX and 386 do not, unless you added
a 487DX or 387, respectively. (The messages during boot time can
give you some hints here ["man dmesg"].) Everyone needs either a
coprocessor or this emulation.
If you don't have a math coprocessor, you need to say Y here; if you
say Y here even though you have a coprocessor, the coprocessor will
be used nevertheless. (This behavior can be changed with the kernel
command line option "no387", which comes handy if your coprocessor
is broken. Try "man bootparam" or see the documentation of your boot
loader (lilo or loadlin) about how to pass options to the kernel at
boot time.) This means that it is a good idea to say Y here if you
intend to use this kernel on different machines.
More information about the internals of the Linux math coprocessor
emulation can be found in <file:arch/x86/math-emu/README>.
If you are not sure, say Y; apart from resulting in a 66 KB bigger
kernel, it won't hurt.
config MTRR
def_bool y
prompt "MTRR (Memory Type Range Register) support" if EXPERT
---help---
On Intel P6 family processors (Pentium Pro, Pentium II and later)
the Memory Type Range Registers (MTRRs) may be used to control
processor access to memory ranges. This is most useful if you have
a video (VGA) card on a PCI or AGP bus. Enabling write-combining
allows bus write transfers to be combined into a larger transfer
before bursting over the PCI/AGP bus. This can increase performance
of image write operations 2.5 times or more. Saying Y here creates a
/proc/mtrr file which may be used to manipulate your processor's
MTRRs. Typically the X server should use this.
This code has a reasonably generic interface so that similar
control registers on other processors can be easily supported
as well:
The Cyrix 6x86, 6x86MX and M II processors have Address Range
Registers (ARRs) which provide a similar functionality to MTRRs. For
these, the ARRs are used to emulate the MTRRs.
The AMD K6-2 (stepping 8 and above) and K6-3 processors have two
MTRRs. The Centaur C6 (WinChip) has 8 MCRs, allowing
write-combining. All of these processors are supported by this code
and it makes sense to say Y here if you have one of them.
Saying Y here also fixes a problem with buggy SMP BIOSes which only
set the MTRRs for the boot CPU and not for the secondary CPUs. This
can lead to all sorts of problems, so it's good to say Y here.
You can safely say Y even if your machine doesn't have MTRRs, you'll
just add about 9 KB to your kernel.
See <file:Documentation/x86/mtrr.txt> for more information.
config MTRR_SANITIZER
x86: change MTRR_SANITIZER to def_bool y This option has been added in v2.6.26 as a default-disabled feature and went through several revisions since then. The feature fixes a wide range of MTRR setup problems that BIOSes leave us with: slow system, slow Xorg, slow system when adding lots of RAM, etc., so we want to enable it by default for v2.6.28. See: [Bug 10508] Upgrade to 4GB of RAM messes up MTRRs http://bugzilla.kernel.org/show_bug.cgi?id=10508 and the test results in: http://lkml.org/lkml/2008/9/29/273 1. hpa reg00: base=0xc0000000 (3072MB), size=1024MB: uncachable, count=1 reg01: base=0x13c000000 (5056MB), size= 64MB: uncachable, count=1 reg02: base=0x00000000 ( 0MB), size=4096MB: write-back, count=1 reg03: base=0x100000000 (4096MB), size=1024MB: write-back, count=1 reg04: base=0xbf700000 (3063MB), size= 1MB: uncachable, count=1 reg05: base=0xbf800000 (3064MB), size= 8MB: uncachable, count=1 will get Found optimal setting for mtrr clean up gran_size: 1M chunk_size: 128M num_reg: 6 lose RAM: 0M range0: 0000000000000000 - 00000000c0000000 Setting variable MTRR 0, base: 0MB, range: 2048MB, type WB Setting variable MTRR 1, base: 2048MB, range: 1024MB, type WB hole: 00000000bf700000 - 00000000c0000000 Setting variable MTRR 2, base: 3063MB, range: 1MB, type UC Setting variable MTRR 3, base: 3064MB, range: 8MB, type UC range0: 0000000100000000 - 0000000140000000 Setting variable MTRR 4, base: 4096MB, range: 1024MB, type WB hole: 000000013c000000 - 0000000140000000 Setting variable MTRR 5, base: 5056MB, range: 64MB, type UC 2. Dylan Taft reg00: base=0x00000000 ( 0MB), size=4096MB: write-back, count=1 reg01: base=0x100000000 (4096MB), size= 512MB: write-back, count=1 reg02: base=0x120000000 (4608MB), size= 256MB: write-back, count=1 reg03: base=0xd0000000 (3328MB), size= 256MB: uncachable, count=1 reg04: base=0xe0000000 (3584MB), size= 512MB: uncachable, count=1 reg05: base=0xc7e00000 (3198MB), size= 2MB: uncachable, count=1 reg06: base=0xc8000000 (3200MB), size= 128MB: uncachable, count=1 will get Found optimal setting for mtrr clean up gran_size: 1M chunk_size: 4M num_reg: 6 lose RAM: 0M range0: 0000000000000000 - 00000000c8000000 Setting variable MTRR 0, base: 0MB, range: 2048MB, type WB Setting variable MTRR 1, base: 2048MB, range: 1024MB, type WB Setting variable MTRR 2, base: 3072MB, range: 128MB, type WB hole: 00000000c7e00000 - 00000000c8000000 Setting variable MTRR 3, base: 3198MB, range: 2MB, type UC rangeX: 0000000100000000 - 0000000130000000 Setting variable MTRR 4, base: 4096MB, range: 512MB, type WB Setting variable MTRR 5, base: 4608MB, range: 256MB, type WB 3. Gabriel reg00: base=0xd0000000 (3328MB), size= 256MB: uncachable, count=1 reg01: base=0xe0000000 (3584MB), size= 512MB: uncachable, count=1 reg02: base=0x00000000 ( 0MB), size=4096MB: write-back, count=1 reg03: base=0x100000000 (4096MB), size= 512MB: write-back, count=1 reg04: base=0x120000000 (4608MB), size= 128MB: write-back, count=1 reg05: base=0x128000000 (4736MB), size= 64MB: write-back, count=1 reg06: base=0xcf600000 (3318MB), size= 2MB: uncachable, count=1 will get Found optimal setting for mtrr clean up gran_size: 1M chunk_size: 16M num_reg: 7 lose RAM: 0M range0: 0000000000000000 - 00000000d0000000 Setting variable MTRR 0, base: 0MB, range: 2048MB, type WB Setting variable MTRR 1, base: 2048MB, range: 1024MB, type WB Setting variable MTRR 2, base: 3072MB, range: 256MB, type WB hole: 00000000cf600000 - 00000000cf800000 Setting variable MTRR 3, base: 3318MB, range: 2MB, type UC rangeX: 0000000100000000 - 000000012c000000 Setting variable MTRR 4, base: 4096MB, range: 512MB, type WB Setting variable MTRR 5, base: 4608MB, range: 128MB, type WB Setting variable MTRR 6, base: 4736MB, range: 64MB, type WB 4. Mika Fischer reg00: base=0xc0000000 (3072MB), size=1024MB: uncachable, count=1 reg01: base=0x00000000 ( 0MB), size=4096MB: write-back, count=1 reg02: base=0x100000000 (4096MB), size=1024MB: write-back, count=1 reg03: base=0xbf700000 (3063MB), size= 1MB: uncachable, count=1 reg04: base=0xbf800000 (3064MB), size= 8MB: uncachable, count=1 will get Found optimal setting for mtrr clean up gran_size: 1M chunk_size: 16M num_reg: 5 lose RAM: 0M range0: 0000000000000000 - 00000000c0000000 Setting variable MTRR 0, base: 0MB, range: 2048MB, type WB Setting variable MTRR 1, base: 2048MB, range: 1024MB, type WB hole: 00000000bf700000 - 00000000c0000000 Setting variable MTRR 2, base: 3063MB, range: 1MB, type UC Setting variable MTRR 3, base: 3064MB, range: 8MB, type UC rangeX: 0000000100000000 - 0000000140000000 Setting variable MTRR 4, base: 4096MB, range: 1024MB, type WB Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-01 07:29:40 +08:00
def_bool y
prompt "MTRR cleanup support"
depends on MTRR
---help---
Convert MTRR layout from continuous to discrete, so X drivers can
add writeback entries.
Can be disabled with disable_mtrr_cleanup on the kernel command line.
The largest mtrr entry size for a continuous block can be set with
mtrr_chunk_size.
x86: change MTRR_SANITIZER to def_bool y This option has been added in v2.6.26 as a default-disabled feature and went through several revisions since then. The feature fixes a wide range of MTRR setup problems that BIOSes leave us with: slow system, slow Xorg, slow system when adding lots of RAM, etc., so we want to enable it by default for v2.6.28. See: [Bug 10508] Upgrade to 4GB of RAM messes up MTRRs http://bugzilla.kernel.org/show_bug.cgi?id=10508 and the test results in: http://lkml.org/lkml/2008/9/29/273 1. hpa reg00: base=0xc0000000 (3072MB), size=1024MB: uncachable, count=1 reg01: base=0x13c000000 (5056MB), size= 64MB: uncachable, count=1 reg02: base=0x00000000 ( 0MB), size=4096MB: write-back, count=1 reg03: base=0x100000000 (4096MB), size=1024MB: write-back, count=1 reg04: base=0xbf700000 (3063MB), size= 1MB: uncachable, count=1 reg05: base=0xbf800000 (3064MB), size= 8MB: uncachable, count=1 will get Found optimal setting for mtrr clean up gran_size: 1M chunk_size: 128M num_reg: 6 lose RAM: 0M range0: 0000000000000000 - 00000000c0000000 Setting variable MTRR 0, base: 0MB, range: 2048MB, type WB Setting variable MTRR 1, base: 2048MB, range: 1024MB, type WB hole: 00000000bf700000 - 00000000c0000000 Setting variable MTRR 2, base: 3063MB, range: 1MB, type UC Setting variable MTRR 3, base: 3064MB, range: 8MB, type UC range0: 0000000100000000 - 0000000140000000 Setting variable MTRR 4, base: 4096MB, range: 1024MB, type WB hole: 000000013c000000 - 0000000140000000 Setting variable MTRR 5, base: 5056MB, range: 64MB, type UC 2. Dylan Taft reg00: base=0x00000000 ( 0MB), size=4096MB: write-back, count=1 reg01: base=0x100000000 (4096MB), size= 512MB: write-back, count=1 reg02: base=0x120000000 (4608MB), size= 256MB: write-back, count=1 reg03: base=0xd0000000 (3328MB), size= 256MB: uncachable, count=1 reg04: base=0xe0000000 (3584MB), size= 512MB: uncachable, count=1 reg05: base=0xc7e00000 (3198MB), size= 2MB: uncachable, count=1 reg06: base=0xc8000000 (3200MB), size= 128MB: uncachable, count=1 will get Found optimal setting for mtrr clean up gran_size: 1M chunk_size: 4M num_reg: 6 lose RAM: 0M range0: 0000000000000000 - 00000000c8000000 Setting variable MTRR 0, base: 0MB, range: 2048MB, type WB Setting variable MTRR 1, base: 2048MB, range: 1024MB, type WB Setting variable MTRR 2, base: 3072MB, range: 128MB, type WB hole: 00000000c7e00000 - 00000000c8000000 Setting variable MTRR 3, base: 3198MB, range: 2MB, type UC rangeX: 0000000100000000 - 0000000130000000 Setting variable MTRR 4, base: 4096MB, range: 512MB, type WB Setting variable MTRR 5, base: 4608MB, range: 256MB, type WB 3. Gabriel reg00: base=0xd0000000 (3328MB), size= 256MB: uncachable, count=1 reg01: base=0xe0000000 (3584MB), size= 512MB: uncachable, count=1 reg02: base=0x00000000 ( 0MB), size=4096MB: write-back, count=1 reg03: base=0x100000000 (4096MB), size= 512MB: write-back, count=1 reg04: base=0x120000000 (4608MB), size= 128MB: write-back, count=1 reg05: base=0x128000000 (4736MB), size= 64MB: write-back, count=1 reg06: base=0xcf600000 (3318MB), size= 2MB: uncachable, count=1 will get Found optimal setting for mtrr clean up gran_size: 1M chunk_size: 16M num_reg: 7 lose RAM: 0M range0: 0000000000000000 - 00000000d0000000 Setting variable MTRR 0, base: 0MB, range: 2048MB, type WB Setting variable MTRR 1, base: 2048MB, range: 1024MB, type WB Setting variable MTRR 2, base: 3072MB, range: 256MB, type WB hole: 00000000cf600000 - 00000000cf800000 Setting variable MTRR 3, base: 3318MB, range: 2MB, type UC rangeX: 0000000100000000 - 000000012c000000 Setting variable MTRR 4, base: 4096MB, range: 512MB, type WB Setting variable MTRR 5, base: 4608MB, range: 128MB, type WB Setting variable MTRR 6, base: 4736MB, range: 64MB, type WB 4. Mika Fischer reg00: base=0xc0000000 (3072MB), size=1024MB: uncachable, count=1 reg01: base=0x00000000 ( 0MB), size=4096MB: write-back, count=1 reg02: base=0x100000000 (4096MB), size=1024MB: write-back, count=1 reg03: base=0xbf700000 (3063MB), size= 1MB: uncachable, count=1 reg04: base=0xbf800000 (3064MB), size= 8MB: uncachable, count=1 will get Found optimal setting for mtrr clean up gran_size: 1M chunk_size: 16M num_reg: 5 lose RAM: 0M range0: 0000000000000000 - 00000000c0000000 Setting variable MTRR 0, base: 0MB, range: 2048MB, type WB Setting variable MTRR 1, base: 2048MB, range: 1024MB, type WB hole: 00000000bf700000 - 00000000c0000000 Setting variable MTRR 2, base: 3063MB, range: 1MB, type UC Setting variable MTRR 3, base: 3064MB, range: 8MB, type UC rangeX: 0000000100000000 - 0000000140000000 Setting variable MTRR 4, base: 4096MB, range: 1024MB, type WB Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-10-01 07:29:40 +08:00
If unsure, say Y.
config MTRR_SANITIZER_ENABLE_DEFAULT
int "MTRR cleanup enable value (0-1)"
range 0 1
default "0"
depends on MTRR_SANITIZER
---help---
Enable mtrr cleanup default value
config MTRR_SANITIZER_SPARE_REG_NR_DEFAULT
int "MTRR cleanup spare reg num (0-7)"
range 0 7
default "1"
depends on MTRR_SANITIZER
---help---
mtrr cleanup spare entries default, it can be changed via
mtrr_spare_reg_nr=N on the kernel command line.
config X86_PAT
def_bool y
prompt "x86 PAT support" if EXPERT
depends on MTRR
---help---
Use PAT attributes to setup page level cache control.
PATs are the modern equivalents of MTRRs and are much more
flexible than MTRRs.
Say N here if you see bootup problems (boot crash, boot hang,
spontaneous reboots) or a non-working video driver.
If unsure, say Y.
config ARCH_USES_PG_UNCACHED
def_bool y
depends on X86_PAT
x86, random: Architectural inlines to get random integers with RDRAND Architectural inlines to get random ints and longs using the RDRAND instruction. Intel has introduced a new RDRAND instruction, a Digital Random Number Generator (DRNG), which is functionally an high bandwidth entropy source, cryptographic whitener, and integrity monitor all built into hardware. This enables RDRAND to be used directly, bypassing the kernel random number pool. For technical documentation, see: http://software.intel.com/en-us/articles/download-the-latest-bull-mountain-software-implementation-guide/ In this patch, this is *only* used for the nonblocking random number pool. RDRAND is a nonblocking source, similar to our /dev/urandom, and is therefore not a direct replacement for /dev/random. The architectural hooks presented in the previous patch only feed the kernel internal users, which only use the nonblocking pool, and so this is not a problem. Since this instruction is available in userspace, there is no reason to have a /dev/hw_rng device driver for the purpose of feeding rngd. This is especially so since RDRAND is a nonblocking source, and needs additional whitening and reduction (see the above technical documentation for details) in order to be of "pure entropy source" quality. The CONFIG_EXPERT compile-time option can be used to disable this use of RDRAND. Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Originally-by: Fenghua Yu <fenghua.yu@intel.com> Cc: Matt Mackall <mpm@selenic.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: "Theodore Ts'o" <tytso@mit.edu>
2011-08-01 04:59:29 +08:00
config ARCH_RANDOM
def_bool y
prompt "x86 architectural random number generator" if EXPERT
---help---
Enable the x86 architectural RDRAND instruction
(Intel Bull Mountain technology) to generate random numbers.
If supported, this is a high bandwidth, cryptographically
secure hardware random number generator.
config X86_SMAP
def_bool y
prompt "Supervisor Mode Access Prevention" if EXPERT
---help---
Supervisor Mode Access Prevention (SMAP) is a security
feature in newer Intel processors. There is a small
performance cost if this enabled and turned on; there is
also a small increase in the kernel size if this is enabled.
If unsure, say Y.
config X86_INTEL_MPX
prompt "Intel MPX (Memory Protection Extensions)"
def_bool n
depends on CPU_SUP_INTEL
---help---
MPX provides hardware features that can be used in
conjunction with compiler-instrumented code to check
memory references. It is designed to detect buffer
overflow or underflow bugs.
This option enables running applications which are
instrumented or otherwise use MPX. It does not use MPX
itself inside the kernel or to protect the kernel
against bad memory references.
Enabling this option will make the kernel larger:
~8k of kernel text and 36 bytes of data on a 64-bit
defconfig. It adds a long to the 'mm_struct' which
will increase the kernel memory overhead of each
process and adds some branches to paths used during
exec() and munmap().
For details, see Documentation/x86/intel_mpx.txt
If unsure, say N.
config X86_INTEL_MEMORY_PROTECTION_KEYS
prompt "Intel Memory Protection Keys"
def_bool y
# Note: only available in 64-bit mode
depends on CPU_SUP_INTEL && X86_64
---help---
Memory Protection Keys provides a mechanism for enforcing
page-based protections, but without requiring modification of the
page tables when an application changes protection domains.
For details, see Documentation/x86/protection-keys.txt
If unsure, say y.
config EFI
bool "EFI runtime service support"
depends on ACPI
select UCS2_STRING
select EFI_RUNTIME_WRAPPERS
---help---
This enables the kernel to use EFI runtime services that are
available (such as the EFI variable services).
This option is only useful on systems that have EFI firmware.
In addition, you should use the latest ELILO loader available
at <http://elilo.sourceforge.net> in order to take advantage
of EFI runtime services. However, even with this option, the
resultant kernel should continue to boot on existing non-EFI
platforms.
x86, efi: EFI boot stub support There is currently a large divide between kernel development and the development of EFI boot loaders. The idea behind this patch is to give the kernel developers full control over the EFI boot process. As H. Peter Anvin put it, "The 'kernel carries its own stub' approach been very successful in dealing with BIOS, and would make a lot of sense to me for EFI as well." This patch introduces an EFI boot stub that allows an x86 bzImage to be loaded and executed by EFI firmware. The bzImage appears to the firmware as an EFI application. Luckily there are enough free bits within the bzImage header so that it can masquerade as an EFI application, thereby coercing the EFI firmware into loading it and jumping to its entry point. The beauty of this masquerading approach is that both BIOS and EFI boot loaders can still load and run the same bzImage, thereby allowing a single kernel image to work in any boot environment. The EFI boot stub supports multiple initrds, but they must exist on the same partition as the bzImage. Command-line arguments for the kernel can be appended after the bzImage name when run from the EFI shell, e.g. Shell> bzImage console=ttyS0 root=/dev/sdb initrd=initrd.img v7: - Fix checkpatch warnings. v6: - Try to allocate initrd memory just below hdr->inird_addr_max. v5: - load_options_size is UTF-16, which needs dividing by 2 to convert to the corresponding ASCII size. v4: - Don't read more than image->load_options_size v3: - Fix following warnings when compiling CONFIG_EFI_STUB=n arch/x86/boot/tools/build.c: In function ‘main’: arch/x86/boot/tools/build.c:138:24: warning: unused variable ‘pe_header’ arch/x86/boot/tools/build.c:138:15: warning: unused variable ‘file_sz’ - As reported by Matthew Garrett, some Apple machines have GOPs that don't have hardware attached. We need to weed these out by searching for ones that handle the PCIIO protocol. - Don't allocate memory if no initrds are on cmdline - Don't trust image->load_options_size Maarten Lankhorst noted: - Don't strip first argument when booted from efibootmgr - Don't allocate too much memory for cmdline - Don't update cmdline_size, the kernel considers it read-only - Don't accept '\n' for initrd names v2: - File alignment was too large, was 8192 should be 512. Reported by Maarten Lankhorst on LKML. - Added UGA support for graphics - Use VIDEO_TYPE_EFI instead of hard-coded number. - Move linelength assignment until after we've assigned depth - Dynamically fill out AddressOfEntryPoint in tools/build.c - Don't use magic number for GDT/TSS stuff. Requested by Andi Kleen - The bzImage may need to be relocated as it may have been loaded at a high address address by the firmware. This was required to get my macbook booting because the firmware loaded it at 0x7cxxxxxx, which triggers this error in decompress_kernel(), if (heap > ((-__PAGE_OFFSET-(128<<20)-1) & 0x7fffffff)) error("Destination address too large"); Cc: Mike Waychison <mikew@google.com> Cc: Matthew Garrett <mjg@redhat.com> Tested-by: Henrik Rydberg <rydberg@euromail.se> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Link: http://lkml.kernel.org/r/1321383097.2657.9.camel@mfleming-mobl1.ger.corp.intel.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2011-12-13 05:27:52 +08:00
config EFI_STUB
bool "EFI stub support"
depends on EFI && !X86_USE_3DNOW
select RELOCATABLE
x86, efi: EFI boot stub support There is currently a large divide between kernel development and the development of EFI boot loaders. The idea behind this patch is to give the kernel developers full control over the EFI boot process. As H. Peter Anvin put it, "The 'kernel carries its own stub' approach been very successful in dealing with BIOS, and would make a lot of sense to me for EFI as well." This patch introduces an EFI boot stub that allows an x86 bzImage to be loaded and executed by EFI firmware. The bzImage appears to the firmware as an EFI application. Luckily there are enough free bits within the bzImage header so that it can masquerade as an EFI application, thereby coercing the EFI firmware into loading it and jumping to its entry point. The beauty of this masquerading approach is that both BIOS and EFI boot loaders can still load and run the same bzImage, thereby allowing a single kernel image to work in any boot environment. The EFI boot stub supports multiple initrds, but they must exist on the same partition as the bzImage. Command-line arguments for the kernel can be appended after the bzImage name when run from the EFI shell, e.g. Shell> bzImage console=ttyS0 root=/dev/sdb initrd=initrd.img v7: - Fix checkpatch warnings. v6: - Try to allocate initrd memory just below hdr->inird_addr_max. v5: - load_options_size is UTF-16, which needs dividing by 2 to convert to the corresponding ASCII size. v4: - Don't read more than image->load_options_size v3: - Fix following warnings when compiling CONFIG_EFI_STUB=n arch/x86/boot/tools/build.c: In function ‘main’: arch/x86/boot/tools/build.c:138:24: warning: unused variable ‘pe_header’ arch/x86/boot/tools/build.c:138:15: warning: unused variable ‘file_sz’ - As reported by Matthew Garrett, some Apple machines have GOPs that don't have hardware attached. We need to weed these out by searching for ones that handle the PCIIO protocol. - Don't allocate memory if no initrds are on cmdline - Don't trust image->load_options_size Maarten Lankhorst noted: - Don't strip first argument when booted from efibootmgr - Don't allocate too much memory for cmdline - Don't update cmdline_size, the kernel considers it read-only - Don't accept '\n' for initrd names v2: - File alignment was too large, was 8192 should be 512. Reported by Maarten Lankhorst on LKML. - Added UGA support for graphics - Use VIDEO_TYPE_EFI instead of hard-coded number. - Move linelength assignment until after we've assigned depth - Dynamically fill out AddressOfEntryPoint in tools/build.c - Don't use magic number for GDT/TSS stuff. Requested by Andi Kleen - The bzImage may need to be relocated as it may have been loaded at a high address address by the firmware. This was required to get my macbook booting because the firmware loaded it at 0x7cxxxxxx, which triggers this error in decompress_kernel(), if (heap > ((-__PAGE_OFFSET-(128<<20)-1) & 0x7fffffff)) error("Destination address too large"); Cc: Mike Waychison <mikew@google.com> Cc: Matthew Garrett <mjg@redhat.com> Tested-by: Henrik Rydberg <rydberg@euromail.se> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Link: http://lkml.kernel.org/r/1321383097.2657.9.camel@mfleming-mobl1.ger.corp.intel.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2011-12-13 05:27:52 +08:00
---help---
This kernel feature allows a bzImage to be loaded directly
by EFI firmware without the use of a bootloader.
See Documentation/efi-stub.txt for more information.
config EFI_MIXED
bool "EFI mixed-mode support"
depends on EFI_STUB && X86_64
---help---
Enabling this feature allows a 64-bit kernel to be booted
on a 32-bit firmware, provided that your CPU supports 64-bit
mode.
Note that it is not possible to boot a mixed-mode enabled
kernel via the EFI boot stub - a bootloader that supports
the EFI handover protocol must be used.
If unsure, say N.
config SECCOMP
def_bool y
prompt "Enable seccomp to safely compute untrusted bytecode"
---help---
This kernel feature is useful for number crunching applications
that may need to compute untrusted bytecode during their
execution. By using pipes or other transports made available to
the process as file descriptors supporting the read/write
syscalls, it's possible to isolate those applications in
their own address space using seccomp. Once seccomp is
enabled via prctl(PR_SET_SECCOMP), it cannot be disabled
and the task is only allowed to execute a few safe syscalls
defined by each seccomp mode.
If unsure, say Y. Only embedded should say N here.
source kernel/Kconfig.hz
config KEXEC
bool "kexec system call"
2015-09-10 06:38:55 +08:00
select KEXEC_CORE
---help---
kexec is a system call that implements the ability to shutdown your
current kernel, and to start another kernel. It is like a reboot
but it is independent of the system firmware. And like a reboot
you can start any kernel with it, not just Linux.
The name comes from the similarity to the exec system call.
It is an ongoing process to be certain the hardware in a machine
is properly shutdown, so do not be surprised if this code does not
initially work for you. As of this writing the exact hardware
interface is strongly in flux, so no good recommendation can be
made.
kexec: create a new config option CONFIG_KEXEC_FILE for new syscall Currently new system call kexec_file_load() and all the associated code compiles if CONFIG_KEXEC=y. But new syscall also compiles purgatory code which currently uses gcc option -mcmodel=large. This option seems to be available only gcc 4.4 onwards. Hiding new functionality behind a new config option will not break existing users of old gcc. Those who wish to enable new functionality will require new gcc. Having said that, I am trying to figure out how can I move away from using -mcmodel=large but that can take a while. I think there are other advantages of introducing this new config option. As this option will be enabled only on x86_64, other arches don't have to compile generic kexec code which will never be used. This new code selects CRYPTO=y and CRYPTO_SHA256=y. And all other arches had to do this for CONFIG_KEXEC. Now with introduction of new config option, we can remove crypto dependency from other arches. Now CONFIG_KEXEC_FILE is available only on x86_64. So whereever I had CONFIG_X86_64 defined, I got rid of that. For CONFIG_KEXEC_FILE, instead of doing select CRYPTO=y, I changed it to "depends on CRYPTO=y". This should be safer as "select" is not recursive. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: H. Peter Anvin <hpa@zytor.com> Tested-by: Shaun Ruffell <sruffell@digium.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-30 06:18:46 +08:00
config KEXEC_FILE
bool "kexec file based system call"
2015-09-10 06:38:55 +08:00
select KEXEC_CORE
kexec: create a new config option CONFIG_KEXEC_FILE for new syscall Currently new system call kexec_file_load() and all the associated code compiles if CONFIG_KEXEC=y. But new syscall also compiles purgatory code which currently uses gcc option -mcmodel=large. This option seems to be available only gcc 4.4 onwards. Hiding new functionality behind a new config option will not break existing users of old gcc. Those who wish to enable new functionality will require new gcc. Having said that, I am trying to figure out how can I move away from using -mcmodel=large but that can take a while. I think there are other advantages of introducing this new config option. As this option will be enabled only on x86_64, other arches don't have to compile generic kexec code which will never be used. This new code selects CRYPTO=y and CRYPTO_SHA256=y. And all other arches had to do this for CONFIG_KEXEC. Now with introduction of new config option, we can remove crypto dependency from other arches. Now CONFIG_KEXEC_FILE is available only on x86_64. So whereever I had CONFIG_X86_64 defined, I got rid of that. For CONFIG_KEXEC_FILE, instead of doing select CRYPTO=y, I changed it to "depends on CRYPTO=y". This should be safer as "select" is not recursive. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: H. Peter Anvin <hpa@zytor.com> Tested-by: Shaun Ruffell <sruffell@digium.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-30 06:18:46 +08:00
select BUILD_BIN2C
depends on X86_64
depends on CRYPTO=y
depends on CRYPTO_SHA256=y
---help---
This is new version of kexec system call. This system call is
file based and takes file descriptors as system call argument
for kernel and initramfs as opposed to list of segments as
accepted by previous system call.
kexec: verify the signature of signed PE bzImage This is the final piece of the puzzle of verifying kernel image signature during kexec_file_load() syscall. This patch calls into PE file routines to verify signature of bzImage. If signature are valid, kexec_file_load() succeeds otherwise it fails. Two new config options have been introduced. First one is CONFIG_KEXEC_VERIFY_SIG. This option enforces that kernel has to be validly signed otherwise kernel load will fail. If this option is not set, no signature verification will be done. Only exception will be when secureboot is enabled. In that case signature verification should be automatically enforced when secureboot is enabled. But that will happen when secureboot patches are merged. Second config option is CONFIG_KEXEC_BZIMAGE_VERIFY_SIG. This option enables signature verification support on bzImage. If this option is not set and previous one is set, kernel image loading will fail because kernel does not have support to verify signature of bzImage. I tested these patches with both "pesign" and "sbsign" signed bzImages. I used signing_key.priv key and signing_key.x509 cert for signing as generated during kernel build process (if module signing is enabled). Used following method to sign bzImage. pesign ====== - Convert DER format cert to PEM format cert openssl x509 -in signing_key.x509 -inform DER -out signing_key.x509.PEM -outform PEM - Generate a .p12 file from existing cert and private key file openssl pkcs12 -export -out kernel-key.p12 -inkey signing_key.priv -in signing_key.x509.PEM - Import .p12 file into pesign db pk12util -i /tmp/kernel-key.p12 -d /etc/pki/pesign - Sign bzImage pesign -i /boot/vmlinuz-3.16.0-rc3+ -o /boot/vmlinuz-3.16.0-rc3+.signed.pesign -c "Glacier signing key - Magrathea" -s sbsign ====== sbsign --key signing_key.priv --cert signing_key.x509.PEM --output /boot/vmlinuz-3.16.0-rc3+.signed.sbsign /boot/vmlinuz-3.16.0-rc3+ Patch details: Well all the hard work is done in previous patches. Now bzImage loader has just call into that code and verify whether bzImage signature are valid or not. Also create two config options. First one is CONFIG_KEXEC_VERIFY_SIG. This option enforces that kernel has to be validly signed otherwise kernel load will fail. If this option is not set, no signature verification will be done. Only exception will be when secureboot is enabled. In that case signature verification should be automatically enforced when secureboot is enabled. But that will happen when secureboot patches are merged. Second config option is CONFIG_KEXEC_BZIMAGE_VERIFY_SIG. This option enables signature verification support on bzImage. If this option is not set and previous one is set, kernel image loading will fail because kernel does not have support to verify signature of bzImage. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Cc: Borislav Petkov <bp@suse.de> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Eric Biederman <ebiederm@xmission.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Matthew Garrett <mjg59@srcf.ucam.org> Cc: Greg Kroah-Hartman <greg@kroah.com> Cc: Dave Young <dyoung@redhat.com> Cc: WANG Chao <chaowang@redhat.com> Cc: Baoquan He <bhe@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Matt Fleming <matt@console-pimps.org> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-09 05:26:13 +08:00
config KEXEC_VERIFY_SIG
bool "Verify kernel signature during kexec_file_load() syscall"
kexec: create a new config option CONFIG_KEXEC_FILE for new syscall Currently new system call kexec_file_load() and all the associated code compiles if CONFIG_KEXEC=y. But new syscall also compiles purgatory code which currently uses gcc option -mcmodel=large. This option seems to be available only gcc 4.4 onwards. Hiding new functionality behind a new config option will not break existing users of old gcc. Those who wish to enable new functionality will require new gcc. Having said that, I am trying to figure out how can I move away from using -mcmodel=large but that can take a while. I think there are other advantages of introducing this new config option. As this option will be enabled only on x86_64, other arches don't have to compile generic kexec code which will never be used. This new code selects CRYPTO=y and CRYPTO_SHA256=y. And all other arches had to do this for CONFIG_KEXEC. Now with introduction of new config option, we can remove crypto dependency from other arches. Now CONFIG_KEXEC_FILE is available only on x86_64. So whereever I had CONFIG_X86_64 defined, I got rid of that. For CONFIG_KEXEC_FILE, instead of doing select CRYPTO=y, I changed it to "depends on CRYPTO=y". This should be safer as "select" is not recursive. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: H. Peter Anvin <hpa@zytor.com> Tested-by: Shaun Ruffell <sruffell@digium.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-30 06:18:46 +08:00
depends on KEXEC_FILE
kexec: verify the signature of signed PE bzImage This is the final piece of the puzzle of verifying kernel image signature during kexec_file_load() syscall. This patch calls into PE file routines to verify signature of bzImage. If signature are valid, kexec_file_load() succeeds otherwise it fails. Two new config options have been introduced. First one is CONFIG_KEXEC_VERIFY_SIG. This option enforces that kernel has to be validly signed otherwise kernel load will fail. If this option is not set, no signature verification will be done. Only exception will be when secureboot is enabled. In that case signature verification should be automatically enforced when secureboot is enabled. But that will happen when secureboot patches are merged. Second config option is CONFIG_KEXEC_BZIMAGE_VERIFY_SIG. This option enables signature verification support on bzImage. If this option is not set and previous one is set, kernel image loading will fail because kernel does not have support to verify signature of bzImage. I tested these patches with both "pesign" and "sbsign" signed bzImages. I used signing_key.priv key and signing_key.x509 cert for signing as generated during kernel build process (if module signing is enabled). Used following method to sign bzImage. pesign ====== - Convert DER format cert to PEM format cert openssl x509 -in signing_key.x509 -inform DER -out signing_key.x509.PEM -outform PEM - Generate a .p12 file from existing cert and private key file openssl pkcs12 -export -out kernel-key.p12 -inkey signing_key.priv -in signing_key.x509.PEM - Import .p12 file into pesign db pk12util -i /tmp/kernel-key.p12 -d /etc/pki/pesign - Sign bzImage pesign -i /boot/vmlinuz-3.16.0-rc3+ -o /boot/vmlinuz-3.16.0-rc3+.signed.pesign -c "Glacier signing key - Magrathea" -s sbsign ====== sbsign --key signing_key.priv --cert signing_key.x509.PEM --output /boot/vmlinuz-3.16.0-rc3+.signed.sbsign /boot/vmlinuz-3.16.0-rc3+ Patch details: Well all the hard work is done in previous patches. Now bzImage loader has just call into that code and verify whether bzImage signature are valid or not. Also create two config options. First one is CONFIG_KEXEC_VERIFY_SIG. This option enforces that kernel has to be validly signed otherwise kernel load will fail. If this option is not set, no signature verification will be done. Only exception will be when secureboot is enabled. In that case signature verification should be automatically enforced when secureboot is enabled. But that will happen when secureboot patches are merged. Second config option is CONFIG_KEXEC_BZIMAGE_VERIFY_SIG. This option enables signature verification support on bzImage. If this option is not set and previous one is set, kernel image loading will fail because kernel does not have support to verify signature of bzImage. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Cc: Borislav Petkov <bp@suse.de> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Eric Biederman <ebiederm@xmission.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Matthew Garrett <mjg59@srcf.ucam.org> Cc: Greg Kroah-Hartman <greg@kroah.com> Cc: Dave Young <dyoung@redhat.com> Cc: WANG Chao <chaowang@redhat.com> Cc: Baoquan He <bhe@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Matt Fleming <matt@console-pimps.org> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-09 05:26:13 +08:00
---help---
This option makes kernel signature verification mandatory for
the kexec_file_load() syscall.
In addition to that option, you need to enable signature
verification for the corresponding kernel image type being
loaded in order for this to work.
kexec: verify the signature of signed PE bzImage This is the final piece of the puzzle of verifying kernel image signature during kexec_file_load() syscall. This patch calls into PE file routines to verify signature of bzImage. If signature are valid, kexec_file_load() succeeds otherwise it fails. Two new config options have been introduced. First one is CONFIG_KEXEC_VERIFY_SIG. This option enforces that kernel has to be validly signed otherwise kernel load will fail. If this option is not set, no signature verification will be done. Only exception will be when secureboot is enabled. In that case signature verification should be automatically enforced when secureboot is enabled. But that will happen when secureboot patches are merged. Second config option is CONFIG_KEXEC_BZIMAGE_VERIFY_SIG. This option enables signature verification support on bzImage. If this option is not set and previous one is set, kernel image loading will fail because kernel does not have support to verify signature of bzImage. I tested these patches with both "pesign" and "sbsign" signed bzImages. I used signing_key.priv key and signing_key.x509 cert for signing as generated during kernel build process (if module signing is enabled). Used following method to sign bzImage. pesign ====== - Convert DER format cert to PEM format cert openssl x509 -in signing_key.x509 -inform DER -out signing_key.x509.PEM -outform PEM - Generate a .p12 file from existing cert and private key file openssl pkcs12 -export -out kernel-key.p12 -inkey signing_key.priv -in signing_key.x509.PEM - Import .p12 file into pesign db pk12util -i /tmp/kernel-key.p12 -d /etc/pki/pesign - Sign bzImage pesign -i /boot/vmlinuz-3.16.0-rc3+ -o /boot/vmlinuz-3.16.0-rc3+.signed.pesign -c "Glacier signing key - Magrathea" -s sbsign ====== sbsign --key signing_key.priv --cert signing_key.x509.PEM --output /boot/vmlinuz-3.16.0-rc3+.signed.sbsign /boot/vmlinuz-3.16.0-rc3+ Patch details: Well all the hard work is done in previous patches. Now bzImage loader has just call into that code and verify whether bzImage signature are valid or not. Also create two config options. First one is CONFIG_KEXEC_VERIFY_SIG. This option enforces that kernel has to be validly signed otherwise kernel load will fail. If this option is not set, no signature verification will be done. Only exception will be when secureboot is enabled. In that case signature verification should be automatically enforced when secureboot is enabled. But that will happen when secureboot patches are merged. Second config option is CONFIG_KEXEC_BZIMAGE_VERIFY_SIG. This option enables signature verification support on bzImage. If this option is not set and previous one is set, kernel image loading will fail because kernel does not have support to verify signature of bzImage. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Cc: Borislav Petkov <bp@suse.de> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Eric Biederman <ebiederm@xmission.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Matthew Garrett <mjg59@srcf.ucam.org> Cc: Greg Kroah-Hartman <greg@kroah.com> Cc: Dave Young <dyoung@redhat.com> Cc: WANG Chao <chaowang@redhat.com> Cc: Baoquan He <bhe@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Matt Fleming <matt@console-pimps.org> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-09 05:26:13 +08:00
config KEXEC_BZIMAGE_VERIFY_SIG
bool "Enable bzImage signature verification support"
depends on KEXEC_VERIFY_SIG
depends on SIGNED_PE_FILE_VERIFICATION
select SYSTEM_TRUSTED_KEYRING
---help---
Enable bzImage signature verification support.
config CRASH_DUMP
bool "kernel crash dumps"
depends on X86_64 || (X86_32 && HIGHMEM)
---help---
Generate crash dump after being started by kexec.
This should be normally only set in special crash dump kernels
which are loaded in the main kernel with kexec-tools into
a specially reserved region and then later executed after
a crash by kdump/kexec. The crash dump kernel must be compiled
to a memory address not used by the main kernel or BIOS using
PHYSICAL_START, or it must be built as a relocatable image
(CONFIG_RELOCATABLE=y).
For more details see Documentation/kdump/kdump.txt
kexec jump This patch provides an enhancement to kexec/kdump. It implements the following features: - Backup/restore memory used by the original kernel before/after kexec. - Save/restore CPU state before/after kexec. The features of this patch can be used as a general method to call program in physical mode (paging turning off). This can be used to call BIOS code under Linux. kexec-tools needs to be patched to support kexec jump. The patches and the precompiled kexec can be download from the following URL: source: http://khibernation.sourceforge.net/download/release_v10/kexec-tools/kexec-tools-src_git_kh10.tar.bz2 patches: http://khibernation.sourceforge.net/download/release_v10/kexec-tools/kexec-tools-patches_git_kh10.tar.bz2 binary: http://khibernation.sourceforge.net/download/release_v10/kexec-tools/kexec_git_kh10 Usage example of calling some physical mode code and return: 1. Compile and install patched kernel with following options selected: CONFIG_X86_32=y CONFIG_KEXEC=y CONFIG_PM=y CONFIG_KEXEC_JUMP=y 2. Build patched kexec-tool or download the pre-built one. 3. Build some physical mode executable named such as "phy_mode" 4. Boot kernel compiled in step 1. 5. Load physical mode executable with /sbin/kexec. The shell command line can be as follow: /sbin/kexec --load-preserve-context --args-none phy_mode 6. Call physical mode executable with following shell command line: /sbin/kexec -e Implementation point: To support jumping without reserving memory. One shadow backup page (source page) is allocated for each page used by kexeced code image (destination page). When do kexec_load, the image of kexeced code is loaded into source pages, and before executing, the destination pages and the source pages are swapped, so the contents of destination pages are backupped. Before jumping to the kexeced code image and after jumping back to the original kernel, the destination pages and the source pages are swapped too. C ABI (calling convention) is used as communication protocol between kernel and called code. A flag named KEXEC_PRESERVE_CONTEXT for sys_kexec_load is added to indicate that the loaded kernel image is used for jumping back. Now, only the i386 architecture is supported. Signed-off-by: Huang Ying <ying.huang@intel.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: Nigel Cunningham <nigel@nigel.suspend2.net> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-26 10:45:07 +08:00
config KEXEC_JUMP
bool "kexec jump"
depends on KEXEC && HIBERNATION
---help---
kexec jump: save/restore device state This patch implements devices state save/restore before after kexec. This patch together with features in kexec_jump patch can be used for following: - A simple hibernation implementation without ACPI support. You can kexec a hibernating kernel, save the memory image of original system and shutdown the system. When resuming, you restore the memory image of original system via ordinary kexec load then jump back. - Kernel/system debug through making system snapshot. You can make system snapshot, jump back, do some thing and make another system snapshot. - Cooperative multi-kernel/system. With kexec jump, you can switch between several kernels/systems quickly without boot process except the first time. This appears like swap a whole kernel/system out/in. - A general method to call program in physical mode (paging turning off). This can be used to invoke BIOS code under Linux. The following user-space tools can be used with kexec jump: - kexec-tools needs to be patched to support kexec jump. The patches and the precompiled kexec can be download from the following URL: source: http://khibernation.sourceforge.net/download/release_v10/kexec-tools/kexec-tools-src_git_kh10.tar.bz2 patches: http://khibernation.sourceforge.net/download/release_v10/kexec-tools/kexec-tools-patches_git_kh10.tar.bz2 binary: http://khibernation.sourceforge.net/download/release_v10/kexec-tools/kexec_git_kh10 - makedumpfile with patches are used as memory image saving tool, it can exclude free pages from original kernel memory image file. The patches and the precompiled makedumpfile can be download from the following URL: source: http://khibernation.sourceforge.net/download/release_v10/makedumpfile/makedumpfile-src_cvs_kh10.tar.bz2 patches: http://khibernation.sourceforge.net/download/release_v10/makedumpfile/makedumpfile-patches_cvs_kh10.tar.bz2 binary: http://khibernation.sourceforge.net/download/release_v10/makedumpfile/makedumpfile_cvs_kh10 - An initramfs image can be used as the root file system of kexeced kernel. An initramfs image built with "BuildRoot" can be downloaded from the following URL: initramfs image: http://khibernation.sourceforge.net/download/release_v10/initramfs/rootfs_cvs_kh10.gz All user space tools above are included in the initramfs image. Usage example of simple hibernation: 1. Compile and install patched kernel with following options selected: CONFIG_X86_32=y CONFIG_RELOCATABLE=y CONFIG_KEXEC=y CONFIG_CRASH_DUMP=y CONFIG_PM=y CONFIG_HIBERNATION=y CONFIG_KEXEC_JUMP=y 2. Build an initramfs image contains kexec-tool and makedumpfile, or download the pre-built initramfs image, called rootfs.gz in following text. 3. Prepare a partition to save memory image of original kernel, called hibernating partition in following text. 4. Boot kernel compiled in step 1 (kernel A). 5. In the kernel A, load kernel compiled in step 1 (kernel B) with /sbin/kexec. The shell command line can be as follow: /sbin/kexec --load-preserve-context /boot/bzImage --mem-min=0x100000 --mem-max=0xffffff --initrd=rootfs.gz 6. Boot the kernel B with following shell command line: /sbin/kexec -e 7. The kernel B will boot as normal kexec. In kernel B the memory image of kernel A can be saved into hibernating partition as follow: jump_back_entry=`cat /proc/cmdline | tr ' ' '\n' | grep kexec_jump_back_entry | cut -d '='` echo $jump_back_entry > kexec_jump_back_entry cp /proc/vmcore dump.elf Then you can shutdown the machine as normal. 8. Boot kernel compiled in step 1 (kernel C). Use the rootfs.gz as root file system. 9. In kernel C, load the memory image of kernel A as follow: /sbin/kexec -l --args-none --entry=`cat kexec_jump_back_entry` dump.elf 10. Jump back to the kernel A as follow: /sbin/kexec -e Then, kernel A is resumed. Implementation point: To support jumping between two kernels, before jumping to (executing) the new kernel and jumping back to the original kernel, the devices are put into quiescent state, and the state of devices and CPU is saved. After jumping back from kexeced kernel and jumping to the new kernel, the state of devices and CPU are restored accordingly. The devices/CPU state save/restore code of software suspend is called to implement corresponding function. Known issues: - Because the segment number supported by sys_kexec_load is limited, hibernation image with many segments may not be load. This is planned to be eliminated by adding a new flag to sys_kexec_load to make a image can be loaded with multiple sys_kexec_load invoking. Now, only the i386 architecture is supported. Signed-off-by: Huang Ying <ying.huang@intel.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: Nigel Cunningham <nigel@nigel.suspend2.net> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-26 10:45:10 +08:00
Jump between original kernel and kexeced kernel and invoke
code in physical address mode via KEXEC
kexec jump This patch provides an enhancement to kexec/kdump. It implements the following features: - Backup/restore memory used by the original kernel before/after kexec. - Save/restore CPU state before/after kexec. The features of this patch can be used as a general method to call program in physical mode (paging turning off). This can be used to call BIOS code under Linux. kexec-tools needs to be patched to support kexec jump. The patches and the precompiled kexec can be download from the following URL: source: http://khibernation.sourceforge.net/download/release_v10/kexec-tools/kexec-tools-src_git_kh10.tar.bz2 patches: http://khibernation.sourceforge.net/download/release_v10/kexec-tools/kexec-tools-patches_git_kh10.tar.bz2 binary: http://khibernation.sourceforge.net/download/release_v10/kexec-tools/kexec_git_kh10 Usage example of calling some physical mode code and return: 1. Compile and install patched kernel with following options selected: CONFIG_X86_32=y CONFIG_KEXEC=y CONFIG_PM=y CONFIG_KEXEC_JUMP=y 2. Build patched kexec-tool or download the pre-built one. 3. Build some physical mode executable named such as "phy_mode" 4. Boot kernel compiled in step 1. 5. Load physical mode executable with /sbin/kexec. The shell command line can be as follow: /sbin/kexec --load-preserve-context --args-none phy_mode 6. Call physical mode executable with following shell command line: /sbin/kexec -e Implementation point: To support jumping without reserving memory. One shadow backup page (source page) is allocated for each page used by kexeced code image (destination page). When do kexec_load, the image of kexeced code is loaded into source pages, and before executing, the destination pages and the source pages are swapped, so the contents of destination pages are backupped. Before jumping to the kexeced code image and after jumping back to the original kernel, the destination pages and the source pages are swapped too. C ABI (calling convention) is used as communication protocol between kernel and called code. A flag named KEXEC_PRESERVE_CONTEXT for sys_kexec_load is added to indicate that the loaded kernel image is used for jumping back. Now, only the i386 architecture is supported. Signed-off-by: Huang Ying <ying.huang@intel.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: Nigel Cunningham <nigel@nigel.suspend2.net> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-26 10:45:07 +08:00
config PHYSICAL_START
hex "Physical address where the kernel is loaded" if (EXPERT || CRASH_DUMP)
default "0x1000000"
---help---
This gives the physical address where the kernel is loaded.
If kernel is a not relocatable (CONFIG_RELOCATABLE=n) then
bzImage will decompress itself to above physical address and
run from there. Otherwise, bzImage will run from the address where
it has been loaded by the boot loader and will ignore above physical
address.
In normal kdump cases one does not have to set/change this option
as now bzImage can be compiled as a completely relocatable image
(CONFIG_RELOCATABLE=y) and be used to load and run from a different
address. This option is mainly useful for the folks who don't want
to use a bzImage for capturing the crash dump and want to use a
vmlinux instead. vmlinux is not relocatable hence a kernel needs
to be specifically compiled to run from a specific memory area
(normally a reserved region) and this option comes handy.
So if you are using bzImage for capturing the crash dump,
leave the value here unchanged to 0x1000000 and set
CONFIG_RELOCATABLE=y. Otherwise if you plan to use vmlinux
for capturing the crash dump change this value to start of
the reserved region. In other words, it can be set based on
the "X" value as specified in the "crashkernel=YM@XM"
command line boot parameter passed to the panic-ed
kernel. Please take a look at Documentation/kdump/kdump.txt
for more details about crash dumps.
Usage of bzImage for capturing the crash dump is recommended as
one does not have to build two kernels. Same kernel can be used
as production kernel and capture kernel. Above option should have
gone away after relocatable bzImage support is introduced. But it
is present because there are users out there who continue to use
vmlinux for dump capture. This option should go away down the
line.
Don't change this unless you know what you are doing.
config RELOCATABLE
bool "Build a relocatable kernel"
default y
---help---
This builds a kernel image that retains relocation information
so it can be loaded someplace besides the default 1MB.
The relocations tend to make the kernel binary about 10% larger,
but are discarded at runtime.
One use is for the kexec on panic case where the recovery kernel
must live at a different physical address than the primary
kernel.
Note: If CONFIG_RELOCATABLE=y, then the kernel runs from the address
it has been loaded at and the compile time physical address
(CONFIG_PHYSICAL_START) is used as the minimum location.
config RANDOMIZE_BASE
bool "Randomize the address of the kernel image"
depends on RELOCATABLE
default n
---help---
Randomizes the physical and virtual address at which the
kernel image is decompressed, as a security feature that
deters exploit attempts relying on knowledge of the location
of kernel internals.
Entropy is generated using the RDRAND instruction if it is
supported. If RDTSC is supported, it is used as well. If
neither RDRAND nor RDTSC are supported, then randomness is
read from the i8254 timer.
The kernel will be offset by up to RANDOMIZE_BASE_MAX_OFFSET,
and aligned according to PHYSICAL_ALIGN. Since the kernel is
built using 2GiB addressing, and PHYSICAL_ALGIN must be at a
minimum of 2MiB, only 10 bits of entropy is theoretically
possible. At best, due to page table layouts, 64-bit can use
9 bits of entropy and 32-bit uses 8 bits.
If unsure, say N.
config RANDOMIZE_BASE_MAX_OFFSET
hex "Maximum kASLR offset allowed" if EXPERT
depends on RANDOMIZE_BASE
range 0x0 0x20000000 if X86_32
default "0x20000000" if X86_32
range 0x0 0x40000000 if X86_64
default "0x40000000" if X86_64
---help---
The lesser of RANDOMIZE_BASE_MAX_OFFSET and available physical
memory is used to determine the maximal offset in bytes that will
be applied to the kernel when kernel Address Space Layout
Randomization (kASLR) is active. This must be a multiple of
PHYSICAL_ALIGN.
On 32-bit this is limited to 512MiB by page table layouts. The
default is 512MiB.
On 64-bit this is limited by how the kernel fixmap page table is
positioned, so this cannot be larger than 1GiB currently. Without
RANDOMIZE_BASE, there is a 512MiB to 1.5GiB split between kernel
and modules. When RANDOMIZE_BASE_MAX_OFFSET is above 512MiB, the
modules area will shrink to compensate, up to the current maximum
1GiB to 1GiB split. The default is 1GiB.
If unsure, leave at the default value.
# Relocation on x86 needs some additional build support
config X86_NEED_RELOCS
def_bool y
depends on RANDOMIZE_BASE || (X86_32 && RELOCATABLE)
config PHYSICAL_ALIGN
hex "Alignment value to which kernel should be aligned"
default "0x200000"
range 0x2000 0x1000000 if X86_32
range 0x200000 0x1000000 if X86_64
---help---
This value puts the alignment restrictions on physical address
where kernel is loaded and run from. Kernel is compiled for an
address which meets above alignment restriction.
If bootloader loads the kernel at a non-aligned address and
CONFIG_RELOCATABLE is set, kernel will move itself to nearest
address aligned to above value and run from there.
If bootloader loads the kernel at a non-aligned address and
CONFIG_RELOCATABLE is not set, kernel will ignore the run time
load address and decompress itself to the address it has been
compiled for and run from there. The address for which kernel is
compiled already meets above alignment restrictions. Hence the
end result is that kernel runs from a physical address meeting
above alignment restrictions.
On 32-bit this value must be a multiple of 0x2000. On 64-bit
this value must be a multiple of 0x200000.
Don't change this unless you know what you are doing.
config HOTPLUG_CPU
bool "Support for hot-pluggable CPUs"
depends on SMP
---help---
Say Y here to allow turning CPUs off and on. CPUs can be
controlled through /sys/devices/system/cpu.
( Note: power management support will enable this option
automatically on SMP systems. )
Say N if you want to disable CPU hotplug.
config BOOTPARAM_HOTPLUG_CPU0
bool "Set default setting of cpu0_hotpluggable"
default n
depends on HOTPLUG_CPU
---help---
Set whether default state of cpu0_hotpluggable is on or off.
Say Y here to enable CPU0 hotplug by default. If this switch
is turned on, there is no need to give cpu0_hotplug kernel
parameter and the CPU0 hotplug feature is enabled by default.
Please note: there are two known CPU0 dependencies if you want
to enable the CPU0 hotplug feature either by this switch or by
cpu0_hotplug kernel parameter.
First, resume from hibernate or suspend always starts from CPU0.
So hibernate and suspend are prevented if CPU0 is offline.
Second dependency is PIC interrupts always go to CPU0. CPU0 can not
offline if any interrupt can not migrate out of CPU0. There may
be other CPU0 dependencies.
Please make sure the dependencies are under your control before
you enable this feature.
Say N if you don't want to enable CPU0 hotplug feature by default.
You still can enable the CPU0 hotplug feature at boot by kernel
parameter cpu0_hotplug.
config DEBUG_HOTPLUG_CPU0
def_bool n
prompt "Debug CPU0 hotplug"
depends on HOTPLUG_CPU
---help---
Enabling this option offlines CPU0 (if CPU0 can be offlined) as
soon as possible and boots up userspace with CPU0 offlined. User
can online CPU0 back after boot time.
To debug CPU0 hotplug, you need to enable CPU0 offline/online
feature by either turning on CONFIG_BOOTPARAM_HOTPLUG_CPU0 during
compilation or giving cpu0_hotplug kernel parameter at boot.
If unsure, say N.
config COMPAT_VDSO
def_bool n
prompt "Disable the 32-bit vDSO (needed for glibc 2.3.3)"
depends on X86_32 || IA32_EMULATION
---help---
Certain buggy versions of glibc will crash if they are
presented with a 32-bit vDSO that is not mapped at the address
indicated in its segment table.
The bug was introduced by f866314b89d56845f55e6f365e18b31ec978ec3a
and fixed by 3b3ddb4f7db98ec9e912ccdf54d35df4aa30e04a and
49ad572a70b8aeb91e57483a11dd1b77e31c4468. Glibc 2.3.3 is
the only released version with the bug, but OpenSUSE 9
contains a buggy "glibc 2.3.2".
The symptom of the bug is that everything crashes on startup, saying:
dl_main: Assertion `(void *) ph->p_vaddr == _rtld_local._dl_sysinfo_dso' failed!
Saying Y here changes the default value of the vdso32 boot
option from 1 to 0, which turns off the 32-bit vDSO entirely.
This works around the glibc bug but hurts performance.
If unsure, say N: if you are compiling your own kernel, you
are unlikely to be using a buggy version of glibc.
choice
prompt "vsyscall table for legacy applications"
depends on X86_64
default LEGACY_VSYSCALL_EMULATE
help
Legacy user code that does not know how to find the vDSO expects
to be able to issue three syscalls by calling fixed addresses in
kernel space. Since this location is not randomized with ASLR,
it can be used to assist security vulnerability exploitation.
This setting can be changed at boot time via the kernel command
line parameter vsyscall=[native|emulate|none].
On a system with recent enough glibc (2.14 or newer) and no
static binaries, you can say None without a performance penalty
to improve security.
If unsure, select "Emulate".
config LEGACY_VSYSCALL_NATIVE
bool "Native"
help
Actual executable code is located in the fixed vsyscall
address mapping, implementing time() efficiently. Since
this makes the mapping executable, it can be used during
security vulnerability exploitation (traditionally as
ROP gadgets). This configuration is not recommended.
config LEGACY_VSYSCALL_EMULATE
bool "Emulate"
help
The kernel traps and emulates calls into the fixed
vsyscall address mapping. This makes the mapping
non-executable, but it still contains known contents,
which could be used in certain rare security vulnerability
exploits. This configuration is recommended when userspace
still uses the vsyscall area.
config LEGACY_VSYSCALL_NONE
bool "None"
help
There will be no vsyscall mapping at all. This will
eliminate any risk of ASLR bypass due to the vsyscall
fixed address mapping. Attempts to use the vsyscalls
will be reported to dmesg, so that either old or
malicious userspace programs can be identified.
endchoice
config CMDLINE_BOOL
bool "Built-in kernel command line"
---help---
Allow for specifying boot arguments to the kernel at
build time. On some systems (e.g. embedded ones), it is
necessary or convenient to provide some or all of the
kernel boot arguments with the kernel itself (that is,
to not rely on the boot loader to provide them.)
To compile command line arguments into the kernel,
set this option to 'Y', then fill in the
boot arguments in CONFIG_CMDLINE.
Systems with fully functional boot loaders (i.e. non-embedded)
should leave this option set to 'N'.
config CMDLINE
string "Built-in kernel command string"
depends on CMDLINE_BOOL
default ""
---help---
Enter arguments here that should be compiled into the kernel
image and used at boot time. If the boot loader provides a
command line at boot time, it is appended to this string to
form the full kernel command line, when the system boots.
However, you can use the CONFIG_CMDLINE_OVERRIDE option to
change this behavior.
In most cases, the command line (whether built-in or provided
by the boot loader) should specify the device for the root
file system.
config CMDLINE_OVERRIDE
bool "Built-in command line overrides boot loader arguments"
depends on CMDLINE_BOOL
---help---
Set this option to 'Y' to have the kernel ignore the boot loader
command line, and use ONLY the built-in command line.
This is used to work around broken boot loaders. This should
be set to 'N' under normal conditions.
config MODIFY_LDT_SYSCALL
bool "Enable the LDT (local descriptor table)" if EXPERT
default y
---help---
Linux can allow user programs to install a per-process x86
Local Descriptor Table (LDT) using the modify_ldt(2) system
call. This is required to run 16-bit or segmented code such as
DOSEMU or some Wine programs. It is also used by some very old
threading libraries.
Enabling this feature adds a small amount of overhead to
context switches and increases the low-level kernel attack
surface. Disabling it removes the modify_ldt(2) system call.
Saying 'N' here may make sense for embedded or server kernels.
source "kernel/livepatch/Kconfig"
endmenu
config ARCH_ENABLE_MEMORY_HOTPLUG
def_bool y
depends on X86_64 || (X86_32 && HIGHMEM)
config ARCH_ENABLE_MEMORY_HOTREMOVE
def_bool y
depends on MEMORY_HOTPLUG
config USE_PERCPU_NUMA_NODE_ID
def_bool y
depends on NUMA
config ARCH_ENABLE_SPLIT_PMD_PTLOCK
def_bool y
depends on X86_64 || X86_PAE
config ARCH_ENABLE_HUGEPAGE_MIGRATION
def_bool y
depends on X86_64 && HUGETLB_PAGE && MIGRATION
menu "Power management and ACPI options"
config ARCH_HIBERNATION_HEADER
def_bool y
depends on X86_64 && HIBERNATION
source "kernel/power/Kconfig"
source "drivers/acpi/Kconfig"
source "drivers/sfi/Kconfig"
config X86_APM_BOOT
def_bool y
depends on APM
menuconfig APM
tristate "APM (Advanced Power Management) BIOS support"
depends on X86_32 && PM_SLEEP
---help---
APM is a BIOS specification for saving power using several different
techniques. This is mostly useful for battery powered laptops with
APM compliant BIOSes. If you say Y here, the system time will be
reset after a RESUME operation, the /proc/apm device will provide
battery status information, and user-space programs will receive
notification of APM "events" (e.g. battery status change).
If you select "Y" here, you can disable actual use of the APM
BIOS by passing the "apm=off" option to the kernel at boot time.
Note that the APM support is almost completely disabled for
machines with more than one CPU.
In order to use APM, you will need supporting software. For location
and more information, read <file:Documentation/power/apm-acpi.txt>
and the Battery Powered Linux mini-HOWTO, available from
<http://www.tldp.org/docs.html#howto>.
This driver does not spin down disk drives (see the hdparm(8)
manpage ("man 8 hdparm") for that), and it doesn't turn off
VESA-compliant "green" monitors.
This driver does not support the TI 4000M TravelMate and the ACER
486/DX4/75 because they don't have compliant BIOSes. Many "green"
desktop machines also don't have compliant BIOSes, and this driver
may cause those machines to panic during the boot phase.
Generally, if you don't have a battery in your machine, there isn't
much point in using this driver and you should say N. If you get
random kernel OOPSes or reboots that don't seem to be related to
anything, try disabling/enabling this option (or disabling/enabling
APM in your BIOS).
Some other things you should try when experiencing seemingly random,
"weird" problems:
1) make sure that you have enough swap space and that it is
enabled.
2) pass the "no-hlt" option to the kernel
3) switch on floating point emulation in the kernel and pass
the "no387" option to the kernel
4) pass the "floppy=nodma" option to the kernel
5) pass the "mem=4M" option to the kernel (thereby disabling
all but the first 4 MB of RAM)
6) make sure that the CPU is not over clocked.
7) read the sig11 FAQ at <http://www.bitwizard.nl/sig11/>
8) disable the cache from your BIOS settings
9) install a fan for the video card or exchange video RAM
10) install a better fan for the CPU
11) exchange RAM chips
12) exchange the motherboard.
To compile this driver as a module, choose M here: the
module will be called apm.
if APM
config APM_IGNORE_USER_SUSPEND
bool "Ignore USER SUSPEND"
---help---
This option will ignore USER SUSPEND requests. On machines with a
compliant APM BIOS, you want to say N. However, on the NEC Versa M
series notebooks, it is necessary to say Y because of a BIOS bug.
config APM_DO_ENABLE
bool "Enable PM at boot time"
---help---
Enable APM features at boot time. From page 36 of the APM BIOS
specification: "When disabled, the APM BIOS does not automatically
power manage devices, enter the Standby State, enter the Suspend
State, or take power saving steps in response to CPU Idle calls."
This driver will make CPU Idle calls when Linux is idle (unless this
feature is turned off -- see "Do CPU IDLE calls", below). This
should always save battery power, but more complicated APM features
will be dependent on your BIOS implementation. You may need to turn
this option off if your computer hangs at boot time when using APM
support, or if it beeps continuously instead of suspending. Turn
this off if you have a NEC UltraLite Versa 33/C or a Toshiba
T400CDT. This is off by default since most machines do fine without
this feature.
config APM_CPU_IDLE
depends on CPU_IDLE
bool "Make CPU Idle calls when idle"
---help---
Enable calls to APM CPU Idle/CPU Busy inside the kernel's idle loop.
On some machines, this can activate improved power savings, such as
a slowed CPU clock rate, when the machine is idle. These idle calls
are made after the idle loop has run for some length of time (e.g.,
333 mS). On some machines, this will cause a hang at boot time or
whenever the CPU becomes idle. (On machines with more than one CPU,
this option does nothing.)
config APM_DISPLAY_BLANK
bool "Enable console blanking using APM"
---help---
Enable console blanking using the APM. Some laptops can use this to
turn off the LCD backlight when the screen blanker of the Linux
virtual console blanks the screen. Note that this is only used by
the virtual console screen blanker, and won't turn off the backlight
when using the X Window system. This also doesn't have anything to
do with your VESA-compliant power-saving monitor. Further, this
option doesn't work for all laptops -- it might not turn off your
backlight at all, or it might print a lot of errors to the console,
especially if you are using gpm.
config APM_ALLOW_INTS
bool "Allow interrupts during APM BIOS calls"
---help---
Normally we disable external interrupts while we are making calls to
the APM BIOS as a measure to lessen the effects of a badly behaving
BIOS implementation. The BIOS should reenable interrupts if it
needs to. Unfortunately, some BIOSes do not -- especially those in
many of the newer IBM Thinkpads. If you experience hangs when you
suspend, try setting this to Y. Otherwise, say N.
endif # APM
source "drivers/cpufreq/Kconfig"
source "drivers/cpuidle/Kconfig"
source "drivers/idle/Kconfig"
endmenu
menu "Bus options (PCI etc.)"
config PCI
bool "PCI support"
default y
---help---
Find out whether you have a PCI motherboard. PCI is the name of a
bus system, i.e. the way the CPU talks to the other stuff inside
your box. Other bus systems are ISA, EISA, MicroChannel (MCA) or
VESA. If you have PCI, say Y, otherwise N.
choice
prompt "PCI access mode"
depends on X86_32 && PCI
default PCI_GOANY
---help---
On PCI systems, the BIOS can be used to detect the PCI devices and
determine their configuration. However, some old PCI motherboards
have BIOS bugs and may crash if this is done. Also, some embedded
PCI-based systems don't have any BIOS at all. Linux can also try to
detect the PCI hardware directly without using the BIOS.
With this option, you can specify how Linux should detect the
PCI devices. If you choose "BIOS", the BIOS will be used,
if you choose "Direct", the BIOS won't be used, and if you
choose "MMConfig", then PCI Express MMCONFIG will be used.
If you choose "Any", the kernel will try MMCONFIG, then the
direct access method and falls back to the BIOS if that doesn't
work. If unsure, go with the default, which is "Any".
config PCI_GOBIOS
bool "BIOS"
config PCI_GOMMCONFIG
bool "MMConfig"
config PCI_GODIRECT
bool "Direct"
config PCI_GOOLPC
bool "OLPC XO-1"
depends on OLPC
config PCI_GOANY
bool "Any"
endchoice
config PCI_BIOS
def_bool y
depends on X86_32 && PCI && (PCI_GOBIOS || PCI_GOANY)
# x86-64 doesn't support PCI BIOS access from long mode so always go direct.
config PCI_DIRECT
def_bool y
depends on PCI && (X86_64 || (PCI_GODIRECT || PCI_GOANY || PCI_GOOLPC || PCI_GOMMCONFIG))
config PCI_MMCONFIG
def_bool y
depends on X86_32 && PCI && (ACPI || SFI) && (PCI_GOMMCONFIG || PCI_GOANY)
config PCI_OLPC
def_bool y
depends on PCI && OLPC && (PCI_GOOLPC || PCI_GOANY)
config PCI_XEN
def_bool y
depends on PCI && XEN
select SWIOTLB_XEN
config PCI_DOMAINS
def_bool y
depends on PCI
config PCI_MMCONFIG
bool "Support mmconfig PCI config space access"
depends on X86_64 && PCI && ACPI
config PCI_CNB20LE_QUIRK
bool "Read CNB20LE Host Bridge Windows" if EXPERT
depends on PCI
help
Read the PCI windows out of the CNB20LE host bridge. This allows
PCI hotplug to work on systems with the CNB20LE chipset which do
not have ACPI.
There's no public spec for this chipset, and this functionality
is known to be incomplete.
You should say N unless you know you need this.
source "drivers/pci/Kconfig"
# x86_64 have no ISA slots, but can have ISA-style DMA.
config ISA_DMA_API
bool "ISA-style DMA support" if (X86_64 && EXPERT)
default y
help
Enables ISA-style DMA support for devices requiring such controllers.
If unsure, say Y.
if X86_32
config ISA
bool "ISA support"
---help---
Find out whether you have ISA slots on your motherboard. ISA is the
name of a bus system, i.e. the way the CPU talks to the other stuff
inside your box. Other bus systems are PCI, EISA, MicroChannel
(MCA) or VESA. ISA is an older system, now being displaced by PCI;
newer boards don't support it. If you have ISA, say Y, otherwise N.
config EISA
bool "EISA support"
depends on ISA
---help---
The Extended Industry Standard Architecture (EISA) bus was
developed as an open alternative to the IBM MicroChannel bus.
The EISA bus provided some of the features of the IBM MicroChannel
bus while maintaining backward compatibility with cards made for
the older ISA bus. The EISA bus saw limited use between 1988 and
1995 when it was made obsolete by the PCI bus.
Say Y here if you are building a kernel for an EISA-based machine.
Otherwise, say N.
source "drivers/eisa/Kconfig"
config SCx200
tristate "NatSemi SCx200 support"
---help---
This provides basic support for National Semiconductor's
(now AMD's) Geode processors. The driver probes for the
PCI-IDs of several on-chip devices, so its a good dependency
for other scx200_* drivers.
If compiled as a module, the driver is named scx200.
config SCx200HR_TIMER
tristate "NatSemi SCx200 27MHz High-Resolution Timer Support"
depends on SCx200
default y
---help---
This driver provides a clocksource built upon the on-chip
27MHz high-resolution timer. Its also a workaround for
NSC Geode SC-1100's buggy TSC, which loses time when the
processor goes idle (as is done by the scheduler). The
other workaround is idle=poll boot option.
config OLPC
bool "One Laptop Per Child support"
depends on !X86_PAE
select GPIOLIB
select OF
select OF_PROMTREE
select IRQ_DOMAIN
---help---
Add support for detecting the unique features of the OLPC
XO hardware.
config OLPC_XO1_PM
bool "OLPC XO-1 Power Management"
depends on OLPC && MFD_CS5535 && PM_SLEEP
select MFD_CORE
---help---
Add support for poweroff and suspend of the OLPC XO-1 laptop.
config OLPC_XO1_RTC
bool "OLPC XO-1 Real Time Clock"
depends on OLPC_XO1_PM && RTC_DRV_CMOS
---help---
Add support for the XO-1 real time clock, which can be used as a
programmable wakeup source.
config OLPC_XO1_SCI
bool "OLPC XO-1 SCI extras"
depends on OLPC && OLPC_XO1_PM
depends on INPUT=y
select POWER_SUPPLY
select GPIO_CS5535
select MFD_CORE
---help---
Add support for SCI-based features of the OLPC XO-1 laptop:
- EC-driven system wakeups
- Power button
- Ebook switch
- Lid switch
- AC adapter status updates
- Battery status updates
config OLPC_XO15_SCI
bool "OLPC XO-1.5 SCI extras"
depends on OLPC && ACPI
select POWER_SUPPLY
---help---
Add support for SCI-based features of the OLPC XO-1.5 laptop:
- EC-driven system wakeups
- AC adapter status updates
- Battery status updates
config ALIX
bool "PCEngines ALIX System Support (LED setup)"
select GPIOLIB
---help---
This option enables system support for the PCEngines ALIX.
At present this just sets up LEDs for GPIO control on
ALIX2/3/6 boards. However, other system specific setup should
get added here.
Note: You must still enable the drivers for GPIO and LED support
(GPIO_CS5535 & LEDS_GPIO) to actually use the LEDs
Note: You have to set alix.force=1 for boards with Award BIOS.
config NET5501
bool "Soekris Engineering net5501 System Support (LEDS, GPIO, etc)"
select GPIOLIB
---help---
This option enables system support for the Soekris Engineering net5501.
config GEOS
bool "Traverse Technologies GEOS System Support (LEDS, GPIO, etc)"
select GPIOLIB
depends on DMI
---help---
This option enables system support for the Traverse Technologies GEOS.
config TS5500
bool "Technologic Systems TS-5500 platform support"
depends on MELAN
select CHECK_SIGNATURE
select NEW_LEDS
select LEDS_CLASS
---help---
This option enables system support for the Technologic Systems TS-5500.
endif # X86_32
config AMD_NB
def_bool y
depends on CPU_SUP_AMD && PCI
source "drivers/pcmcia/Kconfig"
config RAPIDIO
tristate "RapidIO support"
depends on PCI
default n
help
If enabled this option will include drivers and the core
infrastructure code to support RapidIO interconnect devices.
source "drivers/rapidio/Kconfig"
x86: provide platform-devices for boot-framebuffers The current situation regarding boot-framebuffers (VGA, VESA/VBE, EFI) on x86 causes troubles when loading multiple fbdev drivers. The global "struct screen_info" does not provide any state-tracking about which drivers use the FBs. request_mem_region() theoretically works, but unfortunately vesafb/efifb ignore it due to quirks for broken boards. Avoid this by creating a platform framebuffer devices with a pointer to the "struct screen_info" as platform-data. Drivers can now create platform-drivers and the driver-core will refuse multiple drivers being active simultaneously. We keep the screen_info available for backwards-compatibility. Drivers can be converted in follow-up patches. Different devices are created for VGA/VESA/EFI FBs to allow multiple drivers to be loaded on distro kernels. We create: - "vesa-framebuffer" for VBE/VESA graphics FBs - "efi-framebuffer" for EFI FBs - "platform-framebuffer" for everything else This allows to load vesafb, efifb and others simultaneously and each picks up only the supported FB types. Apart from platform-framebuffer devices, this also introduces a compatibility option for "simple-framebuffer" drivers which recently got introduced for OF based systems. If CONFIG_X86_SYSFB is selected, we try to match the screen_info against a simple-framebuffer supported format. If we succeed, we create a "simple-framebuffer" device instead of a platform-framebuffer. This allows to reuse the simplefb.c driver across architectures and also to introduce a SimpleDRM driver. There is no need to have vesafb.c, efifb.c, simplefb.c and more just to have architecture specific quirks in their setup-routines. Instead, we now move the architecture specific quirks into x86-setup and provide a generic simple-framebuffer. For backwards-compatibility (if strange formats are used), we still allow vesafb/efifb to be loaded simultaneously and pick up all remaining devices. Signed-off-by: David Herrmann <dh.herrmann@gmail.com> Link: http://lkml.kernel.org/r/1375445127-15480-4-git-send-email-dh.herrmann@gmail.com Tested-by: Stephen Warren <swarren@nvidia.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-08-02 20:05:22 +08:00
config X86_SYSFB
bool "Mark VGA/VBE/EFI FB as generic system framebuffer"
help
Firmwares often provide initial graphics framebuffers so the BIOS,
bootloader or kernel can show basic video-output during boot for
user-guidance and debugging. Historically, x86 used the VESA BIOS
Extensions and EFI-framebuffers for this, which are mostly limited
to x86.
This option, if enabled, marks VGA/VBE/EFI framebuffers as generic
framebuffers so the new generic system-framebuffer drivers can be
used on x86. If the framebuffer is not compatible with the generic
modes, it is adverticed as fallback platform framebuffer so legacy
drivers like efifb, vesafb and uvesafb can pick it up.
If this option is not selected, all system framebuffers are always
marked as fallback platform framebuffers as usual.
Note: Legacy fbdev drivers, including vesafb, efifb, uvesafb, will
not be able to pick up generic system framebuffers if this option
is selected. You are highly encouraged to enable simplefb as
replacement if you select this option. simplefb can correctly deal
with generic system framebuffers. But you should still keep vesafb
and others enabled as fallback if a system framebuffer is
incompatible with simplefb.
If unsure, say Y.
endmenu
menu "Executable file formats / Emulations"
source "fs/Kconfig.binfmt"
config IA32_EMULATION
bool "IA32 Emulation"
depends on X86_64
select BINFMT_ELF
select COMPAT_BINFMT_ELF
select ARCH_WANT_OLD_COMPAT_IPC
---help---
Include code to run legacy 32-bit programs under a
64-bit kernel. You should likely turn this on, unless you're
100% sure that you don't have any 32-bit programs left.
config IA32_AOUT
tristate "IA32 a.out support"
depends on IA32_EMULATION
---help---
Support old a.out binaries in the 32bit emulation.
config X86_X32
bool "x32 ABI for 64-bit mode"
depends on X86_64
---help---
Include code to run binaries for the x32 native 32-bit ABI
for 64-bit processors. An x32 process gets access to the
full 64-bit register file and wide data path while leaving
pointers at 32 bits for smaller memory footprint.
You will need a recent binutils (2.22 or later) with
elf32_x86_64 support enabled to compile a kernel with this
option set.
config COMPAT
def_bool y
depends on IA32_EMULATION || X86_X32
if COMPAT
config COMPAT_FOR_U64_ALIGNMENT
def_bool y
config SYSVIPC_COMPAT
def_bool y
depends on SYSVIPC
config KEYS_COMPAT
def_bool y
depends on KEYS
endif
endmenu
config HAVE_ATOMIC_IOMAP
def_bool y
depends on X86_32
config X86_DEV_DMA_OPS
bool
depends on X86_64 || STA2X11
config X86_DMA_REMAP
bool
depends on STA2X11
config PMC_ATOM
def_bool y
depends on PCI
x86/PCI: Add driver for Intel Volume Management Device (VMD) The Intel Volume Management Device (VMD) is a Root Complex Integrated Endpoint that acts as a host bridge to a secondary PCIe domain. BIOS can reassign one or more Root Ports to appear within a VMD domain instead of the primary domain. The immediate benefit is that additional PCIe domains allow more than 256 buses in a system by letting bus numbers be reused across different domains. VMD domains do not define ACPI _SEG, so to avoid domain clashing with host bridges defining this segment, VMD domains start at 0x10000, which is greater than the highest possible 16-bit ACPI defined _SEG. This driver enumerates and enables the domain using the root bus configuration interface provided by the PCI subsystem. The driver provides configuration space accessor functions (pci_ops), bus and memory resources, an MSI IRQ domain with irq_chip implementation, and DMA operations necessary to use devices through the VMD endpoint's interface. VMD routes I/O as follows: 1) Configuration Space: BAR 0 ("CFGBAR") of VMD provides the base address and size for configuration space register access to VMD-owned root ports. It works similarly to MMCONFIG for extended configuration space. Bus numbering is independent and does not conflict with the primary domain. 2) MMIO Space: BARs 2 and 4 ("MEMBAR1" and "MEMBAR2") of VMD provide the base address, size, and type for MMIO register access. These addresses are not translated by VMD hardware; they are simply reservations to be distributed to root ports' memory base/limit registers and subdivided among devices downstream. 3) DMA: To interact appropriately with an IOMMU, the source ID DMA read and write requests are translated to the bus-device-function of the VMD endpoint. Otherwise, DMA operates normally without VMD-specific address translation. 4) Interrupts: Part of VMD's BAR 4 is reserved for VMD's MSI-X Table and PBA. MSIs from VMD domain devices and ports are remapped to appear as if they were issued using one of VMD's MSI-X table entries. Each MSI and MSI-X address of VMD-owned devices and ports has a special format where the address refers to specific entries in the VMD's MSI-X table. As with DMA, the interrupt source ID is translated to VMD's bus-device-function. The driver provides its own MSI and MSI-X configuration functions specific to how MSI messages are used within the VMD domain, and provides an irq_chip for independent IRQ allocation to relay interrupts from VMD's interrupt handler to the appropriate device driver's handler. 5) Errors: PCIe error message are intercepted by the root ports normally (e.g., AER), except with VMD, system errors (i.e., firmware first) are disabled by default. AER and hotplug interrupts are translated in the same way as endpoint interrupts. 6) VMD does not support INTx interrupts or IO ports. Devices or drivers requiring these features should either not be placed below VMD-owned root ports, or VMD should be disabled by BIOS for such endpoints. [bhelgaas: add VMD BAR #defines, factor out vmd_cfg_addr(), rework VMD resource setup, whitespace, changelog] Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> (IRQ-related parts)
2016-01-13 04:18:10 +08:00
config VMD
depends on PCI_MSI
tristate "Volume Management Device Driver"
default N
---help---
Adds support for the Intel Volume Management Device (VMD). VMD is a
secondary PCI host bridge that allows PCI Express root ports,
and devices attached to them, to be removed from the default
PCI domain and placed within the VMD domain. This provides
more bus resources than are otherwise possible with a
single domain. If you know your system provides one of these and
has devices attached to it, say Y; if you are not sure, say N.
source "net/Kconfig"
source "drivers/Kconfig"
source "drivers/firmware/Kconfig"
source "fs/Kconfig"
source "arch/x86/Kconfig.debug"
source "security/Kconfig"
source "crypto/Kconfig"
source "arch/x86/kvm/Kconfig"
source "lib/Kconfig"