In order to be able to debug things like the X server and programs using
the PPC Cell SPUs, the debugger needs to be able to access device memory
through ptrace and /proc/pid/mem.
This patch:
Add the generic_access_phys access function and put the hooks in place
to allow access_process_vm to access device or PPC Cell SPU memory.
[riel@redhat.com: Add documentation for the vm_ops->access function]
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Benjamin Herrensmidt <benh@kernel.crashing.org>
Cc: Dave Airlie <airlied@linux.ie>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are no users of nopfn in the tree. Remove it.
[hugh@veritas.com: fix build error]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
generic_file_direct_IO is a common helper around the invocation of
->direct_IO. But there's almost nothing shared between the read and write
side, so we're better off without this helper.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It's confusing that set_max_huge_pages() contained two different
variables named "ret", and although the code works correctly this should
be fixed.
The inner of the two variables can simply be removed.
Spotted by sparse.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Cc: "KOSAKI Motohiro" <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds proper extern declarations for five variables in
include/linux/vmstat.h
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Every file should include the headers containing the externs for its
global functions (in this case for sys_move_pages()).
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Two zonelist patch series rewrote __page_alloc() largely. Now, it is just
a wrapper function. Inlining them will save a function call.
[akpm@linux-foundation.org: export __alloc_pages_internal]
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This function has no external callers, so unexport it. Also fix its naming
inconsistency.
Signed-off-by: Johannes Weiner <hannes@saeurebad.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All _core functions only need the bootmem data, not the whole node descriptor.
Adjust the two functions that take the node descriptor unneededly.
Signed-off-by: Johannes Weiner <hannes@saeurebad.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The check for node_boot_start is bogus because we start freeing at the
corresponding pfn. So check if the pfn is properly aligned instead in a more
readable way and adjust the documentation.
Also remove an unneeded accounting variable.
Signed-off-by: Johannes Weiner <hannes@saeurebad.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are a lot of places that define either a single bootmem descriptor or an
array of them. Use only one central array with MAX_NUMNODES items instead.
Signed-off-by: Johannes Weiner <hannes@saeurebad.de>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Kyle McMartin <kyle@parisc-linux.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch prints out the zonelists during boot for manual verification by the
user if the mminit_loglevel is MMINIT_VERIFY or higher.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are a number of different views to how much memory is currently active.
There is the arch-independent zone-sizing view, the bootmem allocator and
memory models view.
Architectures register this information at different times and is not
necessarily in sync particularly with respect to some SPARSEMEM limitations.
This patch introduces mminit_validate_memmodel_limits() which is able to
validate and correct PFN ranges with respect to the memory model. It is only
SPARSEMEM that currently validates itself.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Print out information on how the page flags are being used if mminit_loglevel
is MMINIT_VERIFY or higher and unconditionally performs sanity checks on the
flags regardless of loglevel.
When the page flags are updated with section, node and zone information, a
check are made to ensure the values can be retrieved correctly. Finally we
confirm that pfn_to_page and page_to_pfn are the correct inverse functions.
[akpm@linux-foundation.org: fix printk warnings]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Boot initialisation is very complex, with significant numbers of
architecture-specific routines, hooks and code ordering. While significant
amounts of the initialisation is architecture-independent, it trusts the data
received from the architecture layer. This is a mistake, and has resulted in
a number of difficult-to-diagnose bugs.
This patchset adds some validation and tracing to memory initialisation. It
also introduces a few basic defensive measures. The validation code can be
explicitly disabled for embedded systems.
This patch:
Add additional debugging and verification code for memory initialisation.
Once enabled, the verification checks are always run and when required
additional debugging information may be outputted via a mminit_loglevel=
command-line parameter.
The verification code is placed in a new file mm/mm_init.c. Ideally other mm
initialisation code will be moved here over time.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'cpus4096-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (31 commits)
NR_CPUS: Replace NR_CPUS in speedstep-centrino.c
cpumask: Provide a generic set of CPUMASK_ALLOC macros, FIXUP
NR_CPUS: Replace NR_CPUS in cpufreq userspace routines
NR_CPUS: Replace per_cpu(..., smp_processor_id()) with __get_cpu_var
NR_CPUS: Replace NR_CPUS in arch/x86/kernel/genapic_flat_64.c
NR_CPUS: Replace NR_CPUS in arch/x86/kernel/genx2apic_uv_x.c
NR_CPUS: Replace NR_CPUS in arch/x86/kernel/cpu/proc.c
NR_CPUS: Replace NR_CPUS in arch/x86/kernel/cpu/mcheck/mce_64.c
cpumask: Optimize cpumask_of_cpu in lib/smp_processor_id.c, fix
cpumask: Use optimized CPUMASK_ALLOC macros in the centrino_target
cpumask: Provide a generic set of CPUMASK_ALLOC macros
cpumask: Optimize cpumask_of_cpu in lib/smp_processor_id.c
cpumask: Optimize cpumask_of_cpu in kernel/time/tick-common.c
cpumask: Optimize cpumask_of_cpu in drivers/misc/sgi-xp/xpc_main.c
cpumask: Optimize cpumask_of_cpu in arch/x86/kernel/ldt.c
cpumask: Optimize cpumask_of_cpu in arch/x86/kernel/io_apic_64.c
cpumask: Replace cpumask_of_cpu with cpumask_of_cpu_ptr
Revert "cpumask: introduce new APIs"
cpumask: make for_each_cpu_mask a bit smaller
net: Pass reference to cpumask variable in net/sunrpc/svc.c
...
Fix up trivial conflicts in drivers/cpufreq/cpufreq.c manually
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
slub: dump more data on slab corruption
SLUB: simplify re on_each_cpu()
Hash et al. sizing code in SCTP wants to make the
calculation totalram_pages - totalhigh_pages, just
like TCP. But this requires an export for the
CONFIG_HIGHMEM case to work.
Signed-off-by: David S. Miller <davem@davemloft.net>
The limit of 128 bytes is too small when debugging slab corruption of the skb
cache, for example. So increase the limit to PAGE_SIZE to make debugging
corruptions easier.
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
on_each_cpu() expands to function call on UP, too.
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
slab: rename slab_destroy_objs
slub: current is always valid
slub: Add check for kfree() of non slab objects.
With the removal of destructors, slab_destroy_objs no longer actually
destroys any objects, making the kernel doc incorrect and the function
name misleading.
In keeping with the other debug functions, rename it to
slab_destroy_debugcheck and drop the kernel doc.
Signed-off-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
We can detect kfree()s on non slab objects by checking for PageCompound().
Works in the same way as for ksize. This helped me catch an invalid
kfree().
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (61 commits)
ext4: Documention update for new ordered mode and delayed allocation
ext4: do not set extents feature from the kernel
ext4: Don't allow nonextenst mount option for large filesystem
ext4: Enable delalloc by default.
ext4: delayed allocation i_blocks fix for stat
ext4: fix delalloc i_disksize early update issue
ext4: Handle page without buffers in ext4_*_writepage()
ext4: Add ordered mode support for delalloc
ext4: Invert lock ordering of page_lock and transaction start in delalloc
mm: Add range_cont mode for writeback
ext4: delayed allocation ENOSPC handling
percpu_counter: new function percpu_counter_sum_and_set
ext4: Add delayed allocation support in data=writeback mode
vfs: add hooks for ext4's delayed allocation support
jbd2: Remove data=ordered mode support using jbd buffer heads
ext4: Use new framework for data=ordered mode in JBD2
jbd2: Implement data=ordered mode handling via inodes
vfs: export filemap_fdatawrite_range()
ext4: Fix lock inversion in ext4_ext_truncate()
ext4: Invert the locking order of page_lock and transaction start
...
* 'tracing/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (228 commits)
ftrace: build fix for ftraced_suspend
ftrace: separate out the function enabled variable
ftrace: add ftrace_kill_atomic
ftrace: use current CPU for function startup
ftrace: start wakeup tracing after setting function tracer
ftrace: check proper config for preempt type
ftrace: trace schedule
ftrace: define function trace nop
ftrace: move sched_switch enable after markers
ftrace: prevent ftrace modifications while being kprobe'd, v2
fix "ftrace: store mcount address in rec->ip"
mmiotrace broken in linux-next (8-bit writes only)
ftrace: avoid modifying kprobe'd records
ftrace: freeze kprobe'd records
kprobes: enable clean usage of get_kprobe
ftrace: store mcount address in rec->ip
ftrace: build fix with gcc 4.3
namespacecheck: fixes
ftrace: fix "notrace" filtering priority
ftrace: fix printout
...
* git://git.kernel.org/pub/scm/linux/kernel/git/hskinnemoen/avr32-2.6: (31 commits)
avr32: Fix typo of IFSR in a comment in the PIO header file
avr32: Power Management support ("standby" and "mem" modes)
avr32: Add system device for the internal interrupt controller (intc)
avr32: Add simple SRAM allocator
avr32: Enable SDRAMC clock at startup
rtc-at32ap700x: Enable wakeup
macb: Basic suspend/resume support
atmel_serial: Drain console TX shifter before suspending
atmel_serial: Fix build on avr32 with CONFIG_PM enabled
avr32: Use a quicklist for PTE allocation as well
avr32: Use a quicklist for PGD allocation
avr32: Cover the kernel page tables in the user PGDs
avr32: Store virtual addresses in the PGD
avr32: Remove useless zeroing of swapper_pg_dir at startup
avr32: Clean up and optimize the TLB operations
avr32: Rename at32ap.c -> pdc.c
avr32: Move setup_platform() into chip-specific file
avr32: Kill special exception handler sections
avr32: Kill unneeded #include <asm/pgalloc.h> from asm/mmu_context.h
avr32: Clean up time.c #includes
...
Filesystems like ext4 needs to start a new transaction in
the writepages for block allocation. This happens with delayed
allocation and there is limit to how many credits we can request
from the journal layer. So we call write_cache_pages multiple
times with wbc->nr_to_write set to the maximum possible value
limitted by the max journal credits available.
Add a new mode to writeback that enables us to handle this
behaviour. In the new mode we update the wbc->range_start
to point to the new offset to be written. Next call to
call to write_cache_pages will start writeout from specified
range_start offset. In the new mode we also limit writing
to the specified wbc->range_end.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Make filemap_fdatawrite_range() function public, so that it can later
be used in ordered mode rewrite by JBD/JBD2.
Signed-off-by: Jan Kara <jack@suse.cz>
Vegard Nossum reported a crash in kmem_cache_alloc():
BUG: unable to handle kernel paging request at da87d000
IP: [<c01991c7>] kmem_cache_alloc+0xc7/0xe0
*pde = 28180163 *pte = 1a87d160
Oops: 0002 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Pid: 3850, comm: grep Not tainted (2.6.26-rc9-00059-gb190333 #5)
EIP: 0060:[<c01991c7>] EFLAGS: 00210203 CPU: 0
EIP is at kmem_cache_alloc+0xc7/0xe0
EAX: 00000000 EBX: da87c100 ECX: 1adad71a EDX: 6b6b6b6b
ESI: 00200282 EDI: da87d000 EBP: f60bfe74 ESP: f60bfe54
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
and analyzed it:
"The register %ecx looks innocent but is very important here. The disassembly:
mov %edx,%ecx
shr $0x2,%ecx
rep stos %eax,%es:(%edi) <-- the fault
So %ecx has been loaded from %edx... which is 0x6b6b6b6b/POISON_FREE.
(0x6b6b6b6b >> 2 == 0x1adadada.)
%ecx is the counter for the memset, from here:
memset(object, 0, c->objsize);
i.e. %ecx was loaded from c->objsize, so "c" must have been freed.
Where did "c" come from? Uh-oh...
c = get_cpu_slab(s, smp_processor_id());
This looks like it has very much to do with CPU hotplug/unplug. Is
there a race between SLUB/hotplug since the CPU slab is used after it
has been freed?"
Good analysis.
Yeah, it's possible that a caller of kmem_cache_alloc() -> slab_alloc()
can be migrated on another CPU right after local_irq_restore() and
before memset(). The inital cpu can become offline in the mean time (or
a migration is a consequence of the CPU going offline) so its
'kmem_cache_cpu' structure gets freed ( slab_cpuup_callback).
At some point of time the caller continues on another CPU having an
obsolete pointer...
Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Reported-by: Vegard Nossum <vegard.nossum@gmail.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch allows architectures to define functions to deal with
additional protections bits for mmap() and mprotect().
arch_calc_vm_prot_bits() maps additonal protection bits to vm_flags
arch_vm_get_page_prot() maps additional vm_flags to the vma's vm_page_prot
arch_validate_prot() checks for valid values of the protection bits
Note: vm_get_page_prot() is now pretty ugly, but the generated code
should be identical for architectures that don't define additional
protection bits.
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Fix some problems with (and applies on top of) a previous patch:
x86 boot: show pfn addresses in hex not decimal in some kernel info printks
Primarily change "0x%8lx" format, which displays with a right aligned
space filled hex number (spaces between the "0x" prefix and the number),
into "%0#10lx" format, which zero fills instead of space fills, and
which uses the printf flag '#' to request the "0x" prefix instead of
hard coding it.
Also replace some other "0x%lx" formats with "%#lx", making use of the
'#' printf flag again.
Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: "Yinghai Lu" <yhlu.kernel@gmail.com>
Cc: "Jack Steiner" <steiner@sgi.com>
Cc: "Mike Travis" <travis@sgi.com>
Cc: "Huang
Cc: Ying" <ying.huang@intel.com>
Cc: "Andi Kleen" <andi@firstfloor.org>
Cc: "Andrew Morton" <akpm@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Everywhere I look, node id's are of type 'int', except in this one
case, which has 'unsigned long'. Change this one to 'int' as well.
There is nothing special about the way this variable 'nid' is used in
this routine to justify using an unusual type here.
Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: "Yinghai Lu" <yhlu.kernel@gmail.com>
Cc: "Jack Steiner" <steiner@sgi.com>
Cc: "Mike Travis" <travis@sgi.com>
Cc: "Huang
Cc: Ying" <ying.huang@intel.com>
Cc: "Andi Kleen" <andi@firstfloor.org>
Cc: "Andrew Morton" <akpm@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Page frame numbers (the portion of physical addresses above the low
order page offsets) are displayed in several kernel debug and info
prints in decimal, not hex. Decimal addresse are unreadable. Use hex.
Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: "Yinghai Lu" <yhlu.kernel@gmail.com>
Cc: "Jack Steiner" <steiner@sgi.com>
Cc: "Mike Travis" <travis@sgi.com>
Cc: "Huang
Cc: Ying" <ying.huang@intel.com>
Cc: "Andi Kleen" <andi@firstfloor.org>
Cc: "Andrew Morton" <akpm@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
want to remove arch_get_ram_range, and use early_node_map instead.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
use early_node_map to init high pages, so we can remove page_is_ram() and
page_is_reserved_early() in the big loop with add_one_highpage
also remove page_is_reserved_early(), it is not needed anymore.
v2: fix the build of other platforms
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
in case we have kva before ramdisk on a node, we still need to use
those ranges.
v2: reserve_early kva ram area, in case there are holes in highmem, to avoid
those area could be treat as free high pages.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Flags considered internal to the mempolicy kernel code are stored as part
of the "flags" member of struct mempolicy.
Before exposing a policy type to userspace via get_mempolicy(), these
internal flags must be masked. Flags exposed to userspace, however,
should still be returned to the user.
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
get_user_pages() must not return the error when i != 0. When pages !=
NULL we have i get_page()'ed pages.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dirty page accounting accurately measures the amound of dirty pages in
writable shared mappings by mapping the pages RO (as indicated by
vma_wants_writenotify). We then trap on first write and call
set_page_dirty() on the page, after which we map the page RW and
continue execution.
When we launder dirty pages, we call clear_page_dirty_for_io() which
clears both the dirty flag, and maps the page RO again before we start
writeout so that the story can repeat itself.
vma_wants_writenotify() excludes VM_PFNMAP on the basis that we cannot
do the regular dirty page stuff on raw PFNs and the memory isn't going
anywhere anyway.
The recently introduced VM_MIXEDMAP mixes both !pfn_valid() and
pfn_valid() pages in a single mapping.
We can't do dirty page accounting on !pfn_valid() pages as stated
above, and mapping them RO causes them to be COW'ed on write, which
breaks VM_SHARED semantics.
Excluding VM_MIXEDMAP in vma_wants_writenotify() would mean we don't do
the regular dirty page accounting for the pfn_valid() pages, which
would bring back all the head-aches from inaccurate dirty page
accounting.
So instead, we let the !pfn_valid() pages get mapped RO, but fix them
up unconditionally in the fault path.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Hugh Dickins <hugh@veritas.com>
Cc: "Jared Hulbert" <jaredeh@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove all clameter@sgi.com addresses from the kernel tree since they will
become invalid on June 27th. Change my maintainer email address for the
slab allocators to cl@linux-foundation.org (which will be the new email
address for the future).
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
slub: Do not use 192 byte sized cache if minimum alignment is 128 byte
The non-NUMA case of build_zonelist_cache() would initialize the
zlcache_ptr for both node_zonelists[] to NULL.
Which is problematic, since non-NUMA only has a single node_zonelists[]
entry, and trying to zero the non-existent second one just overwrote the
nr_zones field instead.
As kswapd uses this value to determine what reclaim work is necessary,
the result is that kswapd never reclaims. This causes processes to
stall frequently in low-memory situations as they always direct reclaim.
This patch initialises zlcache_ptr correctly.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Tested-by: Dan Williams <dan.j.williams@intel.com>
[ Simplified patch a bit ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The 192 byte cache is not necessary if we have a basic alignment of 128
byte. If it would be used then the 192 would be aligned to the next 128 byte
boundary which would result in another 256 byte cache. Two 256 kmalloc caches
cause sysfs to complain about a duplicate entry.
MIPS needs 128 byte aligned kmalloc caches and spits out warnings on boot without
this patch.
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Using a quicklist to allocate PTEs might be slightly faster than using
the page allocator directly since we might avoid zeroing the page
after each allocation.
Signed-off-by: Haavard Skinnemoen <haavard.skinnemoen@atmel.com>
It's not even passed on to smp_call_function() anymore, since that
was removed. So kill it.
Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch adds an API for doing read-modify-write updates to a pte's
protection bits which may race against hardware updates to the pte.
After reading the pte, the hardware may asynchonously set the accessed
or dirty bits on a pte, which would be lost when writing back the
modified pte value.
The existing technique to handle this race is to use
ptep_get_and_clear() atomically fetch the old pte value and clear it
in memory. This has the effect of marking the pte as non-present,
which will prevent the hardware from updating its state. When the new
value is written back, the pte will be present again, and the hardware
can resume updating the access/dirty flags.
When running in a virtualized environment, pagetable updates are
relatively expensive, since they generally involve some trap into the
hypervisor. To mitigate the cost of these updates, we tend to batch
them.
However, because of the atomic nature of ptep_get_and_clear(), it is
inherently non-batchable. This new interface allows batching by
giving the underlying implementation enough information to open a
transaction between the read and write phases:
ptep_modify_prot_start() returns the current pte value, and puts the
pte entry into a state where either the hardware will not update the
pte, or if it does, the updates will be preserved on commit.
ptep_modify_prot_commit() writes back the updated pte, makes sure that
any hardware updates made since ptep_modify_prot_start() are
preserved.
ptep_modify_prot_start() and _commit() must be exactly paired, and
used while holding the appropriate pte lock. They do not protect
against other software updates of the pte in any way.
The current implementations of ptep_modify_prot_start and _commit are
functionally unchanged from before: _start() uses ptep_get_and_clear()
fetch the pte and zero the entry, preventing any hardware updates.
_commit() simply writes the new pte value back knowing that the
hardware has not updated the pte in the meantime.
The only current user of this interface is mprotect
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
There is a race in the COW logic. It contains a shortcut to avoid the
COW and reuse the page if we have the sole reference on the page,
however it is possible to have two racing do_wp_page()ers with one
causing the other to mistakenly believe it is safe to take the shortcut
when it is not. This could lead to data corruption.
Process 1 and process2 each have a wp pte of the same anon page (ie.
one forked the other). The page's mapcount is 2. Then they both
attempt to write to it around the same time...
proc1 proc2 thr1 proc2 thr2
CPU0 CPU1 CPU3
do_wp_page() do_wp_page()
trylock_page()
can_share_swap_page()
load page mapcount (==2)
reuse = 0
pte unlock
copy page to new_page
pte lock
page_remove_rmap(page);
trylock_page()
can_share_swap_page()
load page mapcount (==1)
reuse = 1
ptep_set_access_flags (allow W)
write private key into page
read from page
ptep_clear_flush()
set_pte_at(pte of new_page)
Fix this by moving the page_remove_rmap of the old page after the pte
clear and flush. Potentially the entire branch could be moved down
here, but in order to stay consistent, I won't (should probably move all
the *_mm_counter stuff with one patch).
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Hugh Dickins <hugh@veritas.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 89f5b7da2a ("Reinstate ZERO_PAGE
optimization in 'get_user_pages()' and fix XIP") broke vmware, as
reported by Jeff Chua:
"This broke vmware 6.0.4.
Jun 22 14:53:03.845: vmx| NOT_IMPLEMENTED
/build/mts/release/bora-93057/bora/vmx/main/vmmonPosix.c:774"
and the reason seems to be that there's an old bug in how we handle do
FOLL_ANON on VM_SHARED areas in get_user_pages(), but since it only
triggered if the whole page table was missing, nobody had apparently hit
it before.
The recent changes to 'follow_page()' made the FOLL_ANON logic trigger
not just for whole missing page tables, but for individual pages as
well, and exposed this problem.
This fixes it by making the test for when FOLL_ANON is used more
careful, and also makes the code easier to read and understand by moving
the logic to a separate inline function.
Reported-and-tested-by: Jeff Chua <jeff.chua.linux@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The zonelist patches caused the loop that checks for available
objects in permitted zones to not terminate immediately. One object
per zone per allocation may be allocated and then abandoned.
Break the loop when we have successfully allocated one object.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch changes the function reserve_bootmem_node() from void to int,
returning -ENOMEM if the allocation fails.
This fixes a build problem on x86 with CONFIG_KEXEC=y and
CONFIG_NEED_MULTIPLE_NODES=y
Signed-off-by: Bernhard Walle <bwalle@suse.de>
Reported-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
KAMEZAWA Hiroyuki and Oleg Nesterov point out that since the commit
557ed1fa26 ("remove ZERO_PAGE") removed
the ZERO_PAGE from the VM mappings, any users of get_user_pages() will
generally now populate the VM with real empty pages needlessly.
We used to get the ZERO_PAGE when we did the "handle_mm_fault()", but
since fault handling no longer uses ZERO_PAGE for new anonymous pages,
we now need to handle that special case in follow_page() instead.
In particular, the removal of ZERO_PAGE effectively removed the core
file writing optimization where we would skip writing pages that had not
been populated at all, and increased memory pressure a lot by allocating
all those useless newly zeroed pages.
This reinstates the optimization by making the unmapped PTE case the
same as for a non-existent page table, which already did this correctly.
While at it, this also fixes the XIP case for follow_page(), where the
caller could not differentiate between the case of a page that simply
could not be used (because it had no "struct page" associated with it)
and a page that just wasn't mapped.
We do that by simply returning an error pointer for pages that could not
be turned into a "struct page *". The error is arbitrarily picked to be
EFAULT, since that was what get_user_pages() already used for the
equivalent IO-mapped page case.
[ Also removed an impossible test for pte_offset_map_lock() failing:
that's not how that function works ]
Acked-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Nick Piggin <npiggin@suse.de>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We need this at least for huge page detection for now, because powerpc
needs the vm_area_struct to be able to determine whether a virtual address
is referring to a huge page (its pmd_huge() doesn't work).
It might also come in handy for some of the other users.
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Acked-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
"Smarter retry of costly-order allocations" patch series change behaver of
do_try_to_free_pages(). But unfortunately ret variable type was
unchanged.
Thus an overflow is possible.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This implements a few changes on top of the recent kobjsize() refactoring
introduced by commit 6cfd53fc03.
As Christoph points out:
virt_to_head_page cannot return NULL. virt_to_page also
does not return NULL. pfn_valid() needs to be used to
figure out if a page is valid. Otherwise the page struct
reference that was returned may have PageReserved() set
to indicate that it is not a valid page.
As discussed further in the thread, virt_addr_valid() is the preferable
way to validate the object pointer in this case. In addition to fixing
up the reserved page case, it also has the benefit of encapsulating the
hack introduced by commit 4016a1390d on
the impacted platforms, allowing us to get rid of the extra checking in
kobjsize() for the platforms that don't perform this type of bizarre
memory_end abuse (every nommu platform that isn't blackfin). If blackfin
decides to get in line with every other platform and use PageReserved
for the DMA pages in question, kobjsize() will also continue to work
fine.
It also turns out that compound_order() will give us back 0-order for
non-head pages, so we can get rid of the PageCompound check and just
use compound_order() directly. Clean that up while we're at it.
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Reviewed-by: Christoph Lameter <clameter@sgi.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now we are using register_e820_active_regions() instead of
add_active_range() directly. So end_pfn could be different between the
value in early_node_map to node_end_pfn.
So we need to make shrink_active_range() smarter.
shrink_active_range() is a generic MM function in mm/page_alloc.c but
it is only used on 32-bit x86. Should we move it back to some file in
arch/x86?
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Minor source code cleanup of page flags in mm/page_alloc.c.
Move the definition of the groups of bits to page-flags.h.
The purpose of this clean up is that the next patch will
conditionally add a page flag to the groups. Doing that
in a header file is cleaner than adding #ifdefs to the
C code.
Signed-off-by: Russ Anderson <rja@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kobjsize() has been abusing page->index as a method for sorting out
compound order, which blows up both for page cache pages, and SLOB's
reuse of the index in struct slob_page.
Presently we are not able to accurately size arbitrary pointers that
don't come from kmalloc(), so the best we can do is sort out the
compound order from the head page if it's a compound page, or default
to 0-order if it's impossible to ksize() the object.
Obviously this leaves quite a bit to be desired in terms of object
sizing accuracy, but the behaviour is unchanged over the existing
implementation, while fixing the page->index oopses originally reported
here:
http://marc.info/?l=linux-mm&m=121127773325245&w=2
Accuracy could also be improved by having SLUB and SLOB both set PG_slab
on ksizeable pages, rather than just handling the __GFP_COMP cases
irregardless of the PG_slab setting, as made possibly with Pekka's
patches:
http://marc.info/?l=linux-kernel&m=121139439900534&w=2http://marc.info/?l=linux-kernel&m=121139440000537&w=2http://marc.info/?l=linux-kernel&m=121139440000540&w=2
This is primarily a bugfix for nommu systems for 2.6.26, with the aim
being to gradually kill off kobjsize() and its particular brand of
object abuse entirely.
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix a regression introduced by
commit 4cc6028d40
Author: Jiri Kosina <jkosina@suse.cz>
Date: Wed Feb 6 22:39:44 2008 +0100
brk: check the lower bound properly
The check in sys_brk() on minimum value the brk might have must take
CONFIG_COMPAT_BRK setting into account. When this option is turned on
(i.e. we support ancient legacy binaries, e.g. libc5-linked stuff), the
lower bound on brk value is mm->end_code, otherwise the brk start is
allowed to be arbitrarily shifted.
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Trying to add memory via add_memory() from within an initcall function
results in
bootmem alloc of 163840 bytes failed!
Kernel panic - not syncing: Out of memory
This is caused by zone_wait_table_init() which uses system_state to decide
if it should use the bootmem allocator or not.
When initcalls are handled the system_state is still SYSTEM_BOOTING but
the bootmem allocator doesn't work anymore. So the allocation will fail.
To fix this use slab_is_available() instead as indicator like we do it
everywhere else.
[akpm@linux-foundation.org: coding-style fix]
Reviewed-by: Andy Whitcroft <apw@shadowen.org>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When booting 2.6.26-rc3 on a multi-node x86_32 numa system we are seeing
panics when trying node local allocations:
BUG: unable to handle kernel NULL pointer dereference at 0000034c
IP: [<c1042507>] get_page_from_freelist+0x4a/0x18e
*pdpt = 00000000013a7001 *pde = 0000000000000000
Oops: 0000 [#1] SMP
Modules linked in:
Pid: 0, comm: swapper Not tainted (2.6.26-rc3-00003-g5abc28d #82)
EIP: 0060:[<c1042507>] EFLAGS: 00010282 CPU: 0
EIP is at get_page_from_freelist+0x4a/0x18e
EAX: c1371ed8 EBX: 00000000 ECX: 00000000 EDX: 00000000
ESI: f7801180 EDI: 00000000 EBP: 00000000 ESP: c1371ec0
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Process swapper (pid: 0, ti=c1370000 task=c12f5b40 task.ti=c1370000)
Stack: 00000000 00000000 00000000 00000000 000612d0 000412d0 00000000 000412d0
f7801180 f7c0101c f7c01018 c10426e4 f7c01018 00000001 00000044 00000000
00000001 c12f5b40 00000001 00000010 00000000 000412d0 00000286 000412d0
Call Trace:
[<c10426e4>] __alloc_pages_internal+0x99/0x378
[<c10429ca>] __alloc_pages+0x7/0x9
[<c105e0e8>] kmem_getpages+0x66/0xef
[<c105ec55>] cache_grow+0x8f/0x123
[<c105f117>] ____cache_alloc_node+0xb9/0xe4
[<c105f427>] kmem_cache_alloc_node+0x92/0xd2
[<c122118c>] setup_cpu_cache+0xaf/0x177
[<c105e6ca>] kmem_cache_create+0x2c8/0x353
[<c13853af>] kmem_cache_init+0x1ce/0x3ad
[<c13755c5>] start_kernel+0x178/0x1ee
This occurs when we are scanning the zonelists looking for a ZONE_NORMAL
page. In this system there is only ZONE_DMA and ZONE_NORMAL memory on
node 0, all other nodes are mapped above 4GB physical. Here is a dump
of the zonelists from this system:
zonelists pgdat=c1400000
0: c14006c0:2 f7c006c0:2 f7e006c0:2 c1400360:1 c1400000:0
1: c14006c0:2 c1400360:1 c1400000:0
zonelists pgdat=f7c00000
0: f7c006c0:2 f7e006c0:2 c14006c0:2 c1400360:1 c1400000:0
1: f7c006c0:2
zonelists pgdat=f7e00000
0: f7e006c0:2 c14006c0:2 f7c006c0:2 c1400360:1 c1400000:0
1: f7e006c0:2
When performing a node local allocation we call get_page_from_freelist()
looking for a page. It in turn calls first_zones_zonelist() which returns
a preferred_zone. Where there are no applicable zones this will be NULL.
However we use this unconditionally, leading to this panic.
Where there are no applicable zones there is no possibility of a successful
allocation, so simply fail the allocation.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The atomic_t type is 32bit but a 64bit system can have more than 2^32
pages of virtual address space available. Without this we overflow on
ludicrously large mappings
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In a zone's present pages number, account for all pages occupied by the
memory map, including a partial.
Signed-off-by: Johannes Weiner <hannes@saeurebad.de>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Take out an assertion to allow ->fault handlers to service PFNMAP regions.
This is required to reimplement .nopfn handlers with .fault handlers and
subsequently remove nopfn.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Jes Sorensen <jes@sgi.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently there is no protection from the root user to use up all of
memory for trace buffers. If the root user allocates too many entries,
the OOM killer might start kill off all tasks.
This patch adds an algorith to check the following condition:
pages_requested > (freeable_memory + current_trace_buffer_pages) / 4
If the above is met then the allocation fails. The above prevents more
than 1/4th of freeable memory from being used by trace buffers.
To determine the freeable_memory, I made determine_dirtyable_memory in
mm/page-writeback.c global.
Special thanks goes to Peter Zijlstra for suggesting the above calculation.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Change references from for_each_cpu_mask to for_each_cpu_mask_nr
where appropriate
Reviewed-by: Paul Jackson <pj@sgi.com>
Reviewed-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Add a WARN_ON for pages that don't have PageSlab nor PageCompound set to catch
the worst abusers of ksize() in the kernel.
Acked-by: Christoph Lameter <clameter@sgi.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
There is a race from when a device is created with device_create() and
then the drvdata is set with a call to dev_set_drvdata() in which a
sysfs file could be open, yet the drvdata will be NULL, causing all
sorts of bad things to happen.
This patch fixes the problem by using the new function,
device_create_vargs().
Many thanks to Arthur Jones <ajones@riverbed.com> for reporting the bug,
and testing patches out.
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Arthur Jones <ajones@riverbed.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Although slob_alloc return NULL, __kmalloc_node returns NULL + align.
Because align always can be changed, it is very hard for debugging
problem of no page if it don't return NULL.
We have to return NULL in case of no page.
[penberg@cs.helsinki.fi: fix formatting as suggested by Matt.]
Acked-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: MinChan Kim <minchan.kim@gmail.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Trying to online a new memory section that was added via memory hotplug
sometimes results in crashes when the new pages are added via __free_page.
Reason for that is that the pageblock bitmap isn't initialized and hence
contains random stuff. That means that get_pageblock_migratetype()
returns also random stuff and therefore
list_add(&page->lru,
&zone->free_area[order].free_list[migratetype]);
in __free_one_page() tries to do a list_add to something that isn't even
necessarily a list.
This happens since 86051ca5ea ("mm: fix
usemap initialization") which makes sure that the pageblock bitmap gets
only initialized for pages present in a zone. Unfortunately for hot-added
memory the zones "grow" after the memmap and the pageblock memmap have
been initialized. Which means that the new pages have an unitialized
bitmap. To solve this the calls to grow_zone_span() and grow_pgdat_span()
are moved to __add_zone() just before the initialization happens.
The patch also moves the two functions since __add_zone() is the only
caller and I didn't want to add a forward declaration.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is a defect in mprotect, which lets the user change the page cache
type bits by-passing the kernel reserve_memtype and free_memtype
wrappers. Fix the problem by not letting mprotect change the PAT bits.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a check to online_pages() to test for failure of
walk_memory_resource(). This fixes a condition where a failure
of walk_memory_resource() can lead to online_pages() returning
success without the requested pages being onlined.
Signed-off-by: Geoff Levand <geoffrey.levand@am.sony.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Keith Mannthey <kmannth@us.ibm.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__add_zone calls memmap_init_zone twice if memory gets attached to an empty
zone. Once via init_currently_empty_zone and once explictly right after that
call.
Looks like this is currently not a bug, however the call is superfluous and
might lead to subtle bugs if memmap_init_zone gets changed. So make sure it
is called only once.
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
filemap_fault will go into an infinite loop if ->readpage() fails
asynchronously.
AFAICS the bug was introduced by this commit, which removed the wait after the
final readpage:
commit d00806b183
Author: Nick Piggin <npiggin@suse.de>
Date: Thu Jul 19 01:46:57 2007 -0700
mm: fix fault vs invalidate race for linear mappings
Fix by reintroducing the wait_on_page_locked() after ->readpage() to make sure
the page is up-to-date before jumping back to the beginning of the function.
I've noticed this while testing nfs exporting on fuse. The patch
fixes it.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is a possible data race in the page table walking code. After the split
ptlock patches, it actually seems to have been introduced to the core code, but
even before that I think it would have impacted some architectures (powerpc
and sparc64, at least, walk the page tables without taking locks eg. see
find_linux_pte()).
The race is as follows:
The pte page is allocated, zeroed, and its struct page gets its spinlock
initialized. The mm-wide ptl is then taken, and then the pte page is inserted
into the pagetables.
At this point, the spinlock is not guaranteed to have ordered the previous
stores to initialize the pte page with the subsequent store to put it in the
page tables. So another Linux page table walker might be walking down (without
any locks, because we have split-leaf-ptls), and find that new pte we've
inserted. It might try to take the spinlock before the store from the other
CPU initializes it. And subsequently it might read a pte_t out before stores
from the other CPU have cleared the memory.
There are also similar races in higher levels of the page tables. They
obviously don't involve the spinlock, but could see uninitialized memory.
Arch code and hardware pagetable walkers that walk the pagetables without
locks could see similar uninitialized memory problems, regardless of whether
split ptes are enabled or not.
I prefer to put the barriers in core code, because that's where the higher
level logic happens, but the page table accessors are per-arch, and open-coding
them everywhere I don't think is an option. I'll put the read-side barriers
in alpha arch code for now (other architectures perform data-dependent loads
in order).
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>