Introduce infrastructure for tracking kernel memory pages to a given
memcg. This will happen whenever the caller includes the flag
__GFP_KMEMCG flag, and the task belong to a memcg other than the root.
In memcontrol.h those functions are wrapped in inline acessors. The idea
is to later on, patch those with static branches, so we don't incur any
overhead when no mem cgroups with limited kmem are being used.
Users of this functionality shall interact with the memcg core code
through the following functions:
memcg_kmem_newpage_charge: will return true if the group can handle the
allocation. At this point, struct page is not
yet allocated.
memcg_kmem_commit_charge: will either revert the charge, if struct page
allocation failed, or embed memcg information
into page_cgroup.
memcg_kmem_uncharge_page: called at free time, will revert the charge.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Frederic Weisbecker <fweisbec@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: JoonSoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Cc: Suleiman Souhlal <suleiman@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This flag is used to indicate to the callees that this allocation is a
kernel allocation in process context, and should be accounted to current's
memcg.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Frederic Weisbecker <fweisbec@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: JoonSoo Kim <js1304@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add the basic infrastructure for the accounting of kernel memory. To
control that, the following files are created:
* memory.kmem.usage_in_bytes
* memory.kmem.limit_in_bytes
* memory.kmem.failcnt
* memory.kmem.max_usage_in_bytes
They have the same meaning of their user memory counterparts. They
reflect the state of the "kmem" res_counter.
Per cgroup kmem memory accounting is not enabled until a limit is set for
the group. Once the limit is set the accounting cannot be disabled for
that group. This means that after the patch is applied, no behavioral
changes exists for whoever is still using memcg to control their memory
usage, until memory.kmem.limit_in_bytes is set for the first time.
We always account to both user and kernel resource_counters. This
effectively means that an independent kernel limit is in place when the
limit is set to a lower value than the user memory. A equal or higher
value means that the user limit will always hit first, meaning that kmem
is effectively unlimited.
People who want to track kernel memory but not limit it, can set this
limit to a very high number (like RESOURCE_MAX - 1page - that no one will
ever hit, or equal to the user memory)
[akpm@linux-foundation.org: MEMCG_MMEM only works with slab and slub]
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Frederic Weisbecker <fweisbec@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: JoonSoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is just a cleanup patch for clarity of expression. In earlier
submissions, people asked it to be in a separate patch, so here it is.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Frederic Weisbecker <fweisbec@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: JoonSoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Rik van Riel <riel@redhat.com>
Cc: Suleiman Souhlal <suleiman@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mem_cgroup_do_charge() was written before kmem accounting, and expects
three cases: being called for 1 page, being called for a stock of 32
pages, or being called for a hugepage. If we call for 2 or 3 pages (and
both the stack and several slabs used in process creation are such, at
least with the debug options I had), it assumed it's being called for
stock and just retried without reclaiming.
Fix that by passing down a minsize argument in addition to the csize.
And what to do about that (csize == PAGE_SIZE && ret) retry? If it's
needed at all (and presumably is since it's there, perhaps to handle
races), then it should be extended to more than PAGE_SIZE, yet how far?
And should there be a retry count limit, of what? For now retry up to
COSTLY_ORDER (as page_alloc.c does) and make sure not to do it if
__GFP_NORETRY.
v4: fixed nr pages calculation pointed out by Christoph Lameter.
Signed-off-by: Suleiman Souhlal <suleiman@google.com>
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Frederic Weisbecker <fweisbec@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: JoonSoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We currently have a percpu stock cache scheme that charges one page at a
time from memcg->res, the user counter. When the kernel memory controller
comes into play, we'll need to charge more than that.
This is because kernel memory allocations will also draw from the user
counter, and can be bigger than a single page, as it is the case with the
stack (usually 2 pages) or some higher order slabs.
[glommer@parallels.com: added a changelog ]
Signed-off-by: Suleiman Souhlal <suleiman@google.com>
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Frederic Weisbecker <fweisbec@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: JoonSoo Kim <js1304@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Rik van Riel <riel@redhat.com>
Cc: Suleiman Souhlal <suleiman@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add help info for CONFIG_MOVABLE_NODE and permit its selection.
This option allows the user to online all memory of a node as movable
memory. So that the whole node can be hotplugged. Users who don't use
the hotplug feature are also fine with this option on since they won't
online memory as movable.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
[akpm@linux-foundation.org: tweak help text]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While allocating pages using buddy allocator, the compound page is
probably split up to free pages. Under these circumstances, the compound
page should be destroyed by destroy_compound_page(). However, there is a
duplicate check to judge if the page is compound.
Remove the duplicate check since the compound_order() returns 0 when the
page doesn't have PG_head set in destroy_compound_page(). That is to say,
destroy_compound_page() needn't check PageHead().
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This happens to do the right thing in all cases on fibre channel but not on
other media types
Signed-off-by: Alan Cox <alan@linux.intel.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Nagalakshmi Nandigama <nagalakshmi.nandigama@lsi.com>
Cc: Kashyap Desai <kashyap.desai@lsi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rationales from Eric:
So I just looked a little deeper and it appears architectures that do
not support atomic64_t are broken.
The generic atomic64 support came in 2009 to support the perf subsystem
with the expectation that all architectures would implement atomic64
support.
Furthermore upon inspection of the kernel atomic64_t is used in a fair
number of places beyond the performance counters:
block/blk-cgroup.c
drivers/acpi/apei/
drivers/block/rbd.c
drivers/crypto/nx/nx.h
drivers/gpu/drm/radeon/radeon.h
drivers/infiniband/hw/ipath/
drivers/infiniband/hw/qib/
drivers/staging/octeon/
fs/xfs/
include/linux/perf_event.h
include/net/netfilter/nf_conntrack_acct.h
kernel/events/
kernel/trace/
net/mac80211/key.h
net/rds/
The block control group, infiniband, xfs, crypto, 802.11, netfilter.
Nothing quite so fundamental as fs/namespace.c but definitely in
multiplatform-code that should work, and is already broken on those
architecutres.
Looking at the implementation of atomic64_add_return in lib/atomic64.c the
code looks as efficient as these kinds of things get.
Which leads me to the conclusion that we need atomic64 support on all
architectures.
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ensure that calls to d_find_alias() have a corresponding dput().
Signed-off-by: Cyril Roelandt <tipecaml@gmail.com>
Cc: Julia Lawall <Julia.Lawall@lip6.fr>
Cc: Gilles Muller <Gilles.Muller@lip6.fr>
Cc: Nicolas Palix <nicolas.palix@imag.fr>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The array check is useless so remove it.
[akpm@linux-foundation.org: remove comment, per David]
Signed-off-by: Alan Cox <alan@linux.intel.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
dput() was not called in the error path.
Signed-off-by: Cyril Roelandt <tipecaml@gmail.com>
Cc: Sage Weil <sage@inktank.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This makes the iris driver use the platform API, so it is properly exposed
in /sys.
[akpm@linux-foundation.org: remove commented-out code, add missing space to printk, clean up code layout]
Signed-off-by: Shérab <Sebastien.Hinderer@ens-lyon.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Matthew Garrett <mjg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The inb/outb macros for CRIS are broken from a number of points of view,
missing () around parameters and they have an unprotected if statement
in them. This was breaking the compile of IPMI on CRIS and thus I was
being annoyed by build regressions, so I fixed them.
Plus I don't think they would have worked at all, since the data values
were missing "&" and the outsl had a "3" instead of a "4" for the size.
From what I can tell, this stuff is not used at all, so this can't be
any more broken than it was before, anyway.
Signed-off-by: Corey Minyard <cminyard@mvista.com>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Mikael Starvik <starvik@axis.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch fixes the checkpatch error and warning as below:
WARNING: space prohibited between function name and open parenthesis '('
ERROR: trailing statements should be on next line
Also, long comments are fixed for the preferred style and unnecessary
lines are removed.
Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Cc: Richard Purdie <rpurdie@rpsys.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull SLAB changes from Pekka Enberg:
"This contains preparational work from Christoph Lameter and Glauber
Costa for SLAB memcg and cleanups and improvements from Ezequiel
Garcia and Joonsoo Kim.
Please note that the SLOB cleanup commit from Arnd Bergmann already
appears in your tree but I had also merged it myself which is why it
shows up in the shortlog."
* 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux:
mm/sl[aou]b: Common alignment code
slab: Use the new create_boot_cache function to simplify bootstrap
slub: Use statically allocated kmem_cache boot structure for bootstrap
mm, sl[au]b: create common functions for boot slab creation
slab: Simplify bootstrap
slub: Use correct cpu_slab on dead cpu
mm: fix slab.c kernel-doc warnings
mm/slob: use min_t() to compare ARCH_SLAB_MINALIGN
slab: Ignore internal flags in cache creation
mm/slob: Use free_page instead of put_page for page-size kmalloc allocations
mm/sl[aou]b: Move common kmem_cache_size() to slab.h
mm/slob: Use object_size field in kmem_cache_size()
mm/slob: Drop usage of page->private for storing page-sized allocations
slub: Commonize slab_cache field in struct page
sl[au]b: Process slabinfo_show in common code
mm/sl[au]b: Move print_slabinfo_header to slab_common.c
mm/sl[au]b: Move slabinfo processing to slab_common.c
slub: remove one code path and reduce lock contention in __slab_free()
Pull (again) user namespace infrastructure changes from Eric Biederman:
"Those bugs, those darn embarrasing bugs just want don't want to get
fixed.
Linus I just updated my mirror of your kernel.org tree and it appears
you successfully pulled everything except the last 4 commits that fix
those embarrasing bugs.
When you get a chance can you please repull my branch"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
userns: Fix typo in description of the limitation of userns_install
userns: Add a more complete capability subset test to commit_creds
userns: Require CAP_SYS_ADMIN for most uses of setns.
Fix cap_capable to only allow owners in the parent user namespace to have caps.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIVAwUAUM8oVROxKuMESys7AQJNCRAAtPxACfDAL0XFdLA21uP+kOID43pF6eUY
g/6bAwDk8n05YJ/OcgEsxU5X2L4UXSmDUPCpFOZtYT83+ksxgKOTz4A8++gMpSqO
FcfNgSoKv9Bdp5Io1wLWJUHHewgB+LorwH6dXN8EpLZ8jzPsAgCalJVy2RCGqsaV
MXzySB3N7iZS0KTqcB5ULDxLCjQhETPua8jMRy/LyweDLbIF587qglHN6YB+eCna
kuSknL5NW1PLXWvERFdcphn6mz9SBtrvJk6/RRS0q3/hiORT2qr+p+QEM+HTIIfU
hM/fGJSg7dH8Zgv1kfOZDW9HVr580Fk1NOi1XKrXaoT2yAxBpGFmpkUTCKGFYqeo
x7e5fswm0DPRMjPhrXv43qztjRKoMwmiW/A2ay28E/2A31+pYggcq1Vd3OXiPGrH
T3/Dk2stlx+CpUaO214sDONPUiCkvyP4x9hEbUHBZZnA9kKHThMpmZM68zS1YTTS
zlb6aAevtE2/fBmJ4wDkaDMcWUVjSa2VHjbabBkFzhN340Vw6o7EChdXTg4mvdnd
4m+3MguPbDg2KP1Q1mwpB4wqDG8dzEq7H3xokmOncgvPBpfWOSXfBTGUhMwrJGaX
HEuZWZTZJ+dzGg+q/GPN9/qxFOhHe2ria8y/mJYwM+qrNm2lYDTU/NEtpn0TZhe0
PfAw8/PLyfI=
=RLH1
-----END PGP SIGNATURE-----
Merge tag 'disintegrate-alpha-20121217' of git://git.infradead.org/users/dhowells/linux-headers
Pull UAPI disintegration for Alpha from David Howells:
"I've been asked to send the Alpha UAPI disintegration to you directly.
The acks I have been given have been added into the patch."
* tag 'disintegrate-alpha-20121217' of git://git.infradead.org/users/dhowells/linux-headers:
UAPI: (Scripted) Disintegrate arch/alpha/include/asm
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iEYEABECAAYFAlDQeY8ACgkQ70gcjN2673NqLACgk1Uiw64UHI+uDVeB4mXYeUrO
C8UAnAgHumM61Mm5oiGRa1XzGSvxHQdK
=wVDf
-----END PGP SIGNATURE-----
Merge tag 'for-3.8' of git://openrisc.net/~jonas/linux
Pull OpenRISC update from Jonas Bonn:
"Trivial cleanups for OpenRISC."
* tag 'for-3.8' of git://openrisc.net/~jonas/linux:
openrisc: use kbuild.h instead of defining macros in asm-offset.c
openrisc: Use Kbuild infrastructure for kvm_para.h
Pull s390 update #2 from Martin Schwidefsky:
"The main patch is the function measurement blocks extension for PCI to
do performance statistics and help with debugging. The other patch is
a small cleanup in ccwdev.h."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/ccwdev: Include asm/schid.h.
s390/pci: performance statistics and debug infrastructure
Pull powerpc update from Benjamin Herrenschmidt:
"The main highlight is probably some base POWER8 support. There's more
to come such as transactional memory support but that will wait for
the next one.
Overall it's pretty quiet, or rather I've been pretty poor at picking
things up from patchwork and reviewing them this time around and Kumar
no better on the FSL side it seems..."
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (73 commits)
powerpc+of: Rename and fix OF reconfig notifier error inject module
powerpc: mpc5200: Add a3m071 board support
powerpc/512x: don't compile any platform DIU code if the DIU is not enabled
powerpc/mpc52xx: use module_platform_driver macro
powerpc+of: Export of_reconfig_notifier_[register,unregister]
powerpc/dma/raidengine: add raidengine device
powerpc/iommu/fsl: Add PAMU bypass enable register to ccsr_guts struct
powerpc/mpc85xx: Change spin table to cached memory
powerpc/fsl-pci: Add PCI controller ATMU PM support
powerpc/86xx: fsl_pcibios_fixup_bus requires CONFIG_PCI
drivers/virt: the Freescale hypervisor driver doesn't need to check MSR[GS]
powerpc/85xx: p1022ds: Use NULL instead of 0 for pointers
powerpc: Disable relocation on exceptions when kexecing
powerpc: Enable relocation on during exceptions at boot
powerpc: Move get_longbusy_msecs into hvcall.h and remove duplicate function
powerpc: Add wrappers to enable/disable relocation on exceptions
powerpc: Add set_mode hcall
powerpc: Setup relocation on exceptions for bare metal systems
powerpc: Move initial mfspr LPCR out of __init_LPCR
powerpc: Add relocation on exception vector handlers
...
With CONFIG_PARAVIRT=y and CONFIG_TRANSPARENT_HUGEPAGE=n, the build breaks
because set_pmd_at() is undeclared:
mm/memory.c: In function 'do_pmd_numa_page':
mm/memory.c:3520: error: implicit declaration of function 'set_pmd_at'
mm/mprotect.c: In function 'change_pmd_protnuma':
mm/mprotect.c:120: error: implicit declaration of function 'set_pmd_at'
This is because paravirt defines set_pmd_at() only when
CONFIG_TRANSPARENT_HUGEPAGE=y and such a restriction is unneeded. The
fix is to define it for all CONFIG_PARAVIRT configurations.
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull exofs changes from Boaz Harrosh:
"These are just 3 patches, the last two are bug fixes on the error
paths in exofs.
The important patch is the one to osd_uld which adds sysfs info to osd
devices for use by user-mode clustering discovery software. I'm
already sitting on this patch since before February this year, It is
important for some of the big installation cluster systems, who's been
compiling their own kernel just for that patch."
Ugh. The osd_uld patch already went through the SCSI tree, so this was
kind of pointless. But at least it has the two small error-path fixes..
* 'for-linus' of git://git.open-osd.org/linux-open-osd:
exofs: don't leak io_state and pages on read error
exofs: clean up the correct page collection on write error
osduld: Add osdname & systemid sysfs at scsi_osd class
Pull btrfs update from Chris Mason:
"A big set of fixes and features.
In terms of line count, most of the code comes from Stefan, who added
the ability to replace a single drive in place. This is different
from how btrfs normally replaces drives, and is much much much faster.
Josef is plowing through our synchronous write performance. This pull
request does not include the DIO_OWN_WAITING patch that was discussed
on the list, but it has a number of other improvements to cut down our
latencies and CPU time during fsync/O_DIRECT writes.
Miao Xie has a big series of fixes and is spreading out ordered
operations over more CPUs. This improves performance and reduces
contention.
I've put in fixes for error handling around hash collisions. These
are going back to individual stable kernels as I test against them.
Otherwise we have a lot of fixes and cleanups, thanks everyone!
raid5/6 is being rebased against the device replacement code. I'll
have it posted this Friday along with a nice series of benchmarks."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (115 commits)
Btrfs: fix a bug of per-file nocow
Btrfs: fix hash overflow handling
Btrfs: don't take inode delalloc mutex if we're a free space inode
Btrfs: fix autodefrag and umount lockup
Btrfs: fix permissions of empty files not affected by umask
Btrfs: put raid properties into global table
Btrfs: fix BUG() in scrub when first superblock reading gives EIO
Btrfs: do not call file_update_time in aio_write
Btrfs: only unlock and relock if we have to
Btrfs: use tokens where we can in the tree log
Btrfs: optimize leaf_space_used
Btrfs: don't memset new tokens
Btrfs: only clear dirty on the buffer if it is marked as dirty
Btrfs: move checks in set_page_dirty under DEBUG
Btrfs: log changed inodes based on the extent map tree
Btrfs: add path->really_keep_locks
Btrfs: do not mark ems as prealloc if we are writing to them
Btrfs: keep track of the extents original block length
Btrfs: inline csums if we're fsyncing
Btrfs: don't bother copying if we're only logging the inode
...
Features include:
- Full audit of BUG_ON asserts in the NFS, SUNRPC and lockd client code
Remove altogether where possible, and replace with WARN_ON_ONCE and
appropriate error returns where not.
- NFSv4.1 client adds session dynamic slot table management. There is
matching server side code that has been submitted to Bruce for
consideration. Together, this code allows the server to dynamically
manage the amount of memory it allocates to the duplicate request
cache for each client. It will constantly resize those caches to
reserve more memory for clients that are hot while shrinking caches
for those that are quiescent.
In addition, there are assorted bugfixes for the generic NFS write code,
fixes to deal with the drop_nlink() warnings, and yet another fix for
NFSv4 getacl.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJQz8VNAAoJEGcL54qWCgDy7iYQAKbr7AAZOcZPoJigzakZ7nMi
UKYulGbFais2Llwzw1e+U5RzmorTSbvl7/m8eS7pDf3auYw/t4xtXjKSGZUNxaE1
q2hNKgVwodMbScYdkZXvKKNckS93oPDttrmEyzjKanqey+1E3HSklvOvikN0ihte
B/G1OtA7Qpcr92bPrLK+PjDqarCBUI4g42dYbZOBrZnXKTRtzUqsuKPu7WjpPiof
SHE5b1Emt7oUxgcijWGcvYCQ8voZdeSCnSksH3DgvORlutwdhUD3Yg8KyEfFZdyc
6C59ozXRLiHkV3c+jMhJzDkQXR9bYHrnK3tlq4G8v1NdJxRktQliZeqecRvip/Wz
rAxfE6fnPDEvKsCpZb3+5yTAt+aZwzEhRg1fFC9qfGOp+oRa+CWw5kJCyIFHwJu6
4LOlubQAf6rnIsja1L8D0FdeqHUa1+wy61On5kgVYS5JGtoBsQHpa1zTwdOxPmsR
2XTMYGNCEabvpKpO9+5xQbUzkFExPTesw47ygXiUuDT/snaarpV3/f05SSCaWZkX
R8QsGEOXTIh8/S+UxARGpc7H6xi1PdBM5nBziHVzjEdHgZRF4wGFaJe2CirMjSJO
Df5GEd5Z/8VCGWs+1w7HD5EaQ2n0wbt5daCE80Y2jRBr7NMYnY+ciF8/GktLpHsn
Zq1bXGOdr3UZ92LXuzL9
=G3N9
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-3.8-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Features include:
- Full audit of BUG_ON asserts in the NFS, SUNRPC and lockd client
code. Remove altogether where possible, and replace with
WARN_ON_ONCE and appropriate error returns where not.
- NFSv4.1 client adds session dynamic slot table management. There
is matching server side code that has been submitted to Bruce for
consideration.
Together, this code allows the server to dynamically manage the
amount of memory it allocates to the duplicate request cache for
each client. It will constantly resize those caches to reserve
more memory for clients that are hot while shrinking caches for
those that are quiescent.
In addition, there are assorted bugfixes for the generic NFS write
code, fixes to deal with the drop_nlink() warnings, and yet another
fix for NFSv4 getacl."
* tag 'nfs-for-3.8-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (106 commits)
SUNRPC: continue run over clients list on PipeFS event instead of break
NFS: Don't use SetPageError in the NFS writeback code
SUNRPC: variable 'svsk' is unused in function bc_send_request
SUNRPC: Handle ECONNREFUSED in xs_local_setup_socket
NFSv4.1: Deal effectively with interrupted RPC calls.
NFSv4.1: Move the RPC timestamp out of the slot.
NFSv4.1: Try to deal with NFS4ERR_SEQ_MISORDERED.
NFS: nfs_lookup_revalidate should not trust an inode with i_nlink == 0
NFS: Fix calls to drop_nlink()
NFS: Ensure that we always drop inodes that have been marked as stale
nfs: Remove unused list nfs4_clientid_list
nfs: Remove duplicate function declaration in internal.h
NFS: avoid NULL dereference in nfs_destroy_server
SUNRPC handle EKEYEXPIRED in call_refreshresult
SUNRPC set gss gc_expiry to full lifetime
nfs: fix page dirtying in NFS DIO read codepath
nfs: don't zero out the rest of the page if we hit the EOF on a DIO READ
NFSv4.1: Be conservative about the client highest slotid
NFSv4.1: Handle NFS4ERR_BADSLOT errors correctly
nfs: don't extend writes to cover entire page if pagecache is invalid
...
Mostly just little fixes. Probably biggest part is
AVX accelerated RAID6 calculations.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iQIVAwUAUM/w2Dnsnt1WYoG5AQKXlg/9F5juv4CjRkRRFLqZgOPBLmn/s/2Vspgh
2Kv8Jcyixd8jUQNbobZv0ahlJH/iSU61kpOE8QjLbKi5Y42vAbM0ZU2aHJ6nqGZy
HiTI8K+7kTvCK3ZXLcUQ+4oPPBNTcoTZbLWaEOmIqB1ruLddoIR7M9fG3PspVeG0
jijnXR8IfL6mr4YDXnJkEhFrneTysVik05RkKYZKyM/9r3stAoMJ9o0/EFy3OFxb
lO6mLEtvjVArXcnuf1RMCw2YKgki9Y4r73HCplgQsVFvcxcpsya4gFF+lRR5j7cO
/eMYbSQ89iWEYKh1dJ9u1nofc8fX5ia71QQyO1fkO4GXRHXPVIyBgKSbe7SaL6iG
JUMm7idUV2rZGeq3ln3k8Yor4QqHvN1n7pRKKUF+ZdsPoQ1B/TABu+qpsAdo5ZhP
fxDsULsHrzEaxgetd4V8F2Uptca9ni43sMI8mwsvVlA0p6SOzMIyoJLC9xAZpx11
b3H3+7Oje/fasmszBoq5B9uAlSt9XXVN4DDn2q6cX+S96JSX6jcsN1c6cJBO+ZxB
OU6a6P5mnU6HuxU02rspe7G8BeU+ybaonErOW+GdyC4r7M/cImC0dSp0NGHK2211
oqu0xBx/Q/ddTFwKQqa4HzR2ws09+LhKbjdqYIhCEKttIbLIAjf73ARZ19XPSRRX
pDR/ey2CB6E=
=uK52
-----END PGP SIGNATURE-----
Merge tag 'md-3.8' of git://neil.brown.name/md
Pull md update from Neil Brown:
"Mostly just little fixes. Probably biggest part is AVX accelerated
RAID6 calculations."
* tag 'md-3.8' of git://neil.brown.name/md:
md/raid5: add blktrace calls
md/raid5: use async_tx_quiesce() instead of open-coding it.
md: Use ->curr_resync as last completed request when cleanly aborting resync.
lib/raid6: build proper files on corresponding arch
lib/raid6: Add AVX2 optimized gen_syndrome functions
lib/raid6: Add AVX2 optimized recovery functions
md: Update checkpoint of resync/recovery based on time.
md:Add place to update ->recovery_cp.
md.c: re-indent various 'switch' statements.
md: close race between removing and adding a device.
md: removed unused variable in calc_sb_1_csm.
Fix up a trivial merge conflict with commit baaf1dd ("mm/slob: use
min_t() to compare ARCH_SLAB_MINALIGN") that did not go through the slab
tree.
Conflicts:
mm/slob.c
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Merge misc patches from Andrew Morton:
"Incoming:
- lots of misc stuff
- backlight tree updates
- lib/ updates
- Oleg's percpu-rwsem changes
- checkpatch
- rtc
- aoe
- more checkpoint/restart support
I still have a pile of MM stuff pending - Pekka should be merging
later today after which that is good to go. A number of other things
are twiddling thumbs awaiting maintainer merges."
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (180 commits)
scatterlist: don't BUG when we can trivially return a proper error.
docs: update documentation about /proc/<pid>/fdinfo/<fd> fanotify output
fs, fanotify: add @mflags field to fanotify output
docs: add documentation about /proc/<pid>/fdinfo/<fd> output
fs, notify: add procfs fdinfo helper
fs, exportfs: add exportfs_encode_inode_fh() helper
fs, exportfs: escape nil dereference if no s_export_op present
fs, epoll: add procfs fdinfo helper
fs, eventfd: add procfs fdinfo helper
procfs: add ability to plug in auxiliary fdinfo providers
tools/testing/selftests/kcmp/kcmp_test.c: print reason for failure in kcmp_test
breakpoint selftests: print failure status instead of cause make error
kcmp selftests: print fail status instead of cause make error
kcmp selftests: make run_tests fix
mem-hotplug selftests: print failure status instead of cause make error
cpu-hotplug selftests: print failure status instead of cause make error
mqueue selftests: print failure status instead of cause make error
vm selftests: print failure status instead of cause make error
ubifs: use prandom_bytes
mtd: nandsim: use prandom_bytes
...
When compiling efivars.c the build fails with:
CC drivers/firmware/efivars.o
drivers/firmware/efivars.c: In function ‘efivarfs_get_inode’:
drivers/firmware/efivars.c:886:31: error: incompatible types when assigning to type ‘kgid_t’ from type ‘int’
make[2]: *** [drivers/firmware/efivars.o] Error 1
make[1]: *** [drivers/firmware/efivars.o] Error 2
Fix the build error by removing the duplicate initialization of i_uid and
i_gid inode_init_always has already initialized them to 0.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This build error is currently hidden by the fact that the x86
implementation of 'update_mmu_cache_pmd()' is a macro that doesn't use
its last argument, but commit b32967ff10 ("mm: numa: Add THP migration
for the NUMA working set scanning fault case") introduced a call with
the wrong third argument.
In the akpm tree, it causes this build error:
mm/migrate.c: In function 'migrate_misplaced_transhuge_page_put':
mm/migrate.c:1666:2: error: incompatible type for argument 3 of 'update_mmu_cache_pmd'
arch/x86/include/asm/pgtable.h:792:20: note: expected 'struct pmd_t *' but argument is of type 'pmd_t'
Fix it.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is absolutely no reason to crash the kernel when we have a
perfectly good return value already available to use for conveying
failure status.
Let's return an error code instead of crashing the kernel: that sounds
like a much better plan.
[akpm@linux-foundation.org: s/E2BIG/EINVAL/]
Signed-off-by: Nick Bowler <nbowler@elliptictech.com>
Cc: Maxim Levitsky <maximlevitsky@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kernel keeps FAN_MARK_IGNORED_SURV_MODIFY bit separately from
fsnotify_mark::mask|ignored_mask thus put it in @mflags (mark flags)
field so the user-space reader will be able to detect if such bit were
used on mark creation procedure.
| pos: 0
| flags: 04002
| fanotify flags:10 event-flags:0
| fanotify mnt_id:12 mflags:40 mask:38 ignored_mask:40000003
| fanotify ino:4f969 sdev:800013 mflags:0 mask:3b ignored_mask:40000000 fhandle-bytes:8 fhandle-type:1 f_handle:69f90400c275b5b4
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: James Bottomley <jbottomley@parallels.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Matthew Helsley <matt.helsley@gmail.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Tvrtko Ursulin <tvrtko.ursulin@onelan.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We will need this helper in the next patch to provide a file handle for
inotify marks in /proc/pid/fdinfo output.
The patch is rather providing the way to use inodes directly when dentry
is not available (like in case of inotify system).
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: James Bottomley <jbottomley@parallels.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Matthew Helsley <matt.helsley@gmail.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@onelan.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This routine will be used to generate a file handle in fdinfo output for
inotify subsystem, where if no s_export_op present the general
export_encode_fh should be used. Thus add a test if s_export_op present
inside exportfs_encode_fh itself.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: James Bottomley <jbottomley@parallels.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Matthew Helsley <matt.helsley@gmail.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@onelan.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch brings ability to print out auxiliary data associated with
file in procfs interface /proc/pid/fdinfo/fd.
In particular further patches make eventfd, evenpoll, signalfd and
fsnotify to print additional information complete enough to restore
these objects after checkpoint.
To simplify the code we add show_fdinfo callback inside struct
file_operations (as Al and Pavel are proposing).
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: James Bottomley <jbottomley@parallels.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Matthew Helsley <matt.helsley@gmail.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@onelan.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I was curious why sys_kcmp wasn't working, which led me to the testcase.
It turned out I hadn't enabled CHECKPOINT_RESTORE in the kernel I was
testing. Add a decoding of errno to the testcase to make that obvious.
Signed-off-by: Dave Jones <davej@redhat.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In case breakpoint test exit non zero value it will cause make error.
Better way is just print the test failure status.
Signed-off-by: Dave Young <dyoung@redhat.com>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In case kcmp_test exit non zero value it will cause make error.
Better way is just print the test failure status.
Signed-off-by: Dave Young <dyoung@redhat.com>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
make run_tests need the target is run_tests instead of run-tests
Also gcc output should be kcmp_test. Fix these two issues.
Signed-off-by: Dave Young <dyoung@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Original behavior:
bash-4.1$ make -C mqueue run_tests
make: Entering directory `/home/dave/git/linux-2.6/tools/testing/selftests/mqueue'
./mq_open_tests /test1
Not running as root, but almost all tests require root in order to modify
system settings. Exiting.
make: *** [run_tests] Error 1
make: Leaving directory `/home/dave/git/linux-2.6/tools/testing/selftests/mqueue'
After applying the patch:
bash-4.1$ make -C mqueue run_tests
make: Entering directory `/home/dave/git/linux-2.6/tools/testing/selftests/mqueue'
Not running as root, but almost all tests require root in order to modify
system settings. Exiting.
mq_open_tests: [FAIL]
Not running as root, but almost all tests require root in order to modify
system settings. Exiting.
mq_perf_tests: [FAIL]
make: Leaving directory `/home/dave/git/linux-2.6/tools/testing/selftests/mqueue'
Signed-off-by: Dave Young <dyoung@redhat.com>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>