Fix a lockdep-splat-causing regression introduced by commit 989a297920
("fasync: RCU and fine grained locking").
kill_fasync() can be called from both process and hard-irq context, so
fa_lock must be taken with IRQs disabled.
Addresses https://bugzilla.kernel.org/show_bug.cgi?id=16230
Reported-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reported-by: Dominik Brodowski <linux@dominikbrodowski.net>
Tested-by: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: "David S. Miller" <davem@davemloft.net>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A call to sysv_write_inode() in sysv_new_inode() to its new interface that
replaced wait flag with writeback structure. This was broken by
a9185b41a4 ("pass writeback_control to
->write_inode").
Signed-off-by: Lubomir Rintel <lkundrak@v3.sk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@kernel.org> [2.6.34.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The recent commit 1f0ce8b3dd ("mm: Move ARCH_SLAB_MINALIGN and
ARCH_KMALLOC_MINALIGN to <linux/slab_def.h>") which moved the
ARCH_SLAB_MINALIGN default into the global header inadvertently broke FLAT
for a bunch of systems. Blackfin systems now fail on any FLAT exec with:
Unable to read code+data+bss, errno 14 When your /init is a FLAT binary,
obviously this can be annoying ;).
This stems from the alignment usage in the FLAT loader. The behavior
before was that FLAT would default to ARCH_SLAB_MINALIGN only if it was
defined, and this was only defined by arches when they wanted a larger
alignment value. Otherwise it'd default to pointer alignment. Arguably,
this is kind of hokey that the FLAT is semi-abusing defines it shouldn't.
So let's merge the two alignment requirements so the floor is never 0.
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Cc: David McCullough <davidm@snapgear.com>
Cc: Greg Ungerer <gerg@uclinux.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: David Howells <dhowells@redhat.com>
Cc: David Woodhouse <David.Woodhouse@intel.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add support to the NOMMU /proc/pid/maps file to show which mapping is the stack
of the original thread after execve. This is largely based on the MMU code.
Subsidiary thread stacks are not indicated.
For FDPIC, we now get:
root:/> cat /proc/self/maps
02064000-02067ccc rw-p 0004d000 00:01 22 /bin/busybox
0206e000-0206f35c rw-p 00006000 00:01 295 /lib/ld-uClibc.so.0
025f0000-025f6f0c r-xs 00000000 00:01 295 /lib/ld-uClibc.so.0
02680000-026ba6b0 r-xs 00000000 00:01 297 /lib/libc.so.0
02700000-0274d384 r-xs 00000000 00:01 22 /bin/busybox
02816000-02817000 rw-p 00000000 00:00 0
02848000-0284c0d8 rw-p 00000000 00:00 0
02860000-02880000 rw-p 00000000 00:00 0 [stack]
The semi-downside here is that for FLAT, we get:
root:/> cat /proc/155/maps
029f0000-029f9000 rwxp 00000000 00:00 0 [stack]
The reason being that FLAT combines a whole lot of stuff into one map
(including the stack). But this isn't any worse than the current output
(which is nothing), so screw it.
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Greg Ungerer <gerg@snapgear.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
block: Don't count_vm_events for discard bio in submit_bio.
cfq: fix recursive call in cfq_blkiocg_update_completion_stats()
cfq-iosched: Fixed boot warning with BLK_CGROUP=y and CFQ_GROUP_IOSCHED=n
cfq: Don't allow queue merges for queues that have no process references
block: fix DISCARD_BARRIER requests
cciss: set SCSI max cmd len to 16, as default is wrong
cpqarray: fix two more wrong section type
cpqarray: fix wrong __init type on pci probe function
drbd: Fixed a race between disk-attach and unexpected state changes
writeback: fix pin_sb_for_writeback
writeback: add missing requeue_io in writeback_inodes_wb
writeback: simplify and split bdi_start_writeback
writeback: simplify wakeup_flusher_threads
writeback: fix writeback_inodes_wb from writeback_inodes_sb
writeback: enforce s_umount locking in writeback_inodes_sb
writeback: queue work on stack in writeback_inodes_sb
writeback: fix writeback completion notifications
list_for_each_entry_safe is not suitable to protect against concurrent
modification of the list. 6754af6 introduced a race in sb walking.
list_for_each_entry can use the trick of pinning the current entry in
the list before we drop and retake the lock because it subsequently
follows cur->next. However list_for_each_entry_safe saves n=cur->next
for following before entering the loop body, so when the lock is
dropped, n may be deleted.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Frank Mayhar <fmayhar@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We need to increase the total and used counters when allocating a new cap
in the non-reserved (cap import) case.
Signed-off-by: Sage Weil <sage@newdream.net>
We can drop caps with an mds request. Ensure we only drop unused AND
clean caps, since the MDS doesn't support cap writeback in that context,
nor do we track it. If caps are dirty, and the MDS needs them back, we
it will revoke and we will flush in the normal fashion.
This fixes a possibly loss of metadata.
Signed-off-by: Sage Weil <sage@newdream.net>
Implicit slab.h inclusion via percpu.h is about to go away. Make sure
gfp.h or slab.h is included as necessary.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
NFSv4: Fix an embarassing typo in encode_attrs()
NFSv4: Ensure that /proc/self/mountinfo displays the minor version number
NFSv4.1: Ensure that we initialise the session when following a referral
SUNRPC: Fix a re-entrancy bug in xs_tcp_read_calldir()
nfs4 use mandatory attribute file type in nfs4_get_root
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
ext3: update ctime when changing the file's permission by setfacl
ext2: update ctime when changing the file's permission by setfacl
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
MAINTAINERS: change mailing list address for CIFS
cifs: remove bogus first_time check in NTLMv2 session setup code
cifs: don't call cifs_new_fileinfo unless cifs_open succeeds
cifs: don't ignore cifs_posix_open_inode_helper return value
cifs: clean up arguments to cifs_open_inode_helper
cifs: pass instantiated filp back after open call
cifs: move cifs_new_fileinfo call out of cifs_posix_open
cifs: implement drop_inode superblock op
cifs: don't attempt busy-file rename unless it's in same directory
ext3 didn't update the ctime of the file when its permission was changed.
Steps to reproduce:
# touch aaa
# stat -c %Z aaa
1275289822
# setfacl -m 'u::x,g::x,o::x' aaa
# stat -c %Z aaa
1275289822 <- unchanged
But, according to the spec of the ctime, ext3 must update it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Jan Kara <jack@suse.cz>
ext2 didn't update the ctime of the file when its permission was changed.
Steps to reproduce:
# touch aaa
# stat -c %Z aaa
1275289822
# setfacl -m 'u::x,g::x,o::x' aaa
# stat -c %Z aaa
1275289822 <- unchanged
But, according to the spec of the ctime, ext2 must update it.
Port of ext3 patch by Miao Xie <miaox@cn.fujitsu.com>.
Signed-off-by: Jan Kara <jack@suse.cz>
We may not recurse for CHOOSE_LEAF if we start with a leaf node. When
that happens, the out2 vector needs to be filled in with the result.
Signed-off-by: Sage Weil <sage@newdream.net>
The block number comes from bulkstat based inode lookups to shortcut
the mapping calculations. We ar enot able to trust anything from
bulkstat, so drop the block number as well so that the correct
lookups and mappings are always done.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Inode numbers may come from somewhere external to the filesystem
(e.g. file handles, bulkstat information) and so are inherently
untrusted. Rename the flag we use for these lookups to make it
obvious we are doing a lookup of an untrusted inode number and need
to verify it completely before trying to read it from disk.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
When we decode a handle or do a bulkstat lookup, we are using an
inode number we cannot trust to be valid. If we are deleting inode
chunks from disk (default noikeep mode), then we cannot trust the on
disk inode buffer for any given inode number to correctly reflect
whether the inode has been unlinked as the di_mode nor the
generation number may have been updated on disk.
This is due to the fact that when we delete an inode chunk, we do
not write the clusters back to disk when they are removed - instead
we mark them stale to avoid them being written back potentially over
the top of something that has been subsequently allocated at that
location. The result is that we can have locations of disk that look
like they contain valid inodes but in reality do not. Hence we
cannot simply convert the inode number to a block number and read
the location from disk to determine if the inode is valid or not.
As a result, and XFS_IGET_BULKSTAT lookup needs to actually look the
inode up in the inode allocation btree to determine if the inode
number is valid or not.
It should be noted even on ikeep filesystems, there is the
possibility that blocks on disk may look like valid inode clusters.
e.g. if there are filesystem images hosted on the filesystem. Hence
even for ikeep filesystems we really need to validate that the inode
number is valid before issuing the inode buffer read.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The non-coherent bulkstat versionsthat look directly at the inode
buffers causes various problems with performance optimizations that
make increased use of just logging inodes. This patch makes bulkstat
always use iget, which should be fast enough for normal use with the
radix-tree based inode cache introduced a while ago.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
This patch prevents user "foo" from using the SWAPEXT ioctl to swap
a write-only file owned by user "bar" into a file owned by "foo" and
subsequently reading it. It does so by checking that the file
descriptors passed to the ioctl are also opened for reading.
Signed-off-by: Dan Rosenberg <dan.j.rosenberg@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Apparently, we have never been able to set the atime correctly from the
NFSv4 client.
Reported-by: 小倉一夫 <ka-ogura@bd6.so-net.ne.jp>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
Currently, we do not display the minor version mount parameter in the
/proc mount info.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
Put the code that is common to both the referral and ordinary mount cases
into a common helper routine.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
S_ISDIR(fsinfo.fattr->mode) checks the file type rather than the mode bits,
so we should be checking for the NFS_ATTR_FATTR_TYPE fattr property.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
This fixes a race between handle_reply finishing an mds request, signalling
completion, and then dropping the request structing and its dentry+inode
refs, and pre_umount function waiting for requests to finish before
letting the vfs tear down the dcache. If umount was delayed waiting for
mds requests, we could race and BUG in shrink_dcache_for_umount_subtree
because of a slow dput.
This delays umount until the msgr queue flushes, which means handle_reply
will exit and will have dropped the ceph_mds_request struct. I'm assuming
the VFS has already ensured that its calls have all completed and those
request refs have thus been dropped as well (I haven't seen that race, at
least).
Signed-off-by: Sage Weil <sage@newdream.net>
Handle a splice_dentry failure (due to a d_materialize_unique error)
without crashing. (Also, report the error code.)
Signed-off-by: Sage Weil <sage@newdream.net>
If the incremental osdmap has a new crush map, advance the position after
decoding so that we can parse the rest of the osdmap properly.
Signed-off-by: Sage Weil <sage@newdream.net>
This bug appears to be the result of a cut-and-paste mistake from the
NTLMv1 code. The function to generate the MAC key was commented out, but
not the conditional above it. The conditional then ended up causing the
session setup key not to be copied to the buffer unless this was the
first session on the socket, and that made all but the first NTLMv2
session setup fail.
Fix this by removing the conditional and all of the commented clutter
that made it difficult to see.
Cc: Stable <stable@kernel.org>
Reported-by: Gunther Deschner <gdeschne@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
It's currently possible for cifs_open to fail after it has already
called cifs_new_fileinfo. In that situation, the new fileinfo will be
leaked as the caller doesn't call fput. That in turn leads to a busy
inodes after umount problem since the fileinfo holds an extra inode
reference now. Shuffle cifs_open around a bit so that it only calls
cifs_new_fileinfo if it's going to succeed.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-and-Tested-by: Suresh Jayaraman <sjayaraman@suse.de>
...and ensure that we propagate the error back to avoid any surprises.
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Reviewed-and-Tested-by: Jeff Layton <jlayton@redhat.com>
...which takes a ton of unneeded arguments and does a lot more pointer
dereferencing than is really needed.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-and-Tested-by: Suresh Jayaraman <sjayaraman@suse.de>
The current scheme of sticking open files on a list and assuming that
cifs_open will scoop them off of it is broken and leads to "Busy
inodes after umount..." errors at unmount time.
The problem is that there is no guarantee that cifs_open will always
be called after a ->lookup or ->create operation. If there are
permissions or other problems, then it's quite likely that it *won't*
be called.
Fix this by fully instantiating the filp whenever the file is created
and pass that filp back to the VFS. If there is a problem, the VFS
can clean up the references.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-and-Tested-by: Suresh Jayaraman <sjayaraman@suse.de>
Having cifs_posix_open call cifs_new_fileinfo is problematic and
inconsistent with how "regular" opens work. It's also buggy as
cifs_reopen_file calls this function on a reconnect, which creates a new
struct cifsFileInfo that just gets leaked.
Push it out into the callers. This also allows us to get rid of the
"mnt" arg to cifs_posix_open.
Finally, in the event that a cifsFileInfo isn't or can't be created, we
always want to close the filehandle out on the server as the client
won't have a record of the filehandle and can't actually use it. Make
sure that CIFSSMBClose is called in those cases.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-and-Tested-by: Suresh Jayaraman <sjayaraman@suse.de>
In commit 6b82021b9e, we increase
our local alloc size and calculate how much megabytes we can
get according to group size and volume size.
But we also need to check the maximum bits a local alloc block
bitmap can have. With a bs=512, cs=32K, local volume with 160G,
it calculate 96MB while the maximum local alloc size is only
76M. So the bitmap will overflow and corrupt the system truncate
log file. See bug
http://oss.oracle.com/bugzilla/show_bug.cgi?id=1262
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
We used to let orphan scan work in the default work queue,
but there is a corner case which will make the system deadlock.
The scenario is like this:
1. set heartbeat threadshold to 200. this will allow us to have a
great chance to have a orphan scan work before our quorum decision.
2. mount node 1.
3. after 1~2 minutes, mount node 2(in order to make the bug easier
to reproduce, better add maxcpus=1 to kernel command line).
4. node 1 do orphan scan work.
5. node 2 do orphan scan work.
6. node 1 do orphan scan work. After this, node 1 hold the orphan scan
lock while node 2 know node 1 is the master.
7. ifdown eth2 in node 2(eth2 is what we do ocfs2 interconnection).
Now when node 2 begins orphan scan, the system queue is blocked.
The root cause is that both orphan scan work and quorum decision work
will use the system event work queue. orphan scan has a chance of
blocking the event work queue(in dlm_wait_for_node_death) so that there
is no chance for quorum decision work to proceed.
This patch resolve it by moving orphan scan work to ocfs2_wq.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Add a spin_unlock missing on the error path. Unlock as in the other code
that leads to the leave label.
The semantic match that finds this problem is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
expression E1;
@@
* spin_lock(E1,...);
<+... when != E1
if (...) {
... when != E1
* return ...;
}
...+>
* spin_unlock(E1,...);
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Some bogus firmwares include properties with "/" in their name. This
causes problems when creating the /proc/device-tree file system,
because the slash is taken to indicate a directory.
We don't care about those properties, and we don't want to encourage
them, so just throw them away when creating /proc/device-tree.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Tested-by: Christian Kujau <lists@nerdbynature.de>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
We need to properly initialize skip, as not all alloc_msg op instances
set it.
Also, BUG if someone says skip but also allocates a message.
Signed-off-by: Sage Weil <sage@newdream.net>
The standard behavior for drop_inode is to delete the inode when the
last reference to it is put and the nlink count goes to 0. This helps
keep inodes that are still considered "not deleted" in cache as long as
possible even when there aren't dentries attached to them.
When server inode numbers are disabled, it's not possible for cifs_iget
to ever match an existing inode (since inode numbers are generated via
iunique). In this situation, cifs can keep a lot of inodes in cache that
will never be used again.
Implement a drop_inode routine that deletes the inode if server inode
numbers are disabled on the mount. This helps keep the cifs inode
caches down to a more manageable size when server inode numbers are
disabled.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Busy-file renames don't actually work across directories, so we need
to limit this code to renames within the same dir.
This fixes the bug detailed here:
https://bugzilla.redhat.com/show_bug.cgi?id=591938
Signed-off-by: Jeff Layton <jlayton@redhat.com>
CC: Stable <stable@kernel.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: The file argument for fsync() is never null
Btrfs: handle ERR_PTR from posix_acl_from_xattr()
Btrfs: avoid BUG when dropping root and reference in same transaction
Btrfs: prohibit a operation of changing acl's mask when noacl mount option used
Btrfs: should add a permission check for setfacl
Btrfs: btrfs_lookup_dir_item() can return ERR_PTR
Btrfs: btrfs_read_fs_root_no_name() returns ERR_PTRs
Btrfs: unwind after btrfs_start_transaction() errors
Btrfs: btrfs_iget() returns ERR_PTR
Btrfs: handle kzalloc() failure in open_ctree()
Btrfs: handle error returns from btrfs_lookup_dir_item()
Btrfs: Fix BUG_ON for fs converted from extN
Btrfs: Fix null dereference in relocation.c
Btrfs: fix remap_file_pages error
Btrfs: uninitialized data is check_path_shared()
Btrfs: fix fallocate regression
Btrfs: fix loop device on top of btrfs
The "file" argument for fsync is never null so we can remove this check.
What drew my attention here is that 7ea8085910: "drop unused dentry
argument to ->fsync" introduced an unconditional dereference at the
start of the function and that generated a smatch warning.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
posix_acl_from_xattr() returns both ERR_PTRs and null, but it's OK to
pass null values to set_cached_acl()
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If btrfs_ioctl_snap_destroy() deletes a snapshot but finishes
with end_transaction(), the cleaner kthread may come in and
drop the root in the same transaction. If that's the case, the
root's refs still == 1 in the tree when btrfs_del_root() deletes
the item, because commit_fs_roots() hasn't updated it yet (that
happens during the commit).
This wasn't a problem before only because
btrfs_ioctl_snap_destroy() would commit the transaction before dropping
the dentry reference, so the dead root wouldn't get queued up until
after the fs root item was updated in the btree.
Since it is not an error to drop the root reference and the root in the
same transaction, just drop the BUG_ON() in btrfs_del_root().
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
when used Posix File System Test Suite(pjd-fstest) to test btrfs,
some cases about setfacl failed when noacl mount option used.
I simplified used commands in pjd-fstest, and the following steps
can reproduce it.
------------------------
# cd btrfs-part/
# mkdir aaa
# setfacl -m m::rw aaa <- successed, but not expected by pjd-fstest.
------------------------
I checked ext3, a warning message occured, like as:
setfacl: aaa/: Operation not supported
Certainly, it's expected by pjd-fstest.
So, i compared acl.c of btrfs and ext3. Based on that, a patch created.
Fortunately, it works.
Signed-off-by: Shi Weihua <shiwh@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
On btrfs, do the following
------------------
# su user1
# cd btrfs-part/
# touch aaa
# getfacl aaa
# file: aaa
# owner: user1
# group: user1
user::rw-
group::rw-
other::r--
# su user2
# cd btrfs-part/
# setfacl -m u::rwx aaa
# getfacl aaa
# file: aaa
# owner: user1
# group: user1
user::rwx <- successed to setfacl
group::rw-
other::r--
------------------
but we should prohibit it that user2 changing user1's acl.
In fact, on ext3 and other fs, a message occurs:
setfacl: aaa: Operation not permitted
This patch fixed it.
Signed-off-by: Shi Weihua <shiwh@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_lookup_dir_item() can return either ERR_PTRs or null.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_read_fs_root_no_name() returns ERR_PTRs on error so I added a
check for that. It's not clear to me if it can also return NULL
pointers or not so I left the original NULL pointer check as is.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This was added by a22285a6a3: "Btrfs: Integrate metadata reservation
with start_transaction". If we goto out here then we skip all the
unwinding and there are locks still held etc.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_iget() returns an ERR_PTR() on failure and not null.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Unwind and return -ENOMEM if the allocation fails here.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If btrfs_lookup_dir_item() fails, we should can just let the mount fail
with an error.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Tree blocks can live in data block groups in FS converted from extN.
So it's easy to trigger the BUG_ON.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Fix a potential null dereference in relocation.c
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Acked-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: try to send partial cap release on cap message on missing inode
ceph: release cap on import if we don't have the inode
ceph: fix misleading/incorrect debug message
ceph: fix atomic64_t initialization on ia64
ceph: fix lease revocation when seq doesn't match
ceph: fix f_namelen reported by statfs
ceph: fix memory leak in statfs
ceph: fix d_subdirs ordering problem
when we use remap_file_pages() to remap a file, remap_file_pages always return
error. It is because btrfs didn't set VM_CAN_NONLINEAR for vma.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
refs can be used with uninitialized data if btrfs_lookup_extent_info()
fails on the first pass through the loop. In the original code if that
happens then check_path_shared() probably returns 1, this patch
changes it to return 1 for safety.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Seems that when btrfs_fallocate was converted to use the new ENOSPC stuff we
dropped passing the mode to the function that actually does the preallocation.
This breaks anybody who wants to use FALLOC_FL_KEEP_SIZE. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We cannot use the loop device which has been connected to a file in the btrf
The reproduce steps is following:
# dd if=/dev/zero of=vdev0 bs=1M count=1024
# losetup /dev/loop0 vdev0
# mkfs.btrfs /dev/loop0
...
failed to zero device start -5
The reason is that the btrfs don't implement either ->write_begin or ->write
the VFS API, so we fix it by setting ->write to do_sync_write().
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We need to check for s_instances to make sure we don't bother working
against a filesystem that is beeing unmounted, and we need to call
put_super to make sure a superblock is freed when we race against
umount. Also no need to keep sb_lock after we got a reference on it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
In "writeback: fix writeback_inodes_wb from writeback_inodes_sb" I
accidentally removed the requeue_io if we need to skip a superblock
because we can't pin it. Add it back, otherwise we're getting spurious
lockups after multiple xfstests runs.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
bdi_start_writeback now never gets a superblock passed, so we can just remove
that case. And to further untangle the code and flatten the call stack
split it into two trivial helpers for it's two callers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
bdi_writeback_all only has one caller, so fold it to simplify the code and
flatten the call stack.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
When we call writeback_inodes_wb from writeback_inodes_sb we always have
s_umount held, which currently makes the whole operation a no-op.
But if we are called to write out inodes for a specific superblock we always
have s_umount held, so replace the incorrect logic checking for WB_SYNC_ALL
which only worked by coincidence with the proper check for an explicit
superblock argument.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Make sure that not only sync_filesystem but all callers of writeback_inodes_sb
have the superblock protected against remount. As-is this disables all
functionality for these callers, but the next patch relies on this locking to
fix writeback_inodes_sb for sync_filesystem.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
If we want to rely on s_umount in the caller we need to wait for completion
of the I/O submission before returning to the caller. Refactor
bdi_sync_writeback into a bdi_queue_work_onstack helper and use it for this
case.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
The code dealing with bdi_work->state and completion of a bdi_work is a
major mess currently. This patch makes sure we directly use one set of
flags to deal with it, and use it consistently, which means:
- always notify about completion from the rcu callback. We only ever
wait for it from on-stack callers, so this simplification does not
even cause a theoretical slowdown currently. It also makes sure we
don't miss out on the notification if we ever add other callers to
wait for it.
- make earlier completion notification depending on the on-stack
allocation, not the sync mode. If we introduce new callers that
want to do WB_SYNC_NONE writeback from on-stack callers this will
be nessecary.
Also rename bdi_wait_on_work_clear to bdi_wait_on_work_done and inline
a few small functions into their only caller to make the code
understandable.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
If we have enough memory to allocate a new cap release message, do so, so
that we can send a partial release message immediately. This keeps us from
making the MDS wait when the cap release it needs is in a partially full
release message.
If we fail because of ENOMEM, oh well, they'll just have to wait a bit
longer.
Signed-off-by: Sage Weil <sage@newdream.net>
If we get an IMPORT that give us a cap, but we don't have the inode, queue
a release (and try to send it immediately) so that the MDS doesn't get
stuck waiting for us.
Signed-off-by: Sage Weil <sage@newdream.net>
bdi_seq is an atomic_long_t but we're using ATOMIC_INIT, which causes
build failures on ia64. This patch fixes it to use ATOMIC_LONG_INIT.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Sage Weil <sage@newdream.net>
As it stands this check compares the number of pages to the page size.
This makes no sense and makes the fcntl fail in almost any sane case.
Fix it by checking if nr_pages is not zero (it can become zero only if
arg is too big and round_pipe_size() overflows).
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
pipe_set_size() needs to copy pipe bufs from the old circular buffer
to the new.
The current code gets this wrong in multiple ways, resulting in oops.
Test program is available here:
http://www.kernel.org/pub/linux/kernel/people/mszeredi/piperesize/
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
We do the same BUG_ON() just a line later when calling into
__bd_abort_claiming().
Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
I don't like the subtle multi-context code in bd_claim (ie. detects where it
has been called based on bd_claiming). It seems clearer to just require a new
function to finish a 2-part claim.
Also improve commentary in bd_start_claiming as to how it should
be used.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
bd_start_claiming has an unbalanced module_put introduced in 6b4517a79.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* 'for-2.6.35' of git://linux-nfs.org/~bfields/linux:
nfsd4: shut down callback queue outside state lock
nfsd: nfsd_setattr needs to call commit_metadata
Now that the background flush code has been fixed, we shouldn't need to
silently multiply the wbc->nr_to_write to get good writeback. Remove
that code.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reportedly causes a lockdep warning on nfsd shutdown. That looks
like a false positive to me, but there's no reason why this needs the
state lock anyway.
Reported-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
* git://git.infradead.org/~dwmw2/mtd-2.6.35:
jffs2: update ctime when changing the file's permission by setfacl
jffs2: Fix NFS race by using insert_inode_locked()
jffs2: Fix in-core inode leaks on error paths
mtd: Fix NAND submenu
mtd/r852: update card detect early.
mtd/r852: Fixes in case of DMA timeout
mtd/r852: register IRQ as last step
drivers/mtd: Use memdup_user
docbook: make mtd nand module init static
jffs2 didn't update the ctime of the file when its permission was changed.
Steps to reproduce:
# touch aaa
# stat -c %Z aaa
1275289822
# setfacl -m 'u::x,g::x,o::x' aaa
# stat -c %Z aaa
1275289822 <- unchanged
But, according to the spec of the ctime, jffs2 must update it.
Port of ext3 patch by Miao Xie <miaox@cn.fujitsu.com>.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
A few functions were still modifying i_flags in a racy manner.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
xfs: improve xfs_isilocked
xfs: skip writeback from reclaim context
xfs: remove done roadmap item from xfs-delayed-logging-design.txt
xfs: fix race in inode cluster freeing failing to stale inodes
xfs: fix access to upper inodes without inode64
xfs: fix might_sleep() warning when initialising per-ag tree
fs/xfs/quota: Add missing mutex_unlock
xfs: remove duplicated #include
xfs: convert more trace events to DEFINE_EVENT
xfs: xfs_trace.c: remove duplicated #include
xfs: Check new inode size is OK before preallocating
xfs: clean up xlog_align
xfs: cleanup log reservation calculactions
xfs: be more explicit if RT mount fails due to config
xfs: replace E2BIG with EFBIG where appropriate
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
nilfs2: remove obsolete declarations of cache constructor and destructor
nilfs2: fix style issue in nilfs_destroy_cachep
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
Minix: Clean up left over label
fix truncate inode time modification breakage
fix setattr error handling in sysfs, configfs
fcntl: return -EFAULT if copy_to_user fails
wrong type for 'magic' argument in simple_fill_super()
fix the deadlock in qib_fs
mqueue doesn't need make_bad_inode()
* 'for-linus' of git://git.kernel.dk/linux-2.6-block: (27 commits)
block: make blk_init_free_list and elevator_init idempotent
block: avoid unconditionally freeing previously allocated request_queue
pipe: change /proc/sys/fs/pipe-max-pages to byte sized interface
pipe: change the privilege required for growing a pipe beyond system max
pipe: adjust minimum pipe size to 1 page
block: disable preemption before using sched_clock()
cciss: call BUG() earlier
Preparing 8.3.8rc2
drbd: Reduce verbosity
drbd: use drbd specific ratelimit instead of global printk_ratelimit
drbd: fix hang on local read errors while disconnected
drbd: Removed the now empty w_io_error() function
drbd: removed duplicated #includes
drbd: improve usage of MSG_MORE
drbd: need to set socket bufsize early to take effect
drbd: improve network latency, TCP_QUICKACK
drbd: Revert "drbd: Create new current UUID as late as possible"
brd: support discard
Revert "writeback: fix WB_SYNC_NONE writeback from umount"
Revert "writeback: ensure that WB_SYNC_NONE writeback with sb pinned is sync"
...
The data chunk is mmaped with 'len' which remains unchanged, so use that
when unmapping in the error path rather than trying to recalculate (and
incorrectly so) the value used originally.
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Acked-by: David McCullough <davidm@snapgear.com>
Acked-by: Greg Ungerer <gerg@uclinux.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The stack and data have different alignment requirements, so don't force
them to wear the same shoe. Increase the data alignment to match that
which the elf2flt linker script has always been using: 0x20 bytes. Not
only does this bring the kernel loader in line with the toolchain, but it
also fixes a swath of gcc tests which try to force larger alignment values
but randomly fail when the FLAT loader fails to deliver.
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: David Woodhouse <David.Woodhouse@intel.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: David McCullough <davidm@snapgear.com>
Acked-by: Greg Ungerer <gerg@uclinux.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Tested-by: Michal Simek <monstr@monstr.eu>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Jie Zhang <jie@codesourcery.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A call to access_ok is missing a compat_ptr conversion. Introduced with
b83733639a "compat: factor out
compat_rw_copy_check_uvector from compat_do_readv_writev"
fs/compat.c: In function 'compat_rw_copy_check_uvector':
fs/compat.c:629: warning: passing argument 1 of '__access_ok' makes pointer from integer without a cast
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mtime and ctime should be changed only if the file size has actually
changed. Patches changing ext2 and tmpfs from vmtruncate to new truncate
sequence has caused regressions where they always update timestamps.
There is some strange cases in POSIX where truncate(2) must not update
times unless the size has acutally changed, see 6e656be89.
This area is all still rather buggy in different ways in a lot of
filesystems and needs a cleanup and audit (ideally the vfs will provide
a simple attribute or call to direct all filesystems exactly which
attributes to change). But coming up with the best solution will take a
while and is not appropriate for rc anyway.
So fix recent regression for now.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
sysfs and configfs setattr functions have error cases after the generic inode's
attributes have been changed. Fix consistency by changing the generic inode
attributes only when it is guaranteed to succeed.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
copy_to_user() returns the number of bytes remaining, but we want to
return -EFAULT.
ret = fcntl(fd, F_SETOWN_EX, NULL);
With the original code ret would be 8 here.
V2: Takuya Yoshikawa pointed out a similar issue in f_getown_ex()
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
It's used to superblock ->s_magic, which is unsigned long.
Signed-off-by: Roberto Sassu <roberto.sassu@polito.it>
Reviewed-by: Mimi Zohar <zohar@us.ibm.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
CC: stable@kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
sysfs and configfs setattr functions have error cases after the generic inode's
attributes have been changed. Fix consistency by changing the generic inode
attributes only when it is guaranteed to succeed.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
If the client revokes a lease with a higher seq than what we have, keep
the mds's seq, so that it honors our release. Otherwise, we can hang
indefinitely.
Signed-off-by: Sage Weil <sage@newdream.net>
This changes the interface to be based on bytes instead. The API
matches that of F_SETPIPE_SZ in that it rounds up the passed in
size so that the resulting page array is a power-of-2 in size.
The proc file is renamed to /proc/sys/fs/pipe-max-size to
reflect this change.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Change it to CAP_SYS_RESOURCE, as that more accurately models what
we want to control.
Suggested-by: Michael Kerrisk <mtk.manpages@googlemail.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
We don't need to pages to guarantee the POSIX requirement
that upto a page size write must be atomic to an empty
pipe.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
New inodes need to be locked as we're creating them, so they don't get used
by other things (like NFSd) before they're ready.
Pointed out by Al Viro.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Use rwsem_is_locked to make the assertations for shared locks work.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Allowing writeback from reclaim context causes massive problems with stack
overflows as we can call into the writeback code which tends to be a heavy
stack user both in the generic code and XFS from random contexts that
perform memory allocations.
Follow the example of btrfs (and in slightly different form ext4) and refuse
to write out data from reclaim context. This issue should really be handled
by the VM so that we can tune better for this case, but until we get it
sorted out there we have to hack around this in each filesystem with a
complex writeback path.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
When an inode cluster is freed, it needs to mark all inodes in memory as
XFS_ISTALE before marking the buffer as stale. This is eeded because the inodes
have a different life cycle to the buffer, and once the buffer is torn down
during transaction completion, we must ensure none of the inodes get written
back (which is what XFS_ISTALE does).
Unfortunately, xfs_ifree_cluster() has some bugs that lead to inodes not being
marked with XFS_ISTALE. This shows up when xfs_iflush() is called on these
inodes either during inode reclaim or tail pushing on the AIL. The buffer is
read back, but no longer contains inodes and so triggers assert failures and
shutdowns. This was reproducable with at run.dbench10 invocation from xfstests.
There are two main causes of xfs_ifree_cluster() failing. The first is simple -
it checks in-memory inodes it finds in the per-ag icache to see if they are
clean without holding the flush lock. if they are clean it skips them
completely. However, If an inode is flushed delwri, it will
appear clean, but is not guaranteed to be written back until the flush lock has
been dropped. Hence we may have raced on the clean check and the inode may
actually be dirty. Hence always mark inodes found in memory stale before we
check properly if they are clean.
The second is more complex, and makes the first problem easier to hit.
Basically the in-memory inode scan is done with full knowledge it can be racing
with inode flushing and AIl tail pushing, which means that inodes that it can't
get the flush lock on might not be attached to the buffer after then in-memory
inode scan due to IO completion occurring. This is actually documented in the
code as "needs better interlocking". i.e. this is a zero-day bug.
Effectively, the in-memory scan must be done while the inode buffer is locked
and Io cannot be issued on it while we do the in-memory inode scan. This
ensures that inodes we couldn't get the flush lock on are guaranteed to be
attached to the cluster buffer, so we can then catch all in-memory inodes and
mark them stale.
Now that the inode cluster buffer is locked before the in-memory scan is done,
there is no need for the two-phase update of the in-memory inodes, so simplify
the code into two loops and remove the allocation of the temporary buffer used
to hold locked inodes across the phases.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Dan Roseberg has reported a problem with the MOVE_EXT ioctl. If the
donor file is an append-only file, we should not allow the operation
to proceed, lest we end up overwriting the contents of an append-only
file.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Dan Rosenberg <dan.j.rosenberg@gmail.com>
We were setting f_namelen in kstatfs to PATH_MAX instead of NAME_MAX.
That disagrees with ceph_lookup behavior (which checks against NAME_MAX),
and also makes the pjd posix test suite spit out ugly errors because with
can't clean up its temporary files.
Signed-off-by: Sage Weil <sage@newdream.net>
We misused list_move_tail() to order the dentry in d_subdirs.
This will screw up the d_subdirs order.
This bug can be reliably reproduced by:
1. mount ceph fs.
2. on ceph fs, git clone git://ceph.newdream.net/git/ceph.git
3. Run autogen.sh in ceph directory.
(Note: Errors only occur at the first time you run autogen.sh.)
Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
The conversion of write_inode_now calls to commit_metadata in commit
f501912a35 missed out the call in nfsd_setattr.
But without this conversion we can't guarantee that a SETATTR request
has actually been commited to disk with XFS, which causes a regression
from 2.6.32 (only for NFSv2, but anyway).
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: stable@kernel.org
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
fscache_write_op() makes unnecessary checks of the page variable to see if it
is NULL. It can't be NULL at those points as the kernel would already have
crashed a little higher up where we examined page->index.
Furthermore, unless radix_tree_gang_lookup_tag() can return 1 but no page, a
NULL pointer crash should not be encountered there as we can only get there if
r_t_g_l_t() returned 1.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 315e995c63 is causing OOM kills
when stress-testing a CIFS filesystem. The VFS readpages operation takes
a page reference. The older code just handed this reference off to the
page cache, but the new code takes an extra one. The simplest fix is to
put the new reference after add_to_page_cache_lru.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Fix a possible null pointer dereference in afs_alloc_server(): the server
pointer is NULL if there was an allocation failure, and under such a
condition, we can't dereference it in the _leave() statement.
Signed-off-by: Denis Kirjanov <dkirjanov@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
clear_user() returns the number of bytes that could not be copied rather than
an error code. So we should return -EFAULT rather than directly returning the
results.
Without this patch, positive values may be returned to elf_fdpic_map_file()
and the following error handlings do not function as expected.
1.
ret = elf_fdpic_map_file_constdisp_on_uclinux(params, file, mm);
if (ret < 0)
return ret;
2.
ret = elf_fdpic_map_file_by_direct_mmap(params, file, mm);
if (ret < 0)
return ret;
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
CC: Alexander Viro <viro@zeniv.linux.org.uk>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Daisuke HATAYAMA <d.hatayama@jp.fujitsu.com>
CC: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit e913fc825d.
We are investigating a hang associated with the WB_SYNC_NONE changes,
so revert them for now.
Conflicts:
fs/fs-writeback.c
mm/page-writeback.c
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
This reverts commit 7c8a3554c6.
We are investigating a hang associated with the WB_SYNC_NONE changes,
so revert them for now.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
The commit 41c88bd7 ("nilfs2: cleanup multi
kmem_cache_{create,destroy} code") consolidated slab constructors and
destructors used in nilfs, but it left some declarations in header
files.
This gets rid of the obsolete declarations.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This gets rid of unwanted space chars in front of conditional
sentences of nilfs_destroy_cachep().
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
mm: export generic_pipe_buf_*() to modules
fuse: support splice() reading from fuse device
fuse: allow splice to move pages
mm: export remove_from_page_cache() to modules
mm: export lru_cache_add_*() to modules
fuse: support splice() writing to fuse device
fuse: get page reference for readpages
fuse: use get_user_pages_fast()
fuse: remove unneeded variable
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
quota: Convert quota statistics to generic percpu_counter
ext3 uses rb_node = NULL; to zero rb_root.
quota: Fixup dquot_transfer
reiserfs: Fix resuming of quotas on remount read-write
pohmelfs: Remove dead quota code
ufs: Remove dead quota code
udf: Remove dead quota code
quota: rename default quotactl methods to dquot_
quota: explicitly set ->dq_op and ->s_qcop
quota: drop remount argument to ->quota_on and ->quota_off
quota: move unmount handling into the filesystem
quota: kill the vfs_dq_off and vfs_dq_quota_on_remount wrappers
quota: move remount handling into the filesystem
ocfs2: Fix use after free on remount read-only
Fix up conflicts in fs/ext4/super.c and fs/ufs/file.c
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: clean up on forwarded aborted mds request
ceph: fix leak of osd authorizer
ceph: close out mds, osd connections before stopping auth
ceph: make lease code DN specific
fs/ceph: Use ERR_CAST
ceph: renew auth tickets before they expire
ceph: do not resend mon requests on auth ticket renewal
ceph: removed duplicated #includes
ceph: avoid possible null dereference
ceph: make mds requests killable, not interruptible
sched: add wait_for_completion_killable_timeout
If an mds request is aborted (timeout, SIGKILL), it is left registered to
keep our state in sync with the mds. If we get a forward notification,
though, we know the request didn't succeed and we can unregister it
safely. We were trying to resend it, but then bailing out (and not
unregistering) in __do_request.
Signed-off-by: Sage Weil <sage@newdream.net>
The auth module (part of the mon_client) is needed to free any
ceph_authorizer(s) used by the mds and osd connections. Flush the msgr
workqueue before stopping monc to ensure that the destroy_authorizer
auth op is available when those connections are closed out.
Signed-off-by: Sage Weil <sage@newdream.net>
The lease code includes a mask in the CEPH_LOCK_* namespace, but that
namespace is changing, and only one mask (formerly _DN == 1) is used, so
hard code for that value for now.
If we ever extend this code to handle leases over different data types we
can extend it accordingly.
Signed-off-by: Sage Weil <sage@newdream.net>
Use ERR_CAST(x) rather than ERR_PTR(PTR_ERR(x)). The former makes more
clear what is the purpose of the operation, which otherwise looks like a
no-op.
In the case of fs/ceph/inode.c, ERR_CAST is not needed, because the type of
the returned value is the same as the type of the enclosing function.
The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
type T;
T x;
identifier f;
@@
T f (...) { <+...
- ERR_PTR(PTR_ERR(x))
+ x
...+> }
@@
expression x;
@@
- ERR_PTR(PTR_ERR(x))
+ ERR_CAST(x)
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Sage Weil <sage@newdream.net>
We were only requesting renewal after our tickets expire; do so before
that. Most of the low-level logic for this was already there; just use
it.
Signed-off-by: Sage Weil <sage@newdream.net>
We only want to send pending mon requests when we successfully
authenticate. If we are already authenticated, like when we renew our
ticket, there is no need to resend pending requests.
Signed-off-by: Sage Weil <sage@newdream.net>
fs/ceph/auth.c: linux/slab.h is included more than once.
fs/ceph/super.h: linux/slab.h is included more than once.
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Signed-off-by: Sage Weil <sage@newdream.net>
ac->ops may be null; use protocol id in error message instead.
Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
The underlying problem is that many mds requests can't be restarted. For
example, a restarted create() would return -EEXIST if the original request
succeeds. However, we do not want a hung MDS to hang the client too. So,
use the _killable wait_for_completion variants to abort on SIGKILL but
nothing else.
Signed-off-by: Sage Weil <sage@newdream.net>
If a filesystem is mounted without the inode64 mount option we
should still be able to access inodes not fitting into 32 bits, just
not created new ones. For this to work we need to make sure the
inode cache radix tree is initialized for all allocation groups, not
just those we plan to allocate inodes from. This patch makes sure
we initialize the inode cache radix tree for all allocation groups,
and also cleans xfs_initialize_perag up a bit to separate the
inode32 logical from the general perag structure setup.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
The use of radix_tree_preload() only works if the radix tree was
initialised without the __GFP_WAIT flag. The per-ag tree uses
GFP_NOFS, so does not trigger allocation of new tree nodes from the
preloaded array. Hence it enters the allocator with a spinlock held
and triggers the might_sleep() warnings.
Reported-by; Chris Mason <chris.mason@oracle.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Add a mutex_unlock missing on the error path. The use of this lock
is balanced elsewhere in the file.
The semantic match that finds this problem is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
expression E1;
@@
* mutex_lock(E1,...);
<+... when != E1
if (...) {
... when != E1
* return ...;
}
...+>
* mutex_unlock(E1,...);
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Alex Elder <aelder@sgi.com>
Use DECLARE_EVENT_CLASS, and save ~15K:
text data bss dec hex filename
171949 43028 48 215025 347f1 fs/xfs/linux-2.6/xfs_trace.o.orig
156521 43028 36 199585 30ba1 fs/xfs/linux-2.6/xfs_trace.o
No change in functionality.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Alex Elder <aelder@sgi.com>
The new xfsqa test 228 tries to preallocate more space than the
filesystem contains. it should fail, but instead triggers an assert
about lock flags. The failure is due to the size extension failing
in vmtruncate() due to rlimit being set. Check this before we start
the preallocation to avoid allocating space that will never be used.
Also the path through xfs_vn_allocate already holds the IO lock, so
it should not be present in the lock flags when the setattr fails.
Hence the assert needs to take this into account. This will prevent
other such callers from hitting this incorrect ASSERT.
(Fixed a reference to "newsize" to read "new_size". -Alex)
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>