linux-next

mirror of https://github.com/edk2-porting/linux-next.git synced 2024-12-26 06:04:14 +08:00

Author	SHA1	Message	Date
Dan Carpenter	cf1e99a4e0	Btrfs: btrfs_lookup_dir_item() can return ERR_PTR btrfs_lookup_dir_item() can return either ERR_PTRs or null. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-06-11 15:57:37 -04:00
Dan Carpenter	3140c9a34b	Btrfs: btrfs_read_fs_root_no_name() returns ERR_PTRs btrfs_read_fs_root_no_name() returns ERR_PTRs on error so I added a check for that. It's not clear to me if it can also return NULL pointers or not so I left the original NULL pointer check as is. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-06-11 15:57:36 -04:00
Dan Carpenter	d327099a23	Btrfs: unwind after btrfs_start_transaction() errors This was added by `a22285a6a3`: "Btrfs: Integrate metadata reservation with start_transaction". If we goto out here then we skip all the unwinding and there are locks still held etc. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-06-11 15:57:35 -04:00
Dan Carpenter	4cbd1149fb	Btrfs: btrfs_iget() returns ERR_PTR btrfs_iget() returns an ERR_PTR() on failure and not null. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-06-11 15:57:35 -04:00
Dan Carpenter	676e4c8639	Btrfs: handle kzalloc() failure in open_ctree() Unwind and return -ENOMEM if the allocation fails here. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-06-11 15:57:34 -04:00
Dan Carpenter	fb4f6f910c	Btrfs: handle error returns from btrfs_lookup_dir_item() If btrfs_lookup_dir_item() fails, we should can just let the mount fail with an error. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-06-11 15:57:33 -04:00
Yan, Zheng	3bf84a5a83	Btrfs: Fix BUG_ON for fs converted from extN Tree blocks can live in data block groups in FS converted from extN. So it's easy to trigger the BUG_ON. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-06-11 15:48:35 -04:00
Yan, Zheng	046f264f6b	Btrfs: Fix null dereference in relocation.c Fix a potential null dereference in relocation.c Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Acked-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-06-11 15:48:34 -04:00
Linus Torvalds	63c70a0d7b	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: ceph: try to send partial cap release on cap message on missing inode ceph: release cap on import if we don't have the inode ceph: fix misleading/incorrect debug message ceph: fix atomic64_t initialization on ia64 ceph: fix lease revocation when seq doesn't match ceph: fix f_namelen reported by statfs ceph: fix memory leak in statfs ceph: fix d_subdirs ordering problem	2010-06-11 09:52:23 -07:00
Miao Xie	058a457ef0	Btrfs: fix remap_file_pages error when we use remap_file_pages() to remap a file, remap_file_pages always return error. It is because btrfs didn't set VM_CAN_NONLINEAR for vma. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-06-11 11:46:12 -04:00
Dan Carpenter	0e4dcbef1c	Btrfs: uninitialized data is check_path_shared() refs can be used with uninitialized data if btrfs_lookup_extent_info() fails on the first pass through the loop. In the original code if that happens then check_path_shared() probably returns 1, this patch changes it to return 1 for safety. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-06-11 11:46:12 -04:00
Josef Bacik	8360977972	Btrfs: fix fallocate regression Seems that when btrfs_fallocate was converted to use the new ENOSPC stuff we dropped passing the mode to the function that actually does the preallocation. This breaks anybody who wants to use FALLOC_FL_KEEP_SIZE. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-06-11 11:46:12 -04:00
Miao Xie	4a001071d3	Btrfs: fix loop device on top of btrfs We cannot use the loop device which has been connected to a file in the btrf The reproduce steps is following: # dd if=/dev/zero of=vdev0 bs=1M count=1024 # losetup /dev/loop0 vdev0 # mkfs.btrfs /dev/loop0 ... failed to zero device start -5 The reason is that the btrfs don't implement either ->write_begin or ->write the VFS API, so we fix it by setting ->write to do_sync_write(). Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-06-11 11:46:11 -04:00
Christoph Hellwig	29cb48594b	writeback: fix pin_sb_for_writeback We need to check for s_instances to make sure we don't bother working against a filesystem that is beeing unmounted, and we need to call put_super to make sure a superblock is freed when we race against umount. Also no need to keep sb_lock after we got a reference on it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-06-11 12:58:08 +02:00
Christoph Hellwig	334132ae92	writeback: add missing requeue_io in writeback_inodes_wb In "writeback: fix writeback_inodes_wb from writeback_inodes_sb" I accidentally removed the requeue_io if we need to skip a superblock because we can't pin it. Add it back, otherwise we're getting spurious lockups after multiple xfstests runs. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-06-11 12:58:08 +02:00
Christoph Hellwig	c5444198ca	writeback: simplify and split bdi_start_writeback bdi_start_writeback now never gets a superblock passed, so we can just remove that case. And to further untangle the code and flatten the call stack split it into two trivial helpers for it's two callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-06-11 12:58:08 +02:00
Christoph Hellwig	b8c2f3474f	writeback: simplify wakeup_flusher_threads bdi_writeback_all only has one caller, so fold it to simplify the code and flatten the call stack. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-06-11 12:58:07 +02:00
Christoph Hellwig	d19de7edf5	writeback: fix writeback_inodes_wb from writeback_inodes_sb When we call writeback_inodes_wb from writeback_inodes_sb we always have s_umount held, which currently makes the whole operation a no-op. But if we are called to write out inodes for a specific superblock we always have s_umount held, so replace the incorrect logic checking for WB_SYNC_ALL which only worked by coincidence with the proper check for an explicit superblock argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-06-11 12:58:07 +02:00
Christoph Hellwig	cf37e97247	writeback: enforce s_umount locking in writeback_inodes_sb Make sure that not only sync_filesystem but all callers of writeback_inodes_sb have the superblock protected against remount. As-is this disables all functionality for these callers, but the next patch relies on this locking to fix writeback_inodes_sb for sync_filesystem. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-06-11 12:58:07 +02:00
Christoph Hellwig	3c4d716538	writeback: queue work on stack in writeback_inodes_sb If we want to rely on s_umount in the caller we need to wait for completion of the I/O submission before returning to the caller. Refactor bdi_sync_writeback into a bdi_queue_work_onstack helper and use it for this case. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-06-11 12:58:07 +02:00
Christoph Hellwig	7f0e7bed93	writeback: fix writeback completion notifications The code dealing with bdi_work->state and completion of a bdi_work is a major mess currently. This patch makes sure we directly use one set of flags to deal with it, and use it consistently, which means: - always notify about completion from the rcu callback. We only ever wait for it from on-stack callers, so this simplification does not even cause a theoretical slowdown currently. It also makes sure we don't miss out on the notification if we ever add other callers to wait for it. - make earlier completion notification depending on the on-stack allocation, not the sync mode. If we introduce new callers that want to do WB_SYNC_NONE writeback from on-stack callers this will be nessecary. Also rename bdi_wait_on_work_clear to bdi_wait_on_work_done and inline a few small functions into their only caller to make the code understandable. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-06-11 12:58:07 +02:00
Sage Weil	2b2300d62e	ceph: try to send partial cap release on cap message on missing inode If we have enough memory to allocate a new cap release message, do so, so that we can send a partial release message immediately. This keeps us from making the MDS wait when the cap release it needs is in a partially full release message. If we fail because of ENOMEM, oh well, they'll just have to wait a bit longer. Signed-off-by: Sage Weil <sage@newdream.net>	2010-06-10 13:30:25 -07:00
Sage Weil	3d7ded4d81	ceph: release cap on import if we don't have the inode If we get an IMPORT that give us a cap, but we don't have the inode, queue a release (and try to send it immediately) so that the MDS doesn't get stuck waiting for us. Signed-off-by: Sage Weil <sage@newdream.net>	2010-06-10 13:30:07 -07:00
Sage Weil	9dbd412f56	ceph: fix misleading/incorrect debug message Nothing is released here: the caps message is simply ignored in this case. Signed-off-by: Sage Weil <sage@newdream.net>	2010-06-10 13:29:59 -07:00
Jeff Mahoney	00d5643e7c	ceph: fix atomic64_t initialization on ia64 bdi_seq is an atomic_long_t but we're using ATOMIC_INIT, which causes build failures on ia64. This patch fixes it to use ATOMIC_LONG_INIT. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Sage Weil <sage@newdream.net>	2010-06-10 13:29:50 -07:00
Miklos Szeredi	6db40cf047	pipe: fix check in "set size" fcntl As it stands this check compares the number of pages to the page size. This makes no sense and makes the fcntl fail in almost any sane case. Fix it by checking if nr_pages is not zero (it can become zero only if arg is too big and round_pipe_size() overflows). Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-06-10 19:08:34 +02:00
Miklos Szeredi	1d862f4122	pipe: fix pipe buffer resizing pipe_set_size() needs to copy pipe bufs from the old circular buffer to the new. The current code gets this wrong in multiple ways, resulting in oops. Test program is available here: http://www.kernel.org/pub/linux/kernel/people/mszeredi/piperesize/ Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-06-10 19:08:34 +02:00
Jens Axboe	3e6c05052c	block: remove duplicate BUG_ON() in bd_finish_claiming() We do the same BUG_ON() just a line later when calling into __bd_abort_claiming(). Reported-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-06-10 19:08:34 +02:00
Nick Piggin	b0018361c3	block: bd_start_claiming cleanup I don't like the subtle multi-context code in bd_claim (ie. detects where it has been called based on bd_claiming). It seems clearer to just require a new function to finish a 2-part claim. Also improve commentary in bd_start_claiming as to how it should be used. Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-06-10 19:08:34 +02:00
Nick Piggin	cf3425707e	block: bd_start_claiming fix module refcount bd_start_claiming has an unbalanced module_put introduced in `6b4517a79`. Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-06-10 19:08:34 +02:00
Linus Torvalds	b95a568093	Merge branch 'for-2.6.35' of git://linux-nfs.org/~bfields/linux * 'for-2.6.35' of git://linux-nfs.org/~bfields/linux: nfsd4: shut down callback queue outside state lock nfsd: nfsd_setattr needs to call commit_metadata	2010-06-09 12:43:04 -07:00
Dave Chinner	254c8c2dbf	xfs: remove nr_to_write writeback windup. Now that the background flush code has been fixed, we shouldn't need to silently multiply the wbc->nr_to_write to get good writeback. Remove that code. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-06-08 18:12:44 -07:00
J. Bruce Fields	44b56603c4	Merge branch 'for-2.6.34-incoming' into for-2.6.35-incoming	2010-06-08 20:05:18 -04:00
J. Bruce Fields	c3935e3049	nfsd4: shut down callback queue outside state lock This reportedly causes a lockdep warning on nfsd shutdown. That looks like a false positive to me, but there's no reason why this needs the state lock anyway. Reported-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2010-06-08 19:33:52 -04:00
Linus Torvalds	3975d16760	Merge git://git.infradead.org/~dwmw2/mtd-2.6.35 * git://git.infradead.org/~dwmw2/mtd-2.6.35: jffs2: update ctime when changing the file's permission by setfacl jffs2: Fix NFS race by using insert_inode_locked() jffs2: Fix in-core inode leaks on error paths mtd: Fix NAND submenu mtd/r852: update card detect early. mtd/r852: Fixes in case of DMA timeout mtd/r852: register IRQ as last step drivers/mtd: Use memdup_user docbook: make mtd nand module init static	2010-06-07 17:10:06 -07:00
Jan Kara	1c24d06f8e	jffs2: update ctime when changing the file's permission by setfacl jffs2 didn't update the ctime of the file when its permission was changed. Steps to reproduce: # touch aaa # stat -c %Z aaa 1275289822 # setfacl -m 'u::x,g::x,o::x' aaa # stat -c %Z aaa 1275289822 <- unchanged But, according to the spec of the ctime, jffs2 must update it. Port of ext3 patch by Miao Xie <miaox@cn.fujitsu.com>. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>	2010-06-06 11:27:10 +01:00
Linus Torvalds	78b36558b7	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: Fix remaining racy updates of EXT4_I(inode)->i_flags ext4: Make sure the MOVE_EXT ioctl can't overwrite append-only files	2010-06-05 10:07:25 -07:00
Dmitry Monakhov	84a8dce271	ext4: Fix remaining racy updates of EXT4_I(inode)->i_flags A few functions were still modifying i_flags in a racy manner. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2010-06-05 11:51:27 -04:00
Linus Torvalds	6c5de280b6	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: improve xfs_isilocked xfs: skip writeback from reclaim context xfs: remove done roadmap item from xfs-delayed-logging-design.txt xfs: fix race in inode cluster freeing failing to stale inodes xfs: fix access to upper inodes without inode64 xfs: fix might_sleep() warning when initialising per-ag tree fs/xfs/quota: Add missing mutex_unlock xfs: remove duplicated #include xfs: convert more trace events to DEFINE_EVENT xfs: xfs_trace.c: remove duplicated #include xfs: Check new inode size is OK before preallocating xfs: clean up xlog_align xfs: cleanup log reservation calculactions xfs: be more explicit if RT mount fails due to config xfs: replace E2BIG with EFBIG where appropriate	2010-06-05 07:33:05 -07:00
Linus Torvalds	7926e0bfbb	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2: nilfs2: remove obsolete declarations of cache constructor and destructor nilfs2: fix style issue in nilfs_destroy_cachep	2010-06-05 07:31:13 -07:00
Linus Torvalds	7f0d384caf	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: Minix: Clean up left over label fix truncate inode time modification breakage fix setattr error handling in sysfs, configfs fcntl: return -EFAULT if copy_to_user fails wrong type for 'magic' argument in simple_fill_super() fix the deadlock in qib_fs mqueue doesn't need make_bad_inode()	2010-06-04 21:12:39 -07:00
Linus Torvalds	d2dd328b7f	Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block * 'for-linus' of git://git.kernel.dk/linux-2.6-block: (27 commits) block: make blk_init_free_list and elevator_init idempotent block: avoid unconditionally freeing previously allocated request_queue pipe: change /proc/sys/fs/pipe-max-pages to byte sized interface pipe: change the privilege required for growing a pipe beyond system max pipe: adjust minimum pipe size to 1 page block: disable preemption before using sched_clock() cciss: call BUG() earlier Preparing 8.3.8rc2 drbd: Reduce verbosity drbd: use drbd specific ratelimit instead of global printk_ratelimit drbd: fix hang on local read errors while disconnected drbd: Removed the now empty w_io_error() function drbd: removed duplicated #includes drbd: improve usage of MSG_MORE drbd: need to set socket bufsize early to take effect drbd: improve network latency, TCP_QUICKACK drbd: Revert "drbd: Create new current UUID as late as possible" brd: support discard Revert "writeback: fix WB_SYNC_NONE writeback from umount" Revert "writeback: ensure that WB_SYNC_NONE writeback with sb pinned is sync" ...	2010-06-04 15:37:44 -07:00
Linus Torvalds	f9196e7c03	Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: fix setattr error handling in sysfs, configfs kobject: free memory if netlink_kernel_create() fails lib/kobject_uevent.c: fix CONIG_NET=n warning	2010-06-04 15:27:27 -07:00
Mike Frysinger	1da083c9b2	flat: fix unmap len in load error path The data chunk is mmaped with 'len' which remains unchanged, so use that when unmapping in the error path rather than trying to recalculate (and incorrectly so) the value used originally. Signed-off-by: Mike Frysinger <vapier@gentoo.org> Acked-by: David McCullough <davidm@snapgear.com> Acked-by: Greg Ungerer <gerg@uclinux.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Michal Simek <monstr@monstr.eu> Cc: Hirokazu Takata <takata@linux-m32r.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-06-04 15:21:45 -07:00
Mike Frysinger	2e94de8acb	fs/binfmt_flat.c: split the stack & data alignments The stack and data have different alignment requirements, so don't force them to wear the same shoe. Increase the data alignment to match that which the elf2flt linker script has always been using: 0x20 bytes. Not only does this bring the kernel loader in line with the toolchain, but it also fixes a swath of gcc tests which try to force larger alignment values but randomly fail when the FLAT loader fails to deliver. Signed-off-by: Mike Frysinger <vapier@gentoo.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: David Woodhouse <David.Woodhouse@intel.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: David McCullough <davidm@snapgear.com> Acked-by: Greg Ungerer <gerg@uclinux.org> Cc: Paul Mundt <lethal@linux-sh.org> Tested-by: Michal Simek <monstr@monstr.eu> Cc: Hirokazu Takata <takata@linux-m32r.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Jie Zhang <jie@codesourcery.com> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-06-04 15:21:44 -07:00
Heiko Carstens	7cbe17701a	fs/compat_rw_copy_check_uvector: add missing compat_ptr call A call to access_ok is missing a compat_ptr conversion. Introduced with `b83733639a` "compat: factor out compat_rw_copy_check_uvector from compat_do_readv_writev" fs/compat.c: In function 'compat_rw_copy_check_uvector': fs/compat.c:629: warning: passing argument 1 of '__access_ok' makes pointer from integer without a cast Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-06-04 15:21:44 -07:00
Andrew Hendry	01afaf6198	Minix: Clean up left over label Remove a left over fail label. Signed-off-by: Andrew Hendry <andrew.hendry@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-06-04 17:16:30 -04:00
Nick Piggin	af5a30d8cf	fix truncate inode time modification breakage mtime and ctime should be changed only if the file size has actually changed. Patches changing ext2 and tmpfs from vmtruncate to new truncate sequence has caused regressions where they always update timestamps. There is some strange cases in POSIX where truncate(2) must not update times unless the size has acutally changed, see `6e656be89`. This area is all still rather buggy in different ways in a lot of filesystems and needs a cleanup and audit (ideally the vfs will provide a simple attribute or call to direct all filesystems exactly which attributes to change). But coming up with the best solution will take a while and is not appropriate for rc anyway. So fix recent regression for now. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-06-04 17:16:30 -04:00
Nick Piggin	8718d36cf9	fix setattr error handling in sysfs, configfs sysfs and configfs setattr functions have error cases after the generic inode's attributes have been changed. Fix consistency by changing the generic inode attributes only when it is guaranteed to succeed. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-06-04 17:16:29 -04:00
Dan Carpenter	5b54470dad	fcntl: return -EFAULT if copy_to_user fails copy_to_user() returns the number of bytes remaining, but we want to return -EFAULT. ret = fcntl(fd, F_SETOWN_EX, NULL); With the original code ret would be 8 here. V2: Takuya Yoshikawa pointed out a similar issue in f_getown_ex() Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-06-04 17:16:28 -04:00
Roberto Sassu	7d683a0999	wrong type for 'magic' argument in simple_fill_super() It's used to superblock ->s_magic, which is unsigned long. Signed-off-by: Roberto Sassu <roberto.sassu@polito.it> Reviewed-by: Mimi Zohar <zohar@us.ibm.com> Signed-off-by: Eric Paris <eparis@redhat.com> CC: stable@kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-06-04 17:16:28 -04:00
Nick Piggin	75de46b98d	fix setattr error handling in sysfs, configfs sysfs and configfs setattr functions have error cases after the generic inode's attributes have been changed. Fix consistency by changing the generic inode attributes only when it is guaranteed to succeed. Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2010-06-04 13:27:53 -07:00
Alex Elder	1bf7dbfde8	Merge branch 'master' into for-linus	2010-06-04 13:22:30 -05:00
Sage Weil	1e5ea23df1	ceph: fix lease revocation when seq doesn't match If the client revokes a lease with a higher seq than what we have, keep the mds's seq, so that it honors our release. Otherwise, we can hang indefinitely. Signed-off-by: Sage Weil <sage@newdream.net>	2010-06-04 10:05:40 -07:00
Linus Torvalds	03cd373981	Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6: cifs: fix page refcount leak	2010-06-03 07:20:28 -07:00
Jens Axboe	ff9da691c0	pipe: change /proc/sys/fs/pipe-max-pages to byte sized interface This changes the interface to be based on bytes instead. The API matches that of F_SETPIPE_SZ in that it rounds up the passed in size so that the resulting page array is a power-of-2 in size. The proc file is renamed to /proc/sys/fs/pipe-max-size to reflect this change. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-06-03 14:54:39 +02:00
Jens Axboe	419f8367ea	pipe: change the privilege required for growing a pipe beyond system max Change it to CAP_SYS_RESOURCE, as that more accurately models what we want to control. Suggested-by: Michael Kerrisk <mtk.manpages@googlemail.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-06-03 12:45:28 +02:00
Jens Axboe	6a6ca57de9	pipe: adjust minimum pipe size to 1 page We don't need to pages to guarantee the POSIX requirement that upto a page size write must be atomic to an empty pipe. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-06-03 12:44:30 +02:00
David Woodhouse	e72e6497e7	jffs2: Fix NFS race by using insert_inode_locked() New inodes need to be locked as we're creating them, so they don't get used by other things (like NFSd) before they're ready. Pointed out by Al Viro. Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>	2010-06-03 08:22:04 +01:00
David Woodhouse	f324e4cb2c	jffs2: Fix in-core inode leaks on error paths Pointed out by Al Viro. Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>	2010-06-03 08:03:39 +01:00
Christoph Hellwig	f936972949	xfs: improve xfs_isilocked Use rwsem_is_locked to make the assertations for shared locks work. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2010-06-03 16:22:29 +10:00
Christoph Hellwig	070ecdca54	xfs: skip writeback from reclaim context Allowing writeback from reclaim context causes massive problems with stack overflows as we can call into the writeback code which tends to be a heavy stack user both in the generic code and XFS from random contexts that perform memory allocations. Follow the example of btrfs (and in slightly different form ext4) and refuse to write out data from reclaim context. This issue should really be handled by the VM so that we can tune better for this case, but until we get it sorted out there we have to hack around this in each filesystem with a complex writeback path. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2010-06-03 16:22:29 +10:00
Dave Chinner	5b257b4a1f	xfs: fix race in inode cluster freeing failing to stale inodes When an inode cluster is freed, it needs to mark all inodes in memory as XFS_ISTALE before marking the buffer as stale. This is eeded because the inodes have a different life cycle to the buffer, and once the buffer is torn down during transaction completion, we must ensure none of the inodes get written back (which is what XFS_ISTALE does). Unfortunately, xfs_ifree_cluster() has some bugs that lead to inodes not being marked with XFS_ISTALE. This shows up when xfs_iflush() is called on these inodes either during inode reclaim or tail pushing on the AIL. The buffer is read back, but no longer contains inodes and so triggers assert failures and shutdowns. This was reproducable with at run.dbench10 invocation from xfstests. There are two main causes of xfs_ifree_cluster() failing. The first is simple - it checks in-memory inodes it finds in the per-ag icache to see if they are clean without holding the flush lock. if they are clean it skips them completely. However, If an inode is flushed delwri, it will appear clean, but is not guaranteed to be written back until the flush lock has been dropped. Hence we may have raced on the clean check and the inode may actually be dirty. Hence always mark inodes found in memory stale before we check properly if they are clean. The second is more complex, and makes the first problem easier to hit. Basically the in-memory inode scan is done with full knowledge it can be racing with inode flushing and AIl tail pushing, which means that inodes that it can't get the flush lock on might not be attached to the buffer after then in-memory inode scan due to IO completion occurring. This is actually documented in the code as "needs better interlocking". i.e. this is a zero-day bug. Effectively, the in-memory scan must be done while the inode buffer is locked and Io cannot be issued on it while we do the in-memory inode scan. This ensures that inodes we couldn't get the flush lock on are guaranteed to be attached to the cluster buffer, so we can then catch all in-memory inodes and mark them stale. Now that the inode cluster buffer is locked before the in-memory scan is done, there is no need for the two-phase update of the in-memory inodes, so simplify the code into two loops and remove the allocation of the temporary buffer used to hold locked inodes across the phases. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-06-03 16:22:29 +10:00
Theodore Ts'o	1f5a81e41f	ext4: Make sure the MOVE_EXT ioctl can't overwrite append-only files Dan Roseberg has reported a problem with the MOVE_EXT ioctl. If the donor file is an append-only file, we should not allow the operation to proceed, lest we end up overwriting the contents of an append-only file. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Dan Rosenberg <dan.j.rosenberg@gmail.com>	2010-06-02 22:04:39 -04:00
Sage Weil	558d3499bd	ceph: fix f_namelen reported by statfs We were setting f_namelen in kstatfs to PATH_MAX instead of NAME_MAX. That disagrees with ceph_lookup behavior (which checks against NAME_MAX), and also makes the pjd posix test suite spit out ugly errors because with can't clean up its temporary files. Signed-off-by: Sage Weil <sage@newdream.net>	2010-06-01 16:56:03 -07:00
Yehuda Sadeh	205475679a	ceph: fix memory leak in statfs Freeing the statfs request structure when required. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2010-06-01 16:56:02 -07:00
Henry C Chang	13a4214cd9	ceph: fix d_subdirs ordering problem We misused list_move_tail() to order the dentry in d_subdirs. This will screw up the d_subdirs order. This bug can be reliably reproduced by: 1. mount ceph fs. 2. on ceph fs, git clone git://ceph.newdream.net/git/ceph.git 3. Run autogen.sh in ceph directory. (Note: Errors only occur at the first time you run autogen.sh.) Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com> Signed-off-by: Sage Weil <sage@newdream.net>	2010-06-01 16:55:55 -07:00
Christoph Hellwig	b160fdabe9	nfsd: nfsd_setattr needs to call commit_metadata The conversion of write_inode_now calls to commit_metadata in commit `f501912a35` missed out the call in nfsd_setattr. But without this conversion we can't guarantee that a SETATTR request has actually been commited to disk with XFS, which causes a regression from 2.6.32 (only for NFSv2, but anyway). Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: stable@kernel.org Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2010-06-01 19:17:50 -04:00
Dan Carpenter	08a66859e6	FS-Cache: Remove unneeded null checks fscache_write_op() makes unnecessary checks of the page variable to see if it is NULL. It can't be NULL at those points as the kernel would already have crashed a little higher up where we examined page->index. Furthermore, unless radix_tree_gang_lookup_tag() can return 1 but no page, a NULL pointer crash should not be encountered there as we can only get there if r_t_g_l_t() returned 1. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-06-01 13:32:11 -07:00
Jeff Layton	06b43672a9	cifs: fix page refcount leak Commit `315e995c63` is causing OOM kills when stress-testing a CIFS filesystem. The VFS readpages operation takes a page reference. The older code just handed this reference off to the page cache, but the new code takes an extra one. The simplest fix is to put the new reference after add_to_page_cache_lru. Signed-off-by: Jeff Layton <jlayton@redhat.com> Acked-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Steve French <sfrench@us.ibm.com>	2010-06-01 17:15:52 +00:00
Denis Kirjanov	037776fcbe	AFS: Fix possible null pointer dereference in afs_alloc_server() Fix a possible null pointer dereference in afs_alloc_server(): the server pointer is NULL if there was an allocation failure, and under such a condition, we can't dereference it in the _leave() statement. Signed-off-by: Denis Kirjanov <dkirjanov@kernel.org> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-06-01 09:26:36 -07:00
Takuya Yoshikawa	e30c7c3b30	binfmt_elf_fdpic: Fix clear_user() error handling clear_user() returns the number of bytes that could not be copied rather than an error code. So we should return -EFAULT rather than directly returning the results. Without this patch, positive values may be returned to elf_fdpic_map_file() and the following error handlings do not function as expected. 1. ret = elf_fdpic_map_file_constdisp_on_uclinux(params, file, mm); if (ret < 0) return ret; 2. ret = elf_fdpic_map_file_by_direct_mmap(params, file, mm); if (ret < 0) return ret; Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Mike Frysinger <vapier@gentoo.org> CC: Alexander Viro <viro@zeniv.linux.org.uk> CC: Andrew Morton <akpm@linux-foundation.org> CC: Daisuke HATAYAMA <d.hatayama@jp.fujitsu.com> CC: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-06-01 08:11:06 -07:00
Jens Axboe	b4ca761577	Merge branch 'master' into for-linus Conflicts: fs/pipe.c Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-06-01 12:42:12 +02:00
Jens Axboe	0e3c9a2284	Revert "writeback: fix WB_SYNC_NONE writeback from umount" This reverts commit `e913fc825d`. We are investigating a hang associated with the WB_SYNC_NONE changes, so revert them for now. Conflicts: fs/fs-writeback.c mm/page-writeback.c Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-06-01 11:08:43 +02:00
Jens Axboe	f17625b318	Revert "writeback: ensure that WB_SYNC_NONE writeback with sb pinned is sync" This reverts commit `7c8a3554c6`. We are investigating a hang associated with the WB_SYNC_NONE changes, so revert them for now. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-06-01 11:05:22 +02:00
Ryusuke Konishi	c29684d683	nilfs2: remove obsolete declarations of cache constructor and destructor The commit `41c88bd7` ("nilfs2: cleanup multi kmem_cache_{create,destroy} code") consolidated slab constructors and destructors used in nilfs, but it left some declarations in header files. This gets rid of the obsolete declarations. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2010-05-31 20:50:29 +09:00
Ryusuke Konishi	84cb099985	nilfs2: fix style issue in nilfs_destroy_cachep This gets rid of unwanted space chars in front of conditional sentences of nilfs_destroy_cachep(). Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2010-05-31 20:50:29 +09:00
Linus Torvalds	003386fff3	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: mm: export generic_pipe_buf_() to modules fuse: support splice() reading from fuse device fuse: allow splice to move pages mm: export remove_from_page_cache() to modules mm: export lru_cache_add_() to modules fuse: support splice() writing to fuse device fuse: get page reference for readpages fuse: use get_user_pages_fast() fuse: remove unneeded variable	2010-05-30 09:16:14 -07:00
Linus Torvalds	d28619f156	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: quota: Convert quota statistics to generic percpu_counter ext3 uses rb_node = NULL; to zero rb_root. quota: Fixup dquot_transfer reiserfs: Fix resuming of quotas on remount read-write pohmelfs: Remove dead quota code ufs: Remove dead quota code udf: Remove dead quota code quota: rename default quotactl methods to dquot_ quota: explicitly set ->dq_op and ->s_qcop quota: drop remount argument to ->quota_on and ->quota_off quota: move unmount handling into the filesystem quota: kill the vfs_dq_off and vfs_dq_quota_on_remount wrappers quota: move remount handling into the filesystem ocfs2: Fix use after free on remount read-only Fix up conflicts in fs/ext4/super.c and fs/ufs/file.c	2010-05-30 09:11:11 -07:00
Linus Torvalds	b612a05537	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: ceph: clean up on forwarded aborted mds request ceph: fix leak of osd authorizer ceph: close out mds, osd connections before stopping auth ceph: make lease code DN specific fs/ceph: Use ERR_CAST ceph: renew auth tickets before they expire ceph: do not resend mon requests on auth ticket renewal ceph: removed duplicated #includes ceph: avoid possible null dereference ceph: make mds requests killable, not interruptible sched: add wait_for_completion_killable_timeout	2010-05-30 08:56:39 -07:00
Sage Weil	2a8e5e3637	ceph: clean up on forwarded aborted mds request If an mds request is aborted (timeout, SIGKILL), it is left registered to keep our state in sync with the mds. If we get a forward notification, though, we know the request didn't succeed and we can unregister it safely. We were trying to resend it, but then bailing out (and not unregistering) in __do_request. Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-29 09:42:05 -07:00
Sage Weil	79494d1b9b	ceph: fix leak of osd authorizer Release the ceph_authorizer when releasing osd state. Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-29 09:42:04 -07:00
Sage Weil	a922d38fd1	ceph: close out mds, osd connections before stopping auth The auth module (part of the mon_client) is needed to free any ceph_authorizer(s) used by the mds and osd connections. Flush the msgr workqueue before stopping monc to ensure that the destroy_authorizer auth op is available when those connections are closed out. Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-29 09:42:03 -07:00
Sage Weil	dd1c905736	ceph: make lease code DN specific The lease code includes a mask in the CEPH_LOCK_* namespace, but that namespace is changing, and only one mask (formerly _DN == 1) is used, so hard code for that value for now. If we ever extend this code to handle leases over different data types we can extend it accordingly. Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-29 09:12:42 -07:00
Julia Lawall	7e34bc524e	fs/ceph: Use ERR_CAST Use ERR_CAST(x) rather than ERR_PTR(PTR_ERR(x)). The former makes more clear what is the purpose of the operation, which otherwise looks like a no-op. In the case of fs/ceph/inode.c, ERR_CAST is not needed, because the type of the returned value is the same as the type of the enclosing function. The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ type T; T x; identifier f; @@ T f (...) { <+... - ERR_PTR(PTR_ERR(x)) + x ...+> } @@ expression x; @@ - ERR_PTR(PTR_ERR(x)) + ERR_CAST(x) // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-29 09:12:41 -07:00
Sage Weil	a41359fa35	ceph: renew auth tickets before they expire We were only requesting renewal after our tickets expire; do so before that. Most of the low-level logic for this was already there; just use it. Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-29 09:12:39 -07:00
Sage Weil	09c4d6a7d4	ceph: do not resend mon requests on auth ticket renewal We only want to send pending mon requests when we successfully authenticate. If we are already authenticated, like when we renew our ticket, there is no need to resend pending requests. Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-29 09:12:38 -07:00
Andrea Gelmini	984c76908e	ceph: removed duplicated #includes fs/ceph/auth.c: linux/slab.h is included more than once. fs/ceph/super.h: linux/slab.h is included more than once. Acked-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net> Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-29 09:12:37 -07:00
Sage Weil	e95e9a7ae4	ceph: avoid possible null dereference ac->ops may be null; use protocol id in error message instead. Reported-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-29 09:12:36 -07:00
Sage Weil	aa91647c89	ceph: make mds requests killable, not interruptible The underlying problem is that many mds requests can't be restarted. For example, a restarted create() would return -EEXIST if the original request succeeds. However, we do not want a hung MDS to hang the client too. So, use the _killable wait_for_completion variants to abort on SIGKILL but nothing else. Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-29 09:12:35 -07:00
Linus Torvalds	9a90e09854	Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6 * 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (27 commits) ACPI: Don't let acpi_pad needlessly mark TSC unstable drivers/acpi/sleep.h: Checkpatch cleanup ACPI: Minor cleanup eliminating redundant PMTIMER_TICKS to NS conversion ACPI: delete unused c-state promotion/demotion data strucutures ACPI: video: fix acpi_backlight=video ACPI: EC: Use kmemdup drivers/acpi: use kasprintf ACPI, APEI, EINJ injection parameters support Add x64 support to debugfs ACPI, APEI, Use ERST for persistent storage of MCE ACPI, APEI, Error Record Serialization Table (ERST) support ACPI, APEI, Generic Hardware Error Source memory error support ACPI, APEI, UEFI Common Platform Error Record (CPER) header Unified UUID/GUID definition ACPI Hardware Error Device (PNP0C33) support ACPI, APEI, PCIE AER, use general HEST table parsing in AER firmware_first setup ACPI, APEI, Document for APEI ACPI, APEI, EINJ support ACPI, APEI, HEST table parsing ACPI, APEI, APEI supporting infrastructure ...	2010-05-28 14:42:18 -07:00
Christoph Hellwig	fb3b504ade	xfs: fix access to upper inodes without inode64 If a filesystem is mounted without the inode64 mount option we should still be able to access inodes not fitting into 32 bits, just not created new ones. For this to work we need to make sure the inode cache radix tree is initialized for all allocation groups, not just those we plan to allocate inodes from. This patch makes sure we initialize the inode cache radix tree for all allocation groups, and also cleans xfs_initialize_perag up a bit to separate the inode32 logical from the general perag structure setup. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-28 15:19:56 -05:00
Dave Chinner	9b98b6f3e1	xfs: fix might_sleep() warning when initialising per-ag tree The use of radix_tree_preload() only works if the radix tree was initialised without the __GFP_WAIT flag. The per-ag tree uses GFP_NOFS, so does not trigger allocation of new tree nodes from the preloaded array. Hence it enters the allocator with a spinlock held and triggers the might_sleep() warnings. Reported-by; Chris Mason <chris.mason@oracle.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-28 15:19:50 -05:00
Julia Lawall	38e712ab3d	fs/xfs/quota: Add missing mutex_unlock Add a mutex_unlock missing on the error path. The use of this lock is balanced elsewhere in the file. The semantic match that finds this problem is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression E1; @@ * mutex_lock(E1,...); <+... when != E1 if (...) { ... when != E1 * return ...; } ...+> * mutex_unlock(E1,...); // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-28 15:19:41 -05:00
Huang Weiyi	3bd0946eb1	xfs: remove duplicated #include Remove duplicated #include('s) in fs/xfs/linux-2.6/xfs_quotaops.c Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-28 15:19:36 -05:00
Li Zefan	f8adb4d574	xfs: convert more trace events to DEFINE_EVENT Use DECLARE_EVENT_CLASS, and save ~15K: text data bss dec hex filename 171949 43028 48 215025 347f1 fs/xfs/linux-2.6/xfs_trace.o.orig 156521 43028 36 199585 30ba1 fs/xfs/linux-2.6/xfs_trace.o No change in functionality. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-28 15:19:31 -05:00
Huang Weiyi	292ec4cf35	xfs: xfs_trace.c: remove duplicated #include Remove duplicated #include('s) in fs/xfs/linux-2.6/xfs_trace.c Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-28 15:19:24 -05:00
Dave Chinner	07f1a4f5e8	xfs: Check new inode size is OK before preallocating The new xfsqa test 228 tries to preallocate more space than the filesystem contains. it should fail, but instead triggers an assert about lock flags. The failure is due to the size extension failing in vmtruncate() due to rlimit being set. Check this before we start the preallocation to avoid allocating space that will never be used. Also the path through xfs_vn_allocate already holds the IO lock, so it should not be present in the lock flags when the setattr fails. Hence the assert needs to take this into account. This will prevent other such callers from hitting this incorrect ASSERT. (Fixed a reference to "newsize" to read "new_size". -Alex) Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-28 15:19:12 -05:00
Christoph Hellwig	fdc07f44c8	xfs: clean up xlog_align Add suggested cleanups to commit 29db3370a1369541d58d692fbfb168b8a0bd7f41 from review that didn't end up being commited. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-28 14:58:36 -05:00
Christoph Hellwig	025101dca4	xfs: cleanup log reservation calculactions Instead of having small helper functions calling big macros do the calculations for the log reservations directly in the functions. These are mostly 1:1 from the macros execept that the macros kept the quota calculations in their callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-28 14:58:30 -05:00
Eric Sandeen	32891b292d	xfs: be more explicit if RT mount fails due to config Recent testers were slightly confused that a realtime mount failed due to missing CONFIG_XFS_RT; we can make that a little more obvious. V2: drop the else as suggested by Christoph Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-28 14:58:24 -05:00
Eric Sandeen	657a4cffde	xfs: replace E2BIG with EFBIG where appropriate Many places in the xfs code return E2BIG when they really mean EFBIG; trying to grow past 16T on a 32 bit machine, for example, says "Argument list too long" rather than "File too large" which is not particularly helpful. Some of these don't make perfect sense as EFBIG either, but still better than E2BIG IMHO. Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-28 14:58:16 -05:00
Al Viro	49837a80b3	remove detritus left by "mm: make read_cache_page synchronous" gets minix get_dir_page() in sync with its analogs; back in 2007 Nick has switched read_cache_page() and friends to sync behaviour (i.e. they wait for the page to get unlocked, check if it's uptodate and if it isn't return ERR_PTR(-EIO) instead) and removed the duplicate logics from the callers. In case of fs/minix/dir.c he'd removed only half of that... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-28 11:37:41 -04:00
Al Viro	4c9002de32	fix fs/sysv s_dirt handling got broken on ->sync_fs() conversion a year ago, nobody noticed... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-27 22:16:05 -04:00
npiggin@suse.de	459f6ed3b8	fat: convert to use the new truncate convention. Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-27 22:16:02 -04:00
npiggin@suse.de	737f2e93b9	ext2: convert to use the new truncate convention. I also have commented a possible bug in existing ext2 code, marked with XXX. Cc: linux-ext4@vger.kernel.org Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-27 22:15:57 -04:00
Nick Piggin	3322e79a38	fs: convert simple fs to new truncate Convert simple filesystems: ramfs, configfs, sysfs, block_dev to new truncate sequence. Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-27 22:15:47 -04:00
npiggin@suse.de	15c6fd9786	kill spurious reference to vmtruncate Lots of filesystems calls vmtruncate despite not implementing the old ->truncate method. Switch them to use simple_setsize and add some comments about the truncate code where it seems fitting. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-27 22:15:42 -04:00
npiggin@suse.de	7bb46a6734	fs: introduce new truncate sequence Introduce a new truncate calling sequence into fs/mm subsystems. Rather than setattr > vmtruncate > truncate, have filesystems call their truncate sequence from ->setattr if filesystem specific operations are required. vmtruncate is deprecated, and truncate_pagecache and inode_newsize_ok helpers introduced previously should be used. simple_setattr is introduced for simple in-ram filesystems to implement the new truncate sequence. Eventually all filesystems should be converted to implement a setattr, and the default code in notify_change should go away. simple_setsize is also introduced to perform just the ATTR_SIZE portion of simple_setattr (ie. changing i_size and trimming pagecache). To implement the new truncate sequence: - filesystem specific manipulations (eg freeing blocks) must be done in the setattr method rather than ->truncate. - vmtruncate can not be used by core code to trim blocks past i_size in the event of write failure after allocation, so this must be performed in the fs code. - convert usage of helpers block_write_begin, nobh_write_begin, cont_write_begin, and blockdev_direct_IO to use _newtrunc postfixed variants. These avoid calling vmtruncate to trim blocks (see previous). - inode_setattr should not be used. generic_setattr is a new function to be used to copy simple attributes into the generic inode. - make use of the better opportunity to handle errors with the new sequence. Big problem with the previous calling sequence: the filesystem is not called until i_size has already changed. This means it is not allowed to fail the call, and also it does not know what the previous i_size was. Also, generic code calling vmtruncate to truncate allocated blocks in case of error had no good way to return a meaningful error (or, for example, atomically handle block deallocation). Cc: Christoph Hellwig <hch@lst.de> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-27 22:15:33 -04:00
Randy Dunlap	7000d3c424	fs/super: fix kernel-doc warning Fix fs/super.c kernel-doc warning and function notation: Warning(fs/super.c:957): No description found for parameter 'sb' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-27 22:06:23 -04:00
Erik van der Kouwe	0ab7620a0c	fs/minix: bugfix, number of indirect block ptrs per block depends on block size The MINIX filesystem driver used a constant number of indirect block pointers in an indirect block. This worked only for filesystems with 1kb block, while the MINIX default block size is now 4kb. As a consequence, large files were read incorrectly on such filesystems and writing a large file would cause the filesystem to become corrupted. This patch computes the number of indirect block pointers based on the block size, making the driver work for each block size. I would like to thank Feiran Zheng ('Fam') for pointing out the cause of the corruption. Signed-off-by: Erik van der Kouwe <vdkouwe@cs.vu.nl> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-27 22:06:22 -04:00
Christoph Hellwig	1b061d9247	rename the generic fsync implementations We don't name our generic fsync implementations very well currently. The no-op implementation for in-memory filesystems currently is called simple_sync_file which doesn't make too much sense to start with, the the generic one for simple filesystems is called simple_fsync which can lead to some confusion. This patch renames the generic file fsync method to generic_file_fsync to match the other generic_file_* routines it is supposed to be used with, and the no-op implementation to noop_fsync to make it obvious what to expect. In addition add some documentation for both methods. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-27 22:06:06 -04:00
Christoph Hellwig	7ea8085910	drop unused dentry argument to ->fsync Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-27 22:05:02 -04:00
Julia Lawall	cc967be547	fs: Add missing mutex_unlock Add a mutex_unlock missing on the error path. At other exists from the function that return an error flag, the mutex is unlocked, so do the same here. The semantic match that finds this problem is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression E1; @@ * mutex_lock(E1,...); <+... when != E1 if (...) { ... when != E1 * return ...; } ...+> * mutex_unlock(E1,...); // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-27 22:03:09 -04:00
Al Viro	d7065da038	get rid of the magic around f_count in aio __aio_put_req() plays sick games with file refcount. What it wants is fput() from atomic context; it's almost always done with f_count > 1, so they only have to deal with delayed work in rare cases when their reference happens to be the last one. Current code decrements f_count and if it hasn't hit 0, everything is fine. Otherwise it keeps a pointer to struct file (with zero f_count!) around and has delayed work do __fput() on it. Better way to do it: use atomic_long_add_unless( , -1, 1) instead of !atomic_long_dec_and_test(). IOW, decrement it only if it's not the last reference, leave refcount alone if it was. And use normal fput() in delayed work. I've made that atomic_long_add_unless call a new helper - fput_atomic(). Drops a reference to file if it's safe to do in atomic (i.e. if that's not the last one), tells if it had been able to do that. aio.c converted to it, __fput() use is gone. req->ki_file always contributes to refcount now. And __fput() became static. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-27 22:03:07 -04:00
Neil Brown	176306f59a	VFS: fix recent breakage of FS_REVAL_DOT Commit `1f36f774b2` broke FS_REVAL_DOT semantics. In particular, before this patch, the command ls -l in an NFS mounted directory would always check if the directory on the server had changed and if so would flush and refill the pagecache for the dir. After this patch, the same "ls -l" will repeatedly return stale date until the cached attributes for the directory time out. The following patch fixes this by ensuring the d_revalidate is called by do_last when "." is being looked-up. link_path_walk has already called d_revalidate, but in that case LOOKUP_OPEN is not set so nfs_lookup_verify_inode chooses not to do any validation. The following patch restores the original behaviour. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-27 22:03:06 -04:00
Al Viro	1eb2cbb6d5	Revert "anon_inode: set S_IFREG on the anon_inode" This reverts commit `a7cf4145bb`.	2010-05-27 22:03:05 -04:00
Linus Torvalds	105a048a4f	Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (27 commits) Btrfs: add more error checking to btrfs_dirty_inode Btrfs: allow unaligned DIO Btrfs: drop verbose enospc printk Btrfs: Fix block generation verification race Btrfs: fix preallocation and nodatacow checks in O_DIRECT Btrfs: avoid ENOSPC errors in btrfs_dirty_inode Btrfs: move O_DIRECT space reservation to btrfs_direct_IO Btrfs: rework O_DIRECT enospc handling Btrfs: use async helpers for DIO write checksumming Btrfs: don't walk around with task->state != TASK_RUNNING Btrfs: do aio_write instead of write Btrfs: add basic DIO read/write support direct-io: do not merge logically non-contiguous requests direct-io: add a hook for the fs to provide its own submit_bio function fs: allow short direct-io reads to be completed via buffered IO Btrfs: Metadata ENOSPC handling for balance Btrfs: Pre-allocate space for data relocation Btrfs: Metadata ENOSPC handling for tree log Btrfs: Metadata reservation for orphan inodes Btrfs: Introduce global metadata reservation ...	2010-05-27 10:43:44 -07:00
Linus Torvalds	e4ce30f377	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (40 commits) ext4: Make fsync sync new parent directories in no-journal mode ext4: Drop whitespace at end of lines ext4: Fix compat EXT4_IOC_ADD_GROUP ext4: Conditionally define compat ioctl numbers tracing: Convert more ext4 events to DEFINE_EVENT ext4: Add new tracepoints to track mballoc's buddy bitmap loads ext4: Add a missing trace hook ext4: restart ext4_ext_remove_space() after transaction restart ext4: Clear the EXT4_EOFBLOCKS_FL flag only when warranted ext4: Avoid crashing on NULL ptr dereference on a filesystem error ext4: Use bitops to read/modify i_flags in struct ext4_inode_info ext4: Convert calls of ext4_error() to EXT4_ERROR_INODE() ext4: Convert callers of ext4_get_blocks() to use ext4_map_blocks() ext4: Add new abstraction ext4_map_blocks() underneath ext4_get_blocks() ext4: Use our own write_cache_pages() ext4: Show journal_checksum option ext4: Fix for ext4_mb_collect_stats() ext4: check for a good block group before loading buddy pages ext4: Prevent creation of files larger than RLIMIT_FSIZE using fallocate ext4: Remove extraneous newlines in ext4_msg() calls ... Fixed up trivial conflict in fs/ext4/fsync.c	2010-05-27 10:26:37 -07:00
Linus Torvalds	ade61088bc	Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6 * 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: NFS: Fix another nfs_wb_page() deadlock NFS: Ensure that we mark the inode as dirty if we exit early from commit NFS: Fix a lock imbalance typo in nfs_access_cache_shrinker sunrpc: fix leak on error on socket xprt setup	2010-05-27 10:18:44 -07:00
Dmitry Monakhov	f32764bd2b	quota: Convert quota statistics to generic percpu_counter Generic per-cpu counter has some memory overhead but it is negligible for modern systems and embedded systems compile without quota support. And code reuse is a good thing. This patch should fix complain from preemptive kernels which was introduced by `dde9588853`. [Jan Kara: Fixed patch to work on 32-bit archs as well] Reported-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Jan Kara <jack@suse.cz>	2010-05-27 18:56:27 +02:00
jan Blunck	ca572727db	fs/: do not fallback to default_llseek() when readdir() uses BKL Do not use the fallback default_llseek() if the readdir operation of the filesystem still uses the big kernel lock. Since llseek() modifies file->f_pos of the directory directly it may need locking to not confuse readdir which usually uses file->f_pos directly as well Since the special characteristics of the BKL (unlocked on schedule) are not necessary in this case, the inode mutex can be used for locking as provided by generic_file_llseek(). This is only possible since all filesystems, except reiserfs, either use a directory as a flat file or with disk address offsets. Reiserfs on the other hand uses a 32bit hash off the filename as the offset so generic_file_llseek() can get used as well since the hash is always smaller than sb->s_maxbytes (= (512 << 32) - blocksize). Signed-off-by: Jan Blunck <jblunck@suse.de> Acked-by: Jan Kara <jack@suse.cz> Acked-by: Anders Larsen <al@alarsen.net> Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:56 -07:00
jan Blunck	ae6afc3f5c	vfs: introduce noop_llseek() This is an implementation of ->llseek useable for the rare special case when userspace expects the seek to succeed but the (device) file is actually not able to perform the seek. In this case you use noop_llseek() instead of falling back to the default implementation of ->llseek. Signed-off-by: Jan Blunck <jblunck@suse.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:56 -07:00
Jeff Moyer	9d85cba718	aio: fix the compat vectored operations The aio compat code was not converting the struct iovecs from 32bit to 64bit pointers, causing either EINVAL to be returned from io_getevents, or EFAULT as the result of the I/O. This patch passes a compat flag to io_submit to signal that pointer conversion is necessary for a given iocb array. A variant of this was tested by Michael Tokarev. I have also updated the libaio test harness to exercise this code path with good success. Further, I grabbed a copy of ltp and ran the testcases/kernel/syscall/readv and writev tests there (compiled with -m32 on my 64bit system). All seems happy, but extra eyes on this would be welcome. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: fix CONFIG_COMPAT=n build] Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Reported-by: Michael Tokarev <mjt@tls.msk.ru> Cc: Zach Brown <zach.brown@oracle.com> Cc: <stable@kernel.org> [2.6.35.1] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:53 -07:00
Jeff Moyer	b83733639a	compat: factor out compat_rw_copy_check_uvector from compat_do_readv_writev It was reported in http://lkml.org/lkml/2010/3/8/309 that 32 bit readv and writev AIO operations were not functioning properly. It turns out that the code to convert the 32bit io vectors to 64 bits was never written. The results of that can be pretty bad, but in my testing, it mostly ended up in generating EFAULT as we walked off the list of I/O vectors provided. This patch set fixes the problem in my environment. are greatly appreciated. This patch: Factor out code that will be used by both compat_do_readv_writev and the compat aio submission code paths. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Reported-by: Michael Tokarev <mjt@tls.msk.ru> Cc: Zach Brown <zach.brown@oracle.com> Cc: <stable@kernel.org> [2.6.35.1] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:53 -07:00
Julia Lawall	cccad8f9f0	fs/affs: use ERR_CAST Use ERR_CAST(x) rather than ERR_PTR(PTR_ERR(x)). The former makes more clear what is the purpose of the operation, which otherwise looks like a no-op. The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ type T; T x; identifier f; @@ T f (...) { <+... - ERR_PTR(PTR_ERR(x)) + x ...+> } @@ expression x; @@ - ERR_PTR(PTR_ERR(x)) + ERR_CAST(x) // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:53 -07:00
Wu Fengguang	36e15263aa	kcore: add _text to KCORE_TEXT Extend KCORE_TEXT to cover the pages between _text and _stext, to allow examining some important page table pages. `readelf -a` output on x86_64 before and after patch: Type Offset VirtAddr PhysAddr before LOAD 0x00007fff8100c000 0xffffffff81009000 0x0000000000000000 after LOAD 0x00007fff81003000 0xffffffff81000000 0x0000000000000000 The newly covered pages are: 0xffffffff81000000 <startup_64> etc. 0xffffffff81001000 <init_level4_pgt> 0xffffffff81002000 <level3_ident_pgt> 0xffffffff81003000 <level3_kernel_pgt> 0xffffffff81004000 <level2_fixmap_pgt> 0xffffffff81005000 <level1_fixmap_pgt> 0xffffffff81006000 <level2_ident_pgt> 0xffffffff81007000 <level2_kernel_pgt> 0xffffffff81008000 <level2_spare_pgt> Before patch, /proc/kcore shows outdated contents for the above page table pages, for example: (gdb) p level3_ident_pgt $1 = {<text variable, no debug info>} 0xffffffff81002000 <level3_ident_pgt> (gdb) p/x ((pud_t )&level3_ident_pgt)@512 $2 = {{pud = 0x1006063}, {pud = 0x0} <repeats 511 times>} while the real content is: root@hp /home/wfg# hexdump -s 0x1002000 -n 4096 /dev/mem 1002000 6063 0100 0000 0000 8067 0000 0000 0000 1002010 0000 0000 0000 0000 0000 0000 0000 0000 * 1003000 That is, on a x86_64 box with 2GB memory, we can see first-1GB / full-2GB identity mapping before/after patch: (gdb) p/x ((pud_t )&level3_ident_pgt)@512 before $1 = {{pud = 0x1006063}, {pud = 0x0} <repeats 511 times>} after $1 = {{pud = 0x1006063}, {pud = 0x8067}, {pud = 0x0} <repeats 510 times>} Obviously the content before patch is wrong. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:47 -07:00
Amerigo Wang	57f87869f0	proc: remove obsolete comments A quick test shows these comments are obsolete, so just remove them. Signed-off-by: WANG Cong <amwang@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:47 -07:00
Dan Carpenter	73d3646029	proc: cleanup: remove unused assignments I removed 3 unused assignments. The first two get reset on the first statement of their functions. For "err" in root.c we don't return an error and we don't use the variable again. Signed-off-by: Dan Carpenter <error27@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:47 -07:00
Oleg Nesterov	7e49827cc9	proc: get_nr_threads() doesn't need ->siglock any longer Now that task->signal can't go away get_nr_threads() doesn't need ->siglock to read signal->count. Also, make it inline, move into sched.h, and convert 2 other proc users of signal->count to use this (now trivial) helper. Henceforth get_nr_threads() is the only valid user of signal->count, we are ready to turn it into "int nr_threads" or, perhaps, kill it. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: David Howells <dhowells@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:47 -07:00
Oleg Nesterov	d344193a05	exit: avoid sig->count in de_thread/__exit_signal synchronization de_thread() and __exit_signal() use signal_struct->count/notify_count for synchronization. We can simplify the code and use ->notify_count only. Instead of comparing these two counters, we can change de_thread() to set ->notify_count = nr_of_sub_threads, then change __exit_signal() to dec-and-test this counter and notify group_exit_task. Note that __exit_signal() checks "notify_count > 0" just for symmetry with exit_notify(), we could just check it is != 0. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:46 -07:00
Oleg Nesterov	269b005a28	coredump: shift down_write(mmap_sem) into coredump_wait() - move the cprm.mm_flags checks up, before we take mmap_sem - move down_write(mmap_sem) and ->core_state check from do_coredump() to coredump_wait() This simplifies the code and makes the locking symmetrical. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Roland McGrath <roland@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:45 -07:00
Oleg Nesterov	5e43aef530	coredump: factor out put_cred() calls Given that do_coredump() calls put_cred() on exit path, it is a bit ugly to do put_cred() + "goto fail" twice, just add the new "fail_creds" label. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Roland McGrath <roland@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:45 -07:00
Oleg Nesterov	d5bf4c4f5f	coredump: cleanup "ispipe" code - kill "int dump_count", argv_split(argcp) accepts argcp == NULL. - move "int dump_count" under " if (ispipe)" branch, fail_dropcount can check ispipe. - move "char **helper_argv" as well, change the code to do argv_free() right after call_usermodehelper_fns(). - If call_usermodehelper_fns() fails goto close_fail label instead of closing the file by hand. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Roland McGrath <roland@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:45 -07:00
Oleg Nesterov	c713541125	coredump: factor out the not-ispipe file checks do_coredump() does a lot of file checks after it opens the file or calls usermode helper. But all of these checks are only needed in !ispipe case. Move this code into the "else" branch and kill the ugly repetitive ispipe checks. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Roland McGrath <roland@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:45 -07:00
Neil Horman	898b374af6	exec: replace call_usermodehelper_pipe with use of umh init function and resolve limit The first patch in this series introduced an init function to the call_usermodehelper api so that processes could be customized by caller. This patch takes advantage of that fact, by customizing the helper in do_coredump to create the pipe and set its core limit to one (for our recusrsion check). This lets us clean up the previous uglyness in the usermodehelper internals and factor call_usermodehelper out entirely. While I'm at it, we can also modify the helper setup to look for a core limit value of 1 rather than zero for our recursion check Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:44 -07:00
Thomas Stewart	d27d7a9a78	ufs: permit mounting of BorderWare filesystems I recently had to recover some files from an old broken machine that was running BorderWare Document Gateway. It's basically a drop in web server for sharing files. From the look of the init process and using strings on of a few files it seems to be based on FreeBSD 3.3. The process turned out to be more difficult than I imagined, but to cut a long story short BorderWare in their wisdom use a nonstandard magic number in their UFS (ufstype=44bsd) file systems. Thus Linux refuses to mount the file systems in order to recover the data. After a bit of hunting I was able to make a quick fix to fs/ufs/super.c in order to detect the new magic number. I assume that this number is the same for all installations. It's quite easy to find out from ufs_fs.h. The superblock sits 8k into the block device and the magic number its 1372 bytes into the superblock struct. # dd if=/dev/sda5 skip=$(( 8192 + 1372 )) bs=1 count=4 2> /dev/null \| hd 00000000 97 26 24 0f \|.&$.\| # Signed-off-by: Thomas Stewart <thomas@stewarts.org.uk> Cc: Evgeniy Dushistov <dushistov@mail.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:43 -07:00
Julia Lawall	7ca5ca60cb	fs/autofs4: use memdup_user Use memdup_user when user data is immediately copied into the allocated region. Elimination of the variable ads, which is no longer useful. The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression from,to,size,flag; position p; identifier l1,l2; @@ - to = $kmalloc@p\\|kzalloc@p$(size,flag); + to = memdup_user(from,size); if ( - to==NULL + IS_ERR(to) \|\| ...) { <+... when != goto l1; - -ENOMEM + PTR_ERR(to) ...+> } - if (copy_from_user(to, from, size) != 0) { - <+... when != goto l2; - -EFAULT - ...+> - } // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Cc: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:41 -07:00
Venkatesh Pallipadi	1513b02c8b	ext3 uses rb_node = NULL; to zero rb_root. The problem with this is that `17d9ddc72f` ("rbtree: Add support for augmented rbtrees") in the linux-next tree adds a new field to that struct which needs to be NULLas well. This patch uses RB_ROOT as the intializer so all of the relevant fields will be NULL'd. Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Cc: Eric Paris <eparis@redhat.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jan Kara <jack@suse.cz>	2010-05-27 17:39:36 +02:00
Jan Kara	4dea496974	quota: Fixup dquot_transfer Commit `bc8e5f0739` had a typo which caused quota miscomputation when changing owner group of a file. Linus will hate me. Signed-off-by: Jan Kara <jack@suse.cz>	2010-05-27 17:39:36 +02:00
Jan Kara	f4b113ae6f	reiserfs: Fix resuming of quotas on remount read-write When quota was suspended on remount-ro, finish_unfinished() will try to turn it on again (which fails) and also turns the quotas off on exit. Fix the function to check whether quotas are already on at function entry and do not turn them off in that case. CC: reiserfs-devel@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz>	2010-05-27 17:39:36 +02:00
Chris Mason	9aeead7378	Btrfs: add more error checking to btrfs_dirty_inode The ENOSPC code will now return ENOSPC to btrfs_start_transaction. btrfs_dirty_inode needs to check for this and error out appropriately. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-27 10:23:00 -04:00
Chris Mason	5a5f79b570	Btrfs: allow unaligned DIO In order to support DIO that isn't aligned to the filesystem blocksize, we fall back to buffered for any unaligned DIOs. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-26 21:35:35 -04:00
Chris Mason	933b585f70	Btrfs: drop verbose enospc printk Less printk is good printk. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-26 21:35:34 -04:00
Yan, Zheng	5bdd3536cb	Btrfs: Fix block generation verification race After the path is released, the generation number got from block pointer is no long valid. The race may cause disk corruption, because verify_parent_transid() calls clear_extent_buffer_uptodate() when generation numbers mismatch. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-26 21:35:33 -04:00
Chris Mason	46bfbb5c07	Btrfs: fix preallocation and nodatacow checks in O_DIRECT The O_DIRECT code wasn't checking for multiple references on preallocated or nodatacow extents. This means it wasn't honoring snapshots properly. The fix here is to add an explicit check for multiple references This also fixes the math for selecting the correct disk block, making sure not to go past the end of the extent. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-26 21:34:45 -04:00
Linus Torvalds	63a6440326	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus * git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus: squashfs: update documentation to include description of xattr layout squashfs: fix name reading in squashfs_xattr_get squashfs: constify xattr handlers squashfs: xattr fix sparse warnings squashfs: xattr_lookup sparse fix squashfs: add xattr support configure option squashfs: add new extended inode types squashfs: add support for xattr reading squashfs: add xattr id support	2010-05-26 08:57:20 -07:00
Andrew Morton	cc68e3be74	fs/fscache/object-list.c: fix warning on 32-bit fs/fscache/object-list.c: In function 'fscache_objlist_lookup': fs/fscache/object-list.c:105: warning: cast to pointer from integer of different size Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-26 08:19:23 -07:00
Chris Mason	94b604429a	Btrfs: avoid ENOSPC errors in btrfs_dirty_inode btrfs_dirty_inode tries to sneak in without much waiting or space reservation, mostly for performance reasons. This usually works well but can cause problems when there are many many writers. When btrfs_update_inode fails with ENOSPC, we fallback to a slower btrfs_start_transaction call that will reserve some space. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-26 11:02:00 -04:00
Chris Mason	3f7c579c41	Btrfs: move O_DIRECT space reservation to btrfs_direct_IO This moves the delalloc space reservation done for O_DIRECT into btrfs_direct_IO. This way we don't leak reserved space if the generic O_DIRECT write code errors out before it calls into btrfs_direct_IO. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-26 10:59:53 -04:00
Trond Myklebust	0522f6aded	NFS: Fix another nfs_wb_page() deadlock J.R. Okajima reports that the call to sync_inode() in nfs_wb_page() can deadlock with other writeback flush calls. It boils down to the fact that we cannot ever call writeback_single_inode() while holding a page lock (even if we do set nr_to_write to zero) since another process may already be waiting in the call to do_writepages(), and so will deny us the I_SYNC lock. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2010-05-26 08:43:53 -04:00
Trond Myklebust	c5efa5fc91	NFS: Ensure that we mark the inode as dirty if we exit early from commit If we exit from nfs_commit_inode() without ensuring that the COMMIT rpc call has been completed, we must re-mark the inode as dirty. Otherwise, future calls to sync_inode() with the WB_SYNC_ALL flag set will fail to ensure that the data is on the disk. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2010-05-26 08:43:52 -04:00
Trond Myklebust	59844a9bd7	NFS: Fix a lock imbalance typo in nfs_access_cache_shrinker Commit `9c7e7e2337` (NFS: Don't call iput() in nfs_access_cache_shrinker) unintentionally removed the spin unlock for the inode->i_lock. Reported-by: David Howells <dhowells@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2010-05-26 08:43:51 -04:00
Miklos Szeredi	51921cb746	mm: export generic_pipe_buf_*() to modules This is needed by fuse device code which wants to create pipe buffers. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2010-05-26 08:44:22 +02:00
Chris Mason	4845e44ffd	Btrfs: rework O_DIRECT enospc handling This changes O_DIRECT write code to mark extents as delalloc while it is processing them. Yan Zheng has reworked the enospc accounting based on tracking delalloc extents and this makes it much easier to track enospc in the O_DIRECT code. There are a few space cases with the O_DIRECT code though, it only sets the EXTENT_DELALLOC bits, instead of doing EXTENT_DELALLOC \| EXTENT_DIRTY \| EXTENT_UPTODATE, because we don't want to mess with clearing the dirty and uptodate bits when things go wrong. This is important because there are no pages in the page cache, so any extent state structs that we put in the tree won't get freed by releasepage. We have to clear them ourselves as the DIO ends. With this commit, we reserve space at in btrfs_file_aio_write, and then as each btrfs_direct_IO call progresses it sets EXTENT_DELALLOC on the range. btrfs_get_blocks_direct is responsible for clearing the delalloc at the same time it drops the extent lock. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-25 21:52:08 -04:00
Kay Sievers	578454ff7e	driver core: add devname module aliases to allow module on-demand auto-loading This adds: alias: devname:<name> to some common kernel modules, which will allow the on-demand loading of the kernel module when the device node is accessed. Ideally all these modules would be compiled-in, but distros seems too much in love with their modularization that we need to cover the common cases with this new facility. It will allow us to remove a bunch of pretty useless init scripts and modprobes from init scripts. The static device node aliases will be carried in the module itself. The program depmod will extract this information to a file in the module directory: $ cat /lib/modules/2.6.34-00650-g537b60d-dirty/modules.devname # Device nodes to trigger on-demand module loading. microcode cpu/microcode c10:184 fuse fuse c10:229 ppp_generic ppp c108:0 tun net/tun c10:200 dm_mod mapper/control c10:235 Udev will pick up the depmod created file on startup and create all the static device nodes which the kernel modules specify, so that these modules get automatically loaded when the device node is accessed: $ /sbin/udevd --debug ... static_dev_create_from_modules: mknod '/dev/cpu/microcode' c10:184 static_dev_create_from_modules: mknod '/dev/fuse' c10:229 static_dev_create_from_modules: mknod '/dev/ppp' c108:0 static_dev_create_from_modules: mknod '/dev/net/tun' c10:200 static_dev_create_from_modules: mknod '/dev/mapper/control' c10:235 udev_rules_apply_static_dev_perms: chmod '/dev/net/tun' 0666 udev_rules_apply_static_dev_perms: chmod '/dev/fuse' 0666 A few device nodes are switched to statically allocated numbers, to allow the static nodes to work. This might also useful for systems which still run a plain static /dev, which is completely unsafe to use with any dynamic minor numbers. Note: The devname aliases must be limited to the common and singleinstance* device nodes, like the misc devices, and never be used for conceptually limited systems like the loop devices, which should rather get fixed properly and get a control node for losetup to talk to, instead of creating a random number of device nodes in advance, regardless if they are ever used. This facility is to hide the mess distros are creating with too modualized kernels, and just to hide that these modules are not compiled-in, and not to paper-over broken concepts. Thanks! :) Cc: Greg Kroah-Hartman <gregkh@suse.de> Cc: David S. Miller <davem@davemloft.net> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Chris Mason <chris.mason@oracle.com> Cc: Alasdair G Kergon <agk@redhat.com> Cc: Tigran Aivazian <tigran@aivazian.fsnet.co.uk> Cc: Ian Kent <raven@themaw.net> Signed-Off-By: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2010-05-25 15:08:26 -07:00
Linus Torvalds	f16a5e3478	Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes: GFS2: Fix permissions checking for setflags ioctl() GFS2: Don't "get" xattrs for ACLs when ACLs are turned off GFS2: Rework reclaiming unlinked dinodes	2010-05-25 08:17:51 -07:00
Linus Torvalds	110b93842e	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: Ensure inode allocation buffers are fully replayed xfs: enable background pushing of the CIL xfs: forced unmounts need to push the CIL xfs: Introduce delayed logging core code xfs: Delayed logging design documentation xfs: Improve scalability of busy extent tracking xfs: make the log ticket ID available outside the log infrastructure xfs: clean up log ticket overrun debug output xfs: Clean up XFS_BLI_* flag namespace xfs: modify buffer item reference counting xfs: allow log ticket allocation to take allocation flags xfs: Don't reuse the same transaction ID for duplicated transactions.	2010-05-25 08:17:01 -07:00
Huang Weiyi	337bbfdbff	smbfs: remove duplicated #include Remove duplicated #include('s) in fs/smbfs/symlink.c Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-25 08:07:07 -07:00
Andy Shevchenko	91f06e6680	fs: ldm: don't use own implementation of hex_to_bin() Remove own implementation of hex_to_bin(). Signed-off-by: Andy Shevchenko <ext-andriy.shevchenko@nokia.com> Cc: "Richard Russon (FlatCap)" <ldm@flatcap.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-25 08:07:06 -07:00
OGAWA Hirofumi	aaa04b4875	fatfs: ratelimit corruption report Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-25 08:07:04 -07:00
Minchan Kim	4c99000ac4	ntfs: use add_to_page_cache_lru() Quote from Nick piggin's about btrfs patch - http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg04472.html. "add_to_page_cache_lru is exported, so it should be used. Benefits over using a private pagevec: neater code, 128 bytes fewer stack used, percpu lru ordering is preserved, and finally don't need to flush pagevec before returning so batching may be shared with other LRU insertions." Let's use it instead of private pagevec in ntfs, too. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Anton Altaparmakov <aia21@cantab.net> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-25 08:07:03 -07:00
Minchan Kim	2ec93b0bf3	ntfs: clean up ntfs_attr_extend_initialized cached_page and lru_pvec have not been used. Let's remove the arguments. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Anton Altaparmakov <aia21@cantab.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-25 08:07:03 -07:00
Alexey Dobriyan	4be929be34	kernel-wide: replace USHORT_MAX, SHORT_MAX and SHORT_MIN with USHRT_MAX, SHRT_MAX and SHRT_MIN - C99 knows about USHRT_MAX/SHRT_MAX/SHRT_MIN, not USHORT_MAX/SHORT_MAX/SHORT_MIN. - Make SHRT_MIN of type s16, not int, for consistency. [akpm@linux-foundation.org: fix drivers/dma/timb_dma.c] [akpm@linux-foundation.org: fix security/keys/keyring.c] Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-25 08:07:02 -07:00
Richard Kennedy	58a9d3d8db	fs-writeback: check sync bit earlier in inode_wait_for_writeback When wb_writeback() hasn't written anything it will re-acquire the inode lock before calling inode_wait_for_writeback. This change tests the sync bit first so that is doesn't need to drop & re-acquire the lock if the inode became available while wb_writeback() was waiting to get the lock. Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-25 08:07:00 -07:00
Mel Gorman	a8bef8ff6e	mm: migration: avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks Page migration requires rmap to be able to find all ptes mapping a page at all times, otherwise the migration entry can be instantiated, but it is possible to leave one behind if the second rmap_walk fails to find the page. If this page is later faulted, migration_entry_to_page() will call BUG because the page is locked indicating the page was migrated by the migration PTE not cleaned up. For example kernel BUG at include/linux/swapops.h:105! invalid opcode: 0000 [#1] PREEMPT SMP ... Call Trace: [<ffffffff810e951a>] handle_mm_fault+0x3f8/0x76a [<ffffffff8130c7a2>] do_page_fault+0x44a/0x46e [<ffffffff813099b5>] page_fault+0x25/0x30 [<ffffffff8114de33>] load_elf_binary+0x152a/0x192b [<ffffffff8111329b>] search_binary_handler+0x173/0x313 [<ffffffff81114896>] do_execve+0x219/0x30a [<ffffffff8100a5c6>] sys_execve+0x43/0x5e [<ffffffff8100320a>] stub_execve+0x6a/0xc0 RIP [<ffffffff811094ff>] migration_entry_wait+0xc1/0x129 There is a race between shift_arg_pages and migration that triggers this bug. A temporary stack is setup during exec and later moved. If migration moves a page in the temporary stack and the VMA is then removed before migration completes, the migration PTE may not be found leading to a BUG when the stack is faulted. This patch causes pages within the temporary stack during exec to be skipped by migration. It does this by marking the VMA covering the temporary stack with an otherwise impossible combination of VMA flags. These flags are cleared when the temporary stack is moved to its final location. [kamezawa.hiroyu@jp.fujitsu.com: idea for having migration skip temporary stacks] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-25 08:06:59 -07:00
Naoya Horiguchi	1a5cb81465	pagemap: add #ifdefs CONFIG_HUGETLB_PAGE on code walking hugetlb vma If !CONFIG_HUGETLB_PAGE, pagemap_hugetlb_range() is never called. So put it (and its calling function) into #ifdef block. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-25 08:06:58 -07:00
Chris Mason	eaf25d933e	Btrfs: use async helpers for DIO write checksumming The async helper threads offload crc work onto all the CPUs, and make streaming writes much faster. This changes the O_DIRECT write code to use them. The only small complication was that we need to pass in the logical offset in the file for each bio, because we can't find it in the bio's pages. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-25 10:34:58 -04:00
Chris Mason	ed3b3d314c	Btrfs: don't walk around with task->state != TASK_RUNNING Yan Zheng noticed two places we were doing a lot of work without task->state set to TASK_RUNNING. This sets the state properly after we get ready to sleep but decide not to. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-25 10:34:58 -04:00
Josef Bacik	11c65dccf7	Btrfs: do aio_write instead of write In order for AIO to work, we need to implement aio_write. This patch converts our btrfs_file_write to btrfs_aio_write. I've tested this with xfstests and nothing broke, and the AIO stuff magically started working. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-25 10:34:57 -04:00
Josef Bacik	4b46fce233	Btrfs: add basic DIO read/write support This provides basic DIO support for reading and writing. It does not do the work to recover from mismatching checksums, that will come later. A few design changes have been made from Jim's code (sorry Jim!) 1) Use the generic direct-io code. Jim originally re-wrote all the generic DIO code in order to account for all of BTRFS's oddities, but thanks to that work it seems like the best bet is to just ignore compression and such and just opt to fallback on buffered IO. 2) Fallback on buffered IO for compressed or inline extents. Jim's code did it's own buffering to make dio with compressed extents work. Now we just fallback onto normal buffered IO. 3) Use ordered extents for the writes so that all of the lock_extent() lookup_ordered() type checks continue to work. 4) Do the lock_extent() lookup_ordered() loop in readpage so we don't race with DIO writes. I've tested this with fsx and everything works great. This patch depends on my dio and filemap.c patches to work. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-25 10:34:57 -04:00
Josef Bacik	c2c6ca417e	direct-io: do not merge logically non-contiguous requests Btrfs cannot handle having logically non-contiguous requests submitted. For example if you have Logical: [0-4095][HOLE][8192-12287] Physical: [0-4095] [4096-8191] Normally the DIO code would put these into the same BIO's. The problem is we need to know exactly what offset is associated with what BIO so we can do our checksumming and unlocking properly, so putting them in the same BIO doesn't work. So add another check where we submit the current BIO if the physical blocks are not contigous OR the logical blocks are not contiguous. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-25 10:34:56 -04:00
Josef Bacik	facd07b07d	direct-io: add a hook for the fs to provide its own submit_bio function Because BTRFS can do RAID and such, we need our own submit hook so we can setup the bio's in the correct fashion, and handle checksum errors properly. So there are a few changes here 1) The submit_io hook. This is straightforward, just call this instead of submit_bio. 2) Allow the fs to return -ENOTBLK for reads. Usually this has only worked for writes, since writes can fallback onto buffered IO. But BTRFS needs the option of falling back on buffered IO if it encounters a compressed extent, since we need to read the entire extent in and decompress it. So if we get -ENOTBLK back from get_block we'll return back and fallback on buffered just like the write case. I've tested these changes with fsx and everything seems to work. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-25 10:34:55 -04:00
Yan, Zheng	3fd0a5585e	Btrfs: Metadata ENOSPC handling for balance This patch adds metadata ENOSPC handling for the balance code. It is consisted by following major changes: 1. Avoid COW tree leave in the phrase of merging tree. 2. Handle interaction with snapshot creation. 3. make the backref cache can live across transactions. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-25 10:34:54 -04:00
Yan, Zheng	efa5646456	Btrfs: Pre-allocate space for data relocation Pre-allocate space for data relocation. This can detect ENOPSC condition caused by fragmentation of free space. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-25 10:34:53 -04:00
Yan, Zheng	4a500fd178	Btrfs: Metadata ENOSPC handling for tree log Previous patches make the allocater return -ENOSPC if there is no unreserved free metadata space. This patch updates tree log code and various other places to propagate/handle the ENOSPC error. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-25 10:34:53 -04:00
Yan, Zheng	d68fc57b7e	Btrfs: Metadata reservation for orphan inodes reserve metadata space for handling orphan inodes Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-25 10:34:52 -04:00
Yan, Zheng	8929ecfa50	Btrfs: Introduce global metadata reservation Reserve metadata space for extent tree, checksum tree and root tree Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-25 10:34:52 -04:00
Yan, Zheng	0ca1f7ceb1	Btrfs: Update metadata reservation for delayed allocation Introduce metadata reservation context for delayed allocation and update various related functions. This patch also introduces EXTENT_FIRST_DELALLOC control bit for set/clear_extent_bit. It tells set/clear_bit_hook whether they are processing the first extent_state with EXTENT_DELALLOC bit set. This change is important if set/clear_extent_bit involves multiple extent_state. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-25 10:34:51 -04:00
Yan, Zheng	a22285a6a3	Btrfs: Integrate metadata reservation with start_transaction Besides simplify the code, this change makes sure all metadata reservation for normal metadata operations are released after committing transaction. Changes since V1: Add code that check if unlink and rmdir will free space. Add ENOSPC handling for clone ioctl. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-25 10:34:50 -04:00
Yan, Zheng	f0486c68e4	Btrfs: Introduce contexts for metadata reservation Introducing metadata reseravtion contexts has two major advantages. First, it makes metadata reseravtion more traceable. Second, it can reclaim freed space and re-add them to the itself after transaction committed. Besides add btrfs_block_rsv structure and related helper functions, This patch contains following changes: Move code that decides if freed tree block should be pinned into btrfs_free_tree_block(). Make space accounting more accurate, mainly for handling read only block groups. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-25 10:34:50 -04:00
Yan, Zheng	2ead6ae770	Btrfs: Kill init_btrfs_i() All code in init_btrfs_i can be moved into btrfs_alloc_inode() Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-25 10:34:49 -04:00
Yan, Zheng	5da9d01b66	Btrfs: Shrink delay allocated space in a synchronized Shrink delayed allocation space in a synchronized manner is more controllable than flushing all delay allocated space in an async thread. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-25 10:34:48 -04:00
Yan, Zheng	424499dbd0	Btrfs: Kill allocate_wait in space_info We already have fs_info->chunk_mutex to avoid concurrent chunk creation. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-25 10:34:48 -04:00
Yan, Zheng	b742bb82f1	Btrfs: Link block groups of different raid types The size of reserved space is stored in space_info. If block groups of different raid types are linked to separate space_info, changing allocation profile will corrupt reserved space accounting. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-25 10:34:47 -04:00
Miklos Szeredi	c3021629a0	fuse: support splice() reading from fuse device Allow userspace filesystem implementation to use splice() to read from the fuse device. The userspace filesystem can now transfer data coming from a WRITE request to an arbitrary file descriptor (regular file, block device or socket) without having to go through a userspace buffer. The semantics of using splice() to read messages are: 1) with a single splice() call move the whole message from the fuse device to a temporary pipe 2) read the header from the pipe and determine the message type 3a) if message is a WRITE then splice data from pipe to destination 3b) else read rest of message to userspace buffer Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2010-05-25 15:06:07 +02:00
Miklos Szeredi	ce534fb052	fuse: allow splice to move pages When splicing buffers to the fuse device with SPLICE_F_MOVE, try to move pages from the pipe buffer into the page cache. This allows populating the fuse filesystem's cache without ever touching the page contents, i.e. zero copy read capability. The following steps are performed when trying to move a page into the page cache: - buf->ops->confirm() to make sure the new page is uptodate - buf->ops->steal() to try to remove the new page from it's previous place - remove_from_page_cache() on the old page - add_to_page_cache_locked() on the new page If any of the above steps fail (non fatally) then the code falls back to copying the page. In particular ->steal() will fail if there are external references (other than the page cache and the pipe buffer) to the page. Also since the remove_from_page_cache() + add_to_page_cache_locked() are non-atomic it is possible that the page cache is repopulated in between the two and add_to_page_cache_locked() will fail. This could be fixed by creating a new atomic replace_page_cache_page() function. fuse_readpages_end() needed to be reworked so it works even if page->mapping is NULL for some or all pages which can happen if the add_to_page_cache_locked() failed. A number of sanity checks were added to make sure the stolen pages don't have weird flags set, etc... These could be moved into generic splice/steal code. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2010-05-25 15:06:07 +02:00
Miklos Szeredi	dd3bb14f44	fuse: support splice() writing to fuse device Allow userspace filesystem implementation to use splice() to write to the fuse device. The semantics of using splice() are: 1) buffer the message header and data in a temporary pipe 2) with a single splice() call move the message from the temporary pipe to the fuse device The READ reply message has the most interesting use for this, since now the data from an arbitrary file descriptor (which could be a regular file, a block device or a socket) can be tranferred into the fuse device without having to go through a userspace buffer. It will also allow zero copy moving of pages. One caveat is that the protocol on the fuse device requires the length of the whole message to be written into the header. But the length of the data transferred into the temporary pipe may not be known in advance. The current library implementation works around this by using vmplice to write the header and modifying the header after splicing the data into the pipe (error handling omitted): struct fuse_out_header out; iov.iov_base = &out; iov.iov_len = sizeof(struct fuse_out_header); vmsplice(pip[1], &iov, 1, 0); len = splice(input_fd, input_offset, pip[1], NULL, len, 0); /* retrospectively modify the header: */ out.len = len + sizeof(struct fuse_out_header); splice(pip[0], NULL, fuse_chan_fd(req->ch), NULL, out.len, flags); This works since vmsplice only saves a pointer to the data, it does not copy the data itself. Since pipes are currently limited to 16 pages and messages need to be spliced atomically, the length of the data is limited to 15 pages (or 60kB for 4k pages). Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2010-05-25 15:06:06 +02:00
Miklos Szeredi	b5dd328537	fuse: get page reference for readpages Acquire a page ref on pages in ->readpages() and release them when the read has finished. Not acquiring a reference didn't seem to cause any trouble since the page is locked and will not be kicked out of the page cache during the read. However the following patches will want to remove the page from the cache so a separate ref is needed. Making the reference in req->pages explicit also makes the code easier to understand. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2010-05-25 15:06:06 +02:00
Miklos Szeredi	1bf94ca73e	fuse: use get_user_pages_fast() Replace uses of get_user_pages() with get_user_pages_fast(). It looks nicer and should be faster in most cases. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2010-05-25 15:06:06 +02:00
Dan Carpenter	4aa0edd294	fuse: remove unneeded variable "map" isn't needed any more after: `0bd87182d3` "fuse: fix kunmap in fuse_ioctl_copy_user" Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2010-05-25 15:06:05 +02:00
Nick Piggin	0ae0b5d055	fs/splice.c: fix mapping_gfp_mask usage mapping_gfp_mask() is not supposed to store allocation contex details, only page location details. So mapping_gfp_mask should be applied to the pagecache page allocation, wheras normal (kernel mapped) memory should be used for surrounding allocations such as radix-tree nodes allocated by add_to_page_cache. Context modifiers should be applied on a per-callsite basis. So change splice to follow this convention (which is followed in similar code patterns in core code). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-05-25 10:25:26 +02:00
Jens Axboe	b9598db340	pipe: make F_{GET,SET}PIPE_SZ deal with byte sizes Instead of requiring an exact number of pages as the argument and return value, change the API to deal with number of bytes instead. This also relaxes the requirement that the passed in size must result in a power-of-2 page array size. Round up to the nearest power-of-2 automatically and return the resulting size of the pipe on success. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-05-24 19:34:43 +02:00
Jens Axboe	0191f8697b	pipe: F_SETPIPE_SZ should return -EPERM for non-root If the passed in size is larger than what has been set as the system wide limit and the user is not root, we want to return permission denied (not invalid value). Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-05-24 19:15:57 +02:00
Alex Elder	88e88374ee	Merge branch 'delayed-logging-for-2.6.35' into for-linus	2010-05-24 11:57:36 -05:00
Dave Chinner	ccf7c23fc1	xfs: Ensure inode allocation buffers are fully replayed With delayed logging, we can get inode allocation buffers in the same transaction inode unlink buffers. We don't currently mark inode allocation buffers in the log, so inode unlink buffers take precedence over allocation buffers. The result is that when they are combined into the same checkpoint, only the unlinked inode chain fields are replayed, resulting in uninitialised inode buffers being detected when the next inode modification is replayed. To fix this, we need to ensure that we do not set the inode buffer flag in the buffer log item format flags if the inode allocation has not already hit the log. To avoid requiring a change to log recovery, we really need to make this a modification that relies only on in-memory sate. We can do this by checking during buffer log formatting (while the CIL cannot be flushed) if we are still in the same sequence when we commit the unlink transaction as the inode allocation transaction. If we are, then we do not add the inode buffer flag to the buffer log format item flags. This means the entire buffer will be replayed, not just the unlinked fields. We do this while CIL flusheѕ are locked out to ensure that we don't race with the sequence numbers changing and hence fail to put the inode buffer flag in the buffer format flags when we really need to. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-24 10:41:22 -05:00
Dave Chinner	df806158b0	xfs: enable background pushing of the CIL If we let the CIL grow without bound, it will grow large enough to violate recovery constraints (must be at least one complete transaction in the log at all times) or take forever to write out through the log buffers. Hence we need a check during asynchronous transactions as to whether the CIL needs to be pushed. We track the amount of log space the CIL consumes, so it is relatively simple to limit it on a pure size basis. Make the limit the minimum of just under half the log size (recovery constraint) or 8MB of log space (which is an awful lot of metadata). Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-24 10:38:20 -05:00
Dave Chinner	9da1ab181a	xfs: forced unmounts need to push the CIL If the filesystem is being shut down and the there is no log error, the current code forces out the current log buffers. This code now needs to push the CIL before it forces out the log buffers to acheive the same result. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-24 10:38:14 -05:00
Dave Chinner	71e330b593	xfs: Introduce delayed logging core code The delayed logging code only changes in-memory structures and as such can be enabled and disabled with a mount option. Add the mount option and emit a warning that this is an experimental feature that should not be used in production yet. We also need infrastructure to track committed items that have not yet been written to the log. This is what the Committed Item List (CIL) is for. The log item also needs to be extended to track the current log vector, the associated memory buffer and it's location in the Commit Item List. Extend the log item and log vector structures to enable this tracking. To maintain the current log format for transactions with delayed logging, we need to introduce a checkpoint transaction and a context for tracking each checkpoint from initiation to transaction completion. This includes adding a log ticket for tracking space log required/used by the context checkpoint. To track all the changes we need an io vector array per log item, rather than a single array for the entire transaction. Using the new log vector structure for this requires two passes - the first to allocate the log vector structures and chain them together, and the second to fill them out. This log vector chain can then be passed to the CIL for formatting, pinning and insertion into the CIL. Formatting of the log vector chain is relatively simple - it's just a loop over the iovecs on each log vector, but it is made slightly more complex because we re-write the iovec after the copy to point back at the memory buffer we just copied into. This code also needs to pin log items. If the log item is not already tracked in this checkpoint context, then it needs to be pinned. Otherwise it is already pinned and we don't need to pin it again. The only other complexity is calculating the amount of new log space the formatting has consumed. This needs to be accounted to the transaction in progress, and the accounting is made more complex becase we need also to steal space from it for log metadata in the checkpoint transaction. Calculate all this at insert time and update all the tickets, counters, etc correctly. Once we've formatted all the log items in the transaction, attach the busy extents to the checkpoint context so the busy extents live until checkpoint completion and can be processed at that point in time. Transactions can then be freed at this point in time. Now we need to issue checkpoints - we are tracking the amount of log space used by the items in the CIL, so we can trigger background checkpoints when the space usage gets to a certain threshold. Otherwise, checkpoints need ot be triggered when a log synchronisation point is reached - a log force event. Because the log write code already handles chained log vectors, writing the transaction is trivial, too. Construct a transaction header, add it to the head of the chain and write it into the log, then issue a commit record write. Then we can release the checkpoint log ticket and attach the context to the log buffer so it can be called during Io completion to complete the checkpoint. We also need to allow for synchronising multiple in-flight checkpoints. This is needed for two things - the first is to ensure that checkpoint commit records appear in the log in the correct sequence order (so they are replayed in the correct order). The second is so that xfs_log_force_lsn() operates correctly and only flushes and/or waits for the specific sequence it was provided with. To do this we need a wait variable and a list tracking the checkpoint commits in progress. We can walk this list and wait for the checkpoints to change state or complete easily, an this provides the necessary synchronisation for correct operation in both cases. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-24 10:38:03 -05:00
Dave Chinner	ed3b4d6cdc	xfs: Improve scalability of busy extent tracking When we free a metadata extent, we record it in the per-AG busy extent array so that it is not re-used before the freeing transaction hits the disk. This array is fixed size, so when it overflows we make further allocation transactions synchronous because we cannot track more freed extents until those transactions hit the disk and are completed. Under heavy mixed allocation and freeing workloads with large log buffers, we can overflow this array quite easily. Further, the array is sparsely populated, which means that inserts need to search for a free slot, and array searches often have to search many more slots that are actually used to check all the busy extents. Quite inefficient, really. To enable this aspect of extent freeing to scale better, we need a structure that can grow dynamically. While in other areas of XFS we have used radix trees, the extents being freed are at random locations on disk so are better suited to being indexed by an rbtree. So, use a per-AG rbtree indexed by block number to track busy extents. This incures a memory allocation when marking an extent busy, but should not occur too often in low memory situations. This should scale to an arbitrary number of extents so should not be a limitation for features such as in-memory aggregation of transactions. However, there are still situations where we can't avoid allocating busy extents (such as allocation from the AGFL). To minimise the overhead of such occurences, we need to avoid doing a synchronous log force while holding the AGF locked to ensure that the previous transactions are safely on disk before we use the extent. We can do this by marking the transaction doing the allocation as synchronous rather issuing a log force. Because of the locking involved and the ordering of transactions, the synchronous transaction provides the same guarantees as a synchronous log force because it ensures that all the prior transactions are already on disk when the synchronous transaction hits the disk. i.e. it preserves the free->allocate order of the extent correctly in recovery. By doing this, we avoid holding the AGF locked while log writes are in progress, hence reducing the length of time the lock is held and therefore we increase the rate at which we can allocate and free from the allocation group, thereby increasing overall throughput. The only problem with this approach is that when a metadata buffer is marked stale (e.g. a directory block is removed), then buffer remains pinned and locked until the log goes to disk. The issue here is that if that stale buffer is reallocated in a subsequent transaction, the attempt to lock that buffer in the transaction will hang waiting the log to go to disk to unlock and unpin the buffer. Hence if someone tries to lock a pinned, stale, locked buffer we need to push on the log to get it unlocked ASAP. Effectively we are trading off a guaranteed log force for a much less common trigger for log force to occur. Ideally we should not reallocate busy extents. That is a much more complex fix to the problem as it involves direct intervention in the allocation btree searches in many places. This is left to a future set of modifications. Finally, now that we track busy extents in allocated memory, we don't need the descriptors in the transaction structure to point to them. We can replace the complex busy chunk infrastructure with a simple linked list of busy extents. This allows us to remove a large chunk of code, making the overall change a net reduction in code size. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-24 10:34:00 -05:00
Dave Chinner	955833cf2a	xfs: make the log ticket ID available outside the log infrastructure The ticket ID is needed to uniquely identify transactions when doing busy extent matching. Delayed logging changes the lifecycle of busy extents with respect to the transaction structure lifecycle. Hence we can no longer use the transaction structure as a means of determining the owner of the busy extent as it may be freed and reused while the busy extent is still active. This commit provides the infrastructure to access the xlog_tid_t held in the ticket from a transaction handle. This avoids the need for callers to peek into the transaction and log structures to find this out. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-24 10:33:52 -05:00
Dave Chinner	169a7b078e	xfs: clean up log ticket overrun debug output Push the error message output when a ticket overrun is detected into the ticket printing functions. Also remove the debug version of the code as the production version will still panic just as effectively on a debug kernel via the panic mask being set. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-24 10:33:46 -05:00
Dave Chinner	c11554104f	xfs: Clean up XFS_BLI_* flag namespace Clean up the buffer log format (XFS_BLI_) flags because they have a polluted namespace. They XFS_BLI_ prefix is used for both in-memory and on-disk flag feilds, but have overlapping values for different flags. Rename the buffer log format flags to use the XFS_BLF_ prefix to avoid confusing them with the in-memory XFS_BLI_* prefixed flags. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-24 10:33:39 -05:00
Dave Chinner	64fc35de60	xfs: modify buffer item reference counting The buffer log item reference counts used to take referenceѕ for every transaction, similar to the pin counting. This is symmetric (like the pin/unpin) with respect to transaction completion, but with dleayed logging becomes assymetric as the pinning becomes assymetric w.r.t. transaction completion. To make both cases the same, allow the buffer pinning to take a reference to the buffer log item and always drop the reference the transaction has on it when being unlocked. This is balanced correctly because the unpin operation always drops a reference to the log item. Hence reference counting becomes symmetric w.r.t. item pinning as well as w.r.t active transactions and as a result the reference counting model remain consistent between normal and delayed logging. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-24 10:33:31 -05:00
Dave Chinner	3383ca5780	xfs: allow log ticket allocation to take allocation flags Delayed logging currently requires ticket allocation to succeed, so we need to be able to sleep on allocation. It also should not allow memory allocation to recurse into the filesystem. hence we need to pass allocation flags directing the type of allocation the caller requires. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-24 10:33:17 -05:00
Dave Chinner	524ee36fa4	xfs: Don't reuse the same transaction ID for duplicated transactions. The transaction ID is written into the log as the unique identifier for transactions during recover. When duplicating a transaction, we reuse the log ticket, which means it has the same transaction ID as the previous transaction. Rather than regenerating a random transaction ID for the duplicated transaction, just add one to the current ID so that duplicated transaction can be easily spotted in the log and during recovery during problem diagnosis. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-24 10:33:10 -05:00
Linus Torvalds	f13771187b	Merge branch 'bkl/ioctl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing * 'bkl/ioctl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing: uml: Pushdown the bkl from harddog_kern ioctl sunrpc: Pushdown the bkl from sunrpc cache ioctl sunrpc: Pushdown the bkl from ioctl autofs4: Pushdown the bkl from ioctl uml: Convert to unlocked_ioctls to remove implicit BKL ncpfs: BKL ioctl pushdown coda: Clean-up whitespace problems in pioctl.c coda: BKL ioctl pushdown drivers: Push down BKL into various drivers isdn: Push down BKL into ioctl functions scsi: Push down BKL into ioctl functions dvb: Push down BKL into ioctl functions smbfs: Push down BKL into ioctl function coda/psdev: Remove BKL from ioctl function um/mmapper: Remove BKL usage sn_hwperf: Kill BKL usage hfsplus: Push down BKL into ioctl function	2010-05-24 08:01:10 -07:00
Linus Torvalds	0163916f1d	Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osd * 'for-linus' of git://git.open-osd.org/linux-open-osd: exofs: confusion between kmap() and kmap_atomic() api exofs: Add default address_space_operations	2010-05-24 07:57:41 -07:00
Linus Torvalds	3e766fd41d	Merge git://git.kernel.org/pub/scm/linux/kernel/git/hirofumi/fatfs-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/hirofumi/fatfs-2.6: fat: convert to unlocked_ioctl fat: Cleanup nls_unload() usage fat: use pack_hex_byte() instead of custom one	2010-05-24 07:41:47 -07:00
Linus Torvalds	4fd5ec509b	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs: 9p: Optimize TCREATE by eliminating a redundant fid clone. 9p: cleanup: remove unneeded assignment 9p: Add mksock support fs/9p: Make sure we properly instantiate dentry. 9p: add 9P2000.L rename operation 9p: add 9P2000.L statfs operation 9p: VFS switches for 9p2000.L: VFS switches 9p: VFS switches for 9p2000.L: protocol and client changes	2010-05-24 07:41:13 -07:00
Linus Torvalds	6e188240eb	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (59 commits) ceph: reuse mon subscribe message instead of allocated anew ceph: avoid resending queued message to monitor ceph: Storage class should be before const qualifier ceph: all allocation functions should get gfp_mask ceph: specify max_bytes on readdir replies ceph: cleanup pool op strings ceph: Use kzalloc ceph: use common helper for aborted dir request invalidation ceph: cope with out of order (unsafe after safe) mds reply ceph: save peer feature bits in connection structure ceph: resync headers with userland ceph: use ceph. prefix for virtual xattrs ceph: throw out dirty caps metadata, data on session teardown ceph: attempt mds reconnect if mds closes our session ceph: clean up send_mds_reconnect interface ceph: wait for mds OPEN reply to indicate reconnect success ceph: only send cap releases when mds is OPEN\|HUNG ceph: dicard cap releases on mds restart ceph: make mon client statfs handling more generic ceph: drop src address(es) from message header [new protocol feature] ...	2010-05-24 07:37:52 -07:00
Steven Whitehouse	7df0e0397b	GFS2: Fix permissions checking for setflags ioctl() We should be checking for the ownership of the file for which flags are being set, rather than just for write access. Reported-by: Dan Rosenberg <dan.j.rosenberg@gmail.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2010-05-24 14:36:48 +01:00
Jan Kara	8f45c33dec	ufs: Remove dead quota code UFS quota is non-functional at least since 2.6.12 because dq_op was set to NULL. Since the filesystem exists mainly to allow cooperation with Solaris and quota format isn't standard, just remove the dead code. CC: Evgeniy Dushistov <dushistov@mail.ru> Signed-off-by: Jan Kara <jack@suse.cz>	2010-05-24 14:10:19 +02:00
Jan Kara	3635046281	udf: Remove dead quota code Quota on UDF is non-functional at least since 2.6.16 (I'm too lazy to do more archeology) because it does not provide .quota_write and .quota_read functions and thus quotaon(8) just returns EINVAL. Since nobody complained for all those years and quota support is not even in UDF standard just nuke it. Signed-off-by: Jan Kara <jack@suse.cz>	2010-05-24 14:10:19 +02:00
Christoph Hellwig	287a80958c	quota: rename default quotactl methods to dquot_ Follow the dquot_* style used elsewhere in dquot.c. [Jan Kara: Fixed up missing conversion of ext2] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>	2010-05-24 14:10:17 +02:00
Christoph Hellwig	123e9caf1e	quota: explicitly set ->dq_op and ->s_qcop Only set the quota operation vectors if the filesystem actually supports quota instead of doing it for all filesystems in alloc_super(). [Jan Kara: Export dquot_operations and vfs_quotactl_ops] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>	2010-05-24 14:10:17 +02:00
Christoph Hellwig	307ae18a56	quota: drop remount argument to ->quota_on and ->quota_off Remount handling has fully moved into the filesystem, so all this is superflous now. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>	2010-05-24 14:09:12 +02:00
Christoph Hellwig	e0ccfd959c	quota: move unmount handling into the filesystem Currently the VFS calls into the quotactl interface for unmounting filesystems. This means filesystems with their own quota handling can't easily distinguish between user-space originating quotaoff and an unount. Instead move the responsibily of the unmount handling into the filesystem to be consistent with all other dquot handling. Note that we do call dquot_disable a lot later now, e.g. after a sync_filesystem. But this is fine as the quota code does all its writes via blockdev's mapping and that is synced even later. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>	2010-05-24 14:09:12 +02:00
Christoph Hellwig	0f0dd62fdd	quota: kill the vfs_dq_off and vfs_dq_quota_on_remount wrappers Instead of having wrappers in the VFS namespace export the dquot_suspend and dquot_resume helpers directly. Also rename vfs_quota_disable to dquot_disable while we're at it. [Jan Kara: Moved dquot_suspend to quotaops.h and made it inline] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>	2010-05-24 14:06:40 +02:00
Christoph Hellwig	c79d967de3	quota: move remount handling into the filesystem Currently do_remount_sb calls into the dquot code to tell it about going from rw to ro and ro to rw. Move this code into the filesystem to not depend on the dquot code in the VFS - note ocfs2 already ignores these calls and handles remount by itself. This gets rid of overloading the quotactl calls and allows to unify the VFS and XFS codepaths in that area later. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>	2010-05-24 14:06:39 +02:00
Jan Kara	eea7feb072	ocfs2: Fix use after free on remount read-only We also have to cancel quota syncing thread on remount read only because at that moment quota is being turned off. Otherwise quota syncing thread will try to access already freed quota structures. Signed-off-by: Jan Kara <jack@suse.cz>	2010-05-24 14:06:39 +02:00
Phillip Lougher	5c80f5aa40	squashfs: fix name reading in squashfs_xattr_get Only read potentially matching names into the target buffer, all obviously non matching names don't need to be read into the target buffer. Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>	2010-05-23 08:27:42 +01:00
Phillip Lougher	f6db25a876	squashfs: constify xattr handlers Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>	2010-05-23 03:35:05 +01:00
Venkateswararao Jujjuri	6d27e64d74	9p: Optimize TCREATE by eliminating a redundant fid clone. This patch removes a redundant fid clone on the directory fid and hence reduces a server transaction while creating new filesystem object. Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2010-05-22 12:39:02 -05:00
Dan Carpenter	fe5bd0736b	9p: cleanup: remove unneeded assignment We never use "v9ses" and so we can remove it. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2010-05-22 12:34:12 -05:00
Venkateswararao Jujjuri	75cc5c9b82	9p: Add mksock support Without this patch, an attempt to mksock will get an EINVAL. Before this patch: [root@localhost 1dir]# mksock mysock mksock: error making mysock: Invalid argument With this patch: [root@localhost 1dir]# mksock mysock [root@localhost 1dir]# ls -l mysock s--------- 1 root root 0 2010-03-31 17:44 mysock Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2010-05-22 12:34:12 -05:00
Aneesh Kumar K.V	85e0df240e	fs/9p: Make sure we properly instantiate dentry. For lookup if we get ENOENT error from the server we still instantiate the dentry. We need to make sure we have dentry operations set in that case so that a later dput on the dentry does the expected. Without the patch we get the below error #ln -sf abc abclink ln: creating symbolic link `abclink': No such file or directory Now on the host do $ touch abclink Guest now gives ENOENT error. # ls ls: cannot access abclink: No such file or directory Debugged-by:Gautham R Shenoy <ego@in.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2010-05-22 12:34:11 -05:00
Frederic Weisbecker	3663df70c0	autofs4: Pushdown the bkl from ioctl Pushdown the bkl to autofs4_root_ioctl. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ian Kent <raven@themaw.net> Cc: Autofs <autofs@linux.kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: John Kacur <jkacur@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de>	2010-05-22 17:44:18 +02:00
Linus Torvalds	e8bebe2f71	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (69 commits) fix handling of offsets in cris eeprom.c, get rid of fake on-stack files get rid of home-grown mutex in cris eeprom.c switch ecryptfs_write() to struct inode , kill on-stack fake files switch ecryptfs_get_locked_page() to struct inode simplify access to ecryptfs inodes in ->readpage() and friends AFS: Don't put struct file on the stack Ban ecryptfs over ecryptfs logfs: replace inode uid,gid,mode initialization with helper function ufs: replace inode uid,gid,mode initialization with helper function udf: replace inode uid,gid,mode init with helper ubifs: replace inode uid,gid,mode initialization with helper function sysv: replace inode uid,gid,mode initialization with helper function reiserfs: replace inode uid,gid,mode initialization with helper function ramfs: replace inode uid,gid,mode initialization with helper function omfs: replace inode uid,gid,mode initialization with helper function bfs: replace inode uid,gid,mode initialization with helper function ocfs2: replace inode uid,gid,mode initialization with helper function nilfs2: replace inode uid,gid,mode initialization with helper function minix: replace inode uid,gid,mode init with helper ext4: replace inode uid,gid,mode init with helper ... Trivial conflict in fs/fs-writeback.c (mark bitfields unsigned)	2010-05-21 19:37:45 -07:00
Sage Weil	240ed68eb5	ceph: reuse mon subscribe message instead of allocated anew Use the same message, allocated during startup. No need to reallocate a new one each time around (and potentially ENOMEM). Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-21 16:26:11 -07:00
Al Viro	48c1e44ace	switch ecryptfs_write() to struct inode *, kill on-stack fake files Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:28 -04:00
Al Viro	02bd97997a	switch ecryptfs_get_locked_page() to struct inode * Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:28 -04:00
Al Viro	bef5bc2464	simplify access to ecryptfs inodes in ->readpage() and friends we can get to them from page->mapping->host, no need to mess with file. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:28 -04:00
Al Viro	f6d335c08d	AFS: Don't put struct file on the stack Don't put struct file on the stack as it takes up quite a lot of space and violates lifetime rules for struct file. Rather than calling afs_readpage() indirectly from the directory routines by way of read_mapping_page(), split afs_readpage() to have afs_page_filler() that's given a key instead of a file and call read_cache_page(), specifying the new function directly. Use it in afs_readpages() as well. Also make use of this in afs_mntpt_check_symlink() too for the same reason. Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David Howells <dhowells@redhat.com>	2010-05-21 18:31:28 -04:00
Al Viro	4403158ba2	Ban ecryptfs over ecryptfs This is a seriously simplified patch from Eric Sandeen; copy of rationale follows: === mounting stacked ecryptfs on ecryptfs has been shown to lead to bugs in testing. For crypto info in xattr, there is no mechanism for handling this at all, and for normal file headers, we run into other trouble: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 IP: [<ffffffffa015b0b3>] ecryptfs_d_revalidate+0x43/0xa0 [ecryptfs] ... There doesn't seem to be any good usecase for this, so I'd suggest just disallowing the configuration. Based on a patch originally, I believe, from Mike Halcrow. === Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:27 -04:00
Al Viro	ab9a79b966	logfs: replace inode uid,gid,mode initialization with helper function Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:27 -04:00
Dmitry Monakhov	be8ded5974	ufs: replace inode uid,gid,mode initialization with helper function Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:27 -04:00
Dmitry Monakhov	a6c5a0342a	udf: replace inode uid,gid,mode init with helper Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:27 -04:00
Dmitry Monakhov	abf5d08aca	ubifs: replace inode uid,gid,mode initialization with helper function Acked-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:26 -04:00
Dmitry Monakhov	85640bd9d4	sysv: replace inode uid,gid,mode initialization with helper function Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:26 -04:00
Dmitry Monakhov	04b7ed0d33	reiserfs: replace inode uid,gid,mode initialization with helper function Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:26 -04:00
Dmitry Monakhov	454abafe9d	ramfs: replace inode uid,gid,mode initialization with helper function - seems what ramfs_get_inode is only locally, make it static. [AV: the hell it is; it's used by shmem, so shmem needed conversion too and no, that function can't be made static] Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:26 -04:00
Dmitry Monakhov	6a9e652c88	omfs: replace inode uid,gid,mode initialization with helper function Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:25 -04:00
Dmitry Monakhov	e6ecdc70fb	bfs: replace inode uid,gid,mode initialization with helper function Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:25 -04:00
Dmitry Monakhov	75fe0a2477	ocfs2: replace inode uid,gid,mode initialization with helper function Acked-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:25 -04:00
Dmitry Monakhov	73459dcc67	nilfs2: replace inode uid,gid,mode initialization with helper function Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:25 -04:00
Dmitry Monakhov	9eed1fb721	minix: replace inode uid,gid,mode init with helper - also redesign minix_new_inode interface Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:24 -04:00
Dmitry Monakhov	b10b852090	ext4: replace inode uid,gid,mode init with helper Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:24 -04:00
Dmitry Monakhov	aab99c2c26	ext3: replace inode uid,gid,mode init with helper Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:24 -04:00
Dmitry Monakhov	ffba102d75	ext2: replace inode uid,gid,mode init with helper Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:24 -04:00
Dmitry Monakhov	e00117f14f	exofs: replace inode uid,gid,mode initialization with helper function Ack-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:23 -04:00
Dmitry Monakhov	ecc11fabf7	btrfs: replace inode uid,gid,mode initialization with helper function Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:23 -04:00
Dmitry Monakhov	319b2be49e	jfs: replace inode uid,gid,mode init with helper Acked-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:23 -04:00
Dmitry Monakhov	217f206d68	9p: replace inode uid,gid,mode initialization with helper function Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:23 -04:00
Dmitry Monakhov	a1bd120d13	vfs: Add inode uid,gid,mode init helper Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:22 -04:00
H Hartley Sweeten	52957fe1c7	fs-writeback.c: bitfields should be unsigned This fixes sparse noise: error: dubious one-bit signed bitfield Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:22 -04:00
Huang Shijie	9a2296832c	namei.c : update mnt when it needed update the mnt of the path when it is not equal to the new one. Signed-off-by: Huang Shijie <shijie8@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:22 -04:00
Roland Dreier	51ee049e77	vfs: add lockdep annotation to s_vfs_rename_key for ecryptfs > ============================================= > [ INFO: possible recursive locking detected ] > 2.6.31-2-generic #14~rbd3 > --------------------------------------------- > firefox-3.5/4162 is trying to acquire lock: > (&s->s_vfs_rename_mutex){+.+.+.}, at: [<ffffffff81139d31>] lock_rename+0x41/0xf0 > > but task is already holding lock: > (&s->s_vfs_rename_mutex){+.+.+.}, at: [<ffffffff81139d31>] lock_rename+0x41/0xf0 > > other info that might help us debug this: > 3 locks held by firefox-3.5/4162: > #0: (&s->s_vfs_rename_mutex){+.+.+.}, at: [<ffffffff81139d31>] lock_rename+0x41/0xf0 > #1: (&sb->s_type->i_mutex_key#11/1){+.+.+.}, at: [<ffffffff81139d5a>] lock_rename+0x6a/0xf0 > #2: (&sb->s_type->i_mutex_key#11/2){+.+.+.}, at: [<ffffffff81139d6f>] lock_rename+0x7f/0xf0 > > stack backtrace: > Pid: 4162, comm: firefox-3.5 Tainted: G C 2.6.31-2-generic #14~rbd3 > Call Trace: > [<ffffffff8108ae74>] print_deadlock_bug+0xf4/0x100 > [<ffffffff8108ce26>] validate_chain+0x4c6/0x750 > [<ffffffff8108d2e7>] __lock_acquire+0x237/0x430 > [<ffffffff8108d585>] lock_acquire+0xa5/0x150 > [<ffffffff81139d31>] ? lock_rename+0x41/0xf0 > [<ffffffff815526ad>] __mutex_lock_common+0x4d/0x3d0 > [<ffffffff81139d31>] ? lock_rename+0x41/0xf0 > [<ffffffff81139d31>] ? lock_rename+0x41/0xf0 > [<ffffffff8120eaf9>] ? ecryptfs_rename+0x99/0x170 > [<ffffffff81552b36>] mutex_lock_nested+0x46/0x60 > [<ffffffff81139d31>] lock_rename+0x41/0xf0 > [<ffffffff8120eb2a>] ecryptfs_rename+0xca/0x170 > [<ffffffff81139a9e>] vfs_rename_dir+0x13e/0x160 > [<ffffffff8113ac7e>] vfs_rename+0xee/0x290 > [<ffffffff8113c212>] ? __lookup_hash+0x102/0x160 > [<ffffffff8113d512>] sys_renameat+0x252/0x280 > [<ffffffff81133eb4>] ? cp_new_stat+0xe4/0x100 > [<ffffffff8101316a>] ? sysret_check+0x2e/0x69 > [<ffffffff8108c34d>] ? trace_hardirqs_on_caller+0x14d/0x190 > [<ffffffff8113d55b>] sys_rename+0x1b/0x20 > [<ffffffff81013132>] system_call_fastpath+0x16/0x1b The trace above is totally reproducible by doing a cross-directory rename on an ecryptfs directory. The issue seems to be that sys_renameat() does lock_rename() then calls into the filesystem; if the filesystem is ecryptfs, then ecryptfs_rename() again does lock_rename() on the lower filesystem, and lockdep can't tell that the two s_vfs_rename_mutexes are different. It seems an annotation like the following is sufficient to fix this (it does get rid of the lockdep trace in my simple tests); however I would like to make sure I'm not misunderstanding the locking, hence the CC list... Signed-off-by: Roland Dreier <rdreier@cisco.com> Cc: Tyler Hicks <tyhicks@linux.vnet.ibm.com> Cc: Dustin Kirkland <kirkland@canonical.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:22 -04:00
Cesar Eduardo Barros	cc9106247d	fs/partitions: use ADDPART_FLAG_RAID instead of magic number ADDPART_FLAG_RAID was introduced in commit `d18d768`, and most places were converted to use it instead of a hardcoded value. However, some places seem to have been missed. Change all of them to the symbolic names via the following semantic patch: @@ struct parsed_partitions *state; expression E; @@ ( - state->parts[E].flags = 1 + state->parts[E].flags = ADDPART_FLAG_RAID \| - state->parts[E].flags \|= 1 + state->parts[E].flags \|= ADDPART_FLAG_RAID \| - state->parts[E].flags = 2 + state->parts[E].flags = ADDPART_FLAG_WHOLEDISK \| - state->parts[E].flags \|= 2 + state->parts[E].flags \|= ADDPART_FLAG_WHOLEDISK ) Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:21 -04:00
Christoph Hellwig	8018ab0574	sanitize vfs_fsync calling conventions Now that the last user passing a NULL file pointer is gone we can remove the redundant dentry argument and associated hacks inside vfs_fsynmc_range. The next step will be removig the dentry argument from ->fsync, but given the luck with the last round of method prototype changes I'd rather defer this until after the main merge window. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:21 -04:00
Christoph Hellwig	e970a573ce	nfsd: open a file descriptor for fsync in nfs4 recovery Instead of just looking up a path use do_filp_open to get us a file structure for the nfs4 recovery directory. This allows us to get rid of the last non-standard vfs_fsync caller with a NULL file pointer. [AV: should be using fput(), not filp_close()] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:21 -04:00
Richard Kennedy	2e147f1ef7	fs: inode.c use atomic_inc_return in __iget Using atomic_inc_return in __iget(struct inode *inode) makes the intent of this code clearer and generates less code on processors that have this operation. On x86_64 this patch reduces the text size of inode.o by 12 bytes. Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk> ---- patch against 2.6.34-rc7 compiled & tested on x86_64 AMD X2 I've been running with this patch applied for several weeks with no obvious problems. regards Richard Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:21 -04:00
Eric Paris	a7cf4145bb	anon_inode: set S_IFREG on the anon_inode anon_inode_mkinode() sets inode->i_mode = S_IRUSR \| S_IWUSR; This means that (inode->i_mode & S_IFMT) == 0. This trips up some SELinux code that needs to determine if a given inode is a regular file, a directory, etc. The easiest solution is to just make sure that the anon_inode also sets S_IFREG. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:20 -04:00
Stephen Hemminger	b7bb0a1291	gfs: constify xattr_handler Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:20 -04:00
Stephen Hemminger	537d81ca7c	ocfs: constify xattr_handler Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:20 -04:00
Stephen Hemminger	365f0cb9d2	jffs2: constify xattr_handler Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:20 -04:00
Stephen Hemminger	46e58764f0	xfs: constify xattr_handler Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:19 -04:00
Stephen Hemminger	94d09a98cd	reiserfs: constify xattr_handler Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:19 -04:00
Stephen Hemminger	11e2752807	ext4: constify xattr_handler Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:19 -04:00
Stephen Hemminger	d1f21049f9	ext3: constify xattr handlers Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:19 -04:00
Stephen Hemminger	749c72efa4	ext2: constify xattr_handler Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:18 -04:00
Stephen Hemminger	f01cbd3f81	btrfs: constify xattr_handler Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:18 -04:00
Stephen Hemminger	bb4354538e	fs: xattr_handler table should be const The entries in xattr handler table should be immutable (ie const) like other operation tables. Later patches convert common filesystems. Uncoverted filesystems will still work, but will generate a compiler warning. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:18 -04:00
Josef Bacik	18e9e5104f	Introduce freeze_super and thaw_super for the fsfreeze ioctl Currently the way we do freezing is by passing sb>s_bdev to freeze_bdev and then letting it do all the work. But freezing is more of an fs thing, and doesn't really have much to do with the bdev at all, all the work gets done with the super. In btrfs we do not populate s_bdev, since we can have multiple bdev's for one fs and setting s_bdev makes removing devices from a pool kind of tricky. This means that freezing a btrfs filesystem fails, which causes us to corrupt with things like tux-on-ice which use the fsfreeze mechanism. So instead of populating sb->s_bdev with a random bdev in our pool, I've broken the actual fs freezing stuff into freeze_super and thaw_super. These just take the super_block that we're freezing and does the appropriate work. It's basically just copy and pasted from freeze_bdev. I've then converted freeze_bdev over to use the new super helpers. I've tested this with ext4 and btrfs and verified everything continues to work the same as before. The only new gotcha is multiple calls to the fsfreeze ioctl will return EBUSY if the fs is already frozen. I thought this was a better solution than adding a freeze counter to the super_block, but if everybody hates this idea I'm open to suggestions. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:18 -04:00
Al Viro	e1e46bf186	Trim includes in fs/super.c Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:17 -04:00
Al Viro	d3f2147307	Move grabbing s_umount to callers of grab_super() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:17 -04:00
Al Viro	7ed1ee6118	Take statfs variants to fs/statfs.c Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:17 -04:00
Al Viro	01a05b337a	new helper: iterate_supers() ... and switch the simple "loop over superblocks and do something" loops to it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:16 -04:00
Al Viro	35cf7ba0b4	Bury __put_super_and_need_restart() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:16 -04:00
Al Viro	79893c17b4	fix prune_dcache()/umount() race ... and get rid of the last __put_super_and_need_restart() caller Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:16 -04:00
Al Viro	df40c01a92	In get_super() and user_get_super() restarts are unconditional If superblock had been still alive, we would've returned it... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:16 -04:00
Al Viro	1494583de5	fix get_active_super()/umount() race This one needs restarts... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:15 -04:00
Al Viro	e7fe0585ca	fix do_emergency_remount()/umount() races need list_for_each_entry_safe() here. Original didn't even have restart logics, so if you race with umount() it blew up. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:15 -04:00
Al Viro	6754af6464	Convert simple loops over superblocks to list_for_each_entry_safe Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:15 -04:00
Al Viro	8edd64bd60	get rid of restarts in sync_filesystems() At the same time we can kill s_need_restart and local mutex in there. __put_super() made public for a while; will be gone later. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:15 -04:00
Al Viro	551de6f34d	Leave superblocks on s_list until the end We used to remove from s_list and s_instances at the same time. So let's not do the former and skip superblocks that have empty s_instances in the loops over s_list. The next step, of course, will be to get rid of rescan logics in those loops. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:14 -04:00
Al Viro	1712ac8fda	Saner locking around deactivate_super() Make sure that s_umount is acquired before we drop the final active reference; we still have the fast path (atomic_dec_unless) and we have gotten rid of the window between the moment when s_active hits zero and s_umount is acquired. Which simplifies the living hell out of grab_super() and inotify pin_to_kill() stuff. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:14 -04:00
Al Viro	b20bd1a5e7	get rid of S_BIAS use atomic_inc_not_zero(&sb->s_active) instead of playing games with checking ->s_count > S_BIAS Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:14 -04:00
Al Viro	389b8be6ef	get rid of open-coded grab_super() in get_active_super() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:14 -04:00
Al Viro	3981f2e2a0	ceph: should use deactivate_locked_super() on failure exits Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:13 -04:00
Al Viro	2ccde7c631	Clean ecryptfs ->get_sb() up Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:13 -04:00
Al Viro	decabd6650	fix a couple of ecryptfs leaks First of all, get_sb_nodev() grabs anon dev minor and we never free it in ecryptfs ->kill_sb(). Moreover, on one of the failure exits in ecryptfs_get_sb() we leak things - it happens before we set ->s_root and ->put_super() won't be called in that case. Solution: kill ->put_super(), do all that stuff in ->kill_sb(). And use kill_anon_sb() instead of generic_shutdown_super() to deal with anon dev leak. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:13 -04:00
Al Viro	894680710d	Simplify devpts_get_sb() failure exits postpone simple_set_mnt() until we know we won't fail. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:12 -04:00
Christoph Hellwig	a135aa2cd7	remove incorrect comment in do_emergency_remount Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:12 -04:00
Al Viro	13e3c5e5b9	clean DCACHE_CANT_MOUNT in d_delete() We set the "it's dead, don't mount on it" flag _and_ do not remove it if we turn the damn thing negative and leave it around. And if it goes positive afterwards, well... Fortunately, there's only one place where that needs to be caught: only d_delete() can turn the sucker negative without immediately freeing it; all other places that can lead to ->d_iput() call are followed by unconditionally freeing struct dentry in question. So the fix is obvious: Addresses https://bugzilla.kernel.org/show_bug.cgi?id=16014 Reported-by: Adam Tkac <vonsch@gmail.com> Tested-by: Adam Tkac <vonsch@gmail.com> Cc: <stable@kernel.org> [2.6.34.x] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:12 -04:00
Sage Weil	970690012c	ceph: avoid resending queued message to monitor The auth_reply handler will (re)send any pending requests. For the initial mon authenticate phase, that's correct, but when a auth ticket renewal races with an in-flight request, we may resend a request message that is already in flight. Avoid this by revoking the message before sending it. We should also avoid resending requests at all during ticket renewal; that will come soon. Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-21 15:01:22 -07:00
Tobias Klauser	9e32789f63	ceph: Storage class should be before const qualifier The C99 specification states in section 6.11.5: The placement of a storage-class specifier other than at the beginning of the declaration specifiers in a declaration is an obsolescent feature. Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-21 15:01:21 -07:00
Sripathi Kodi	4681dbdacb	9p: add 9P2000.L rename operation I made a V2 of this patch on top of my patches for VFS switches. All the changes were due to change in some offsets. rename - change name of file or directory size[4] Trename tag[2] fid[4] newdirfid[4] name[s] size[4] Rrename tag[2] The rename message is used to change the name of a file, possibly moving it to a new directory. The 9P wstat message can only rename a file within the same directory. Signed-off-by: Jim Garlick <garlick@llnl.gov> Signed-off-by: Sripathi Kodi <sripathik@in.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2010-05-21 16:44:34 -05:00
Sripathi Kodi	bda8e77520	9p: add 9P2000.L statfs operation I made a V2 of this patch on top of my patches for VFS switches. The change was adding v9fs_statfs pointer to v9fs_super_ops_dotl instead of v9fs_super_ops. statfs - get file system statistics size[4] Tstatfs tag[2] fid[4] size[4] Rstatfs tag[2] type[4] bsize[4] blocks[8] bfree[8] bavail[8] files[8] ffree[8] fsid[8] namelen[4] The statfs message is used to request file system information returned by the statfs(2) system call, which is used by df(1) to report file system and disk space usage. Signed-off-by: Jim Garlick <garlick@llnl.gov> Signed-off-by: Sripathi Kodi <sripathik@in.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2010-05-21 16:44:33 -05:00
Sripathi Kodi	9b6533c9b3	9p: VFS switches for 9p2000.L: VFS switches Implements VFS switches for 9p2000.L protocol. Signed-off-by: Sripathi Kodi <sripathik@in.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2010-05-21 16:44:33 -05:00

... 4 5 6 7 8 ...

18668 Commits