linux-next

mirror of https://github.com/edk2-porting/linux-next.git synced 2024-12-21 03:33:59 +08:00

Author	SHA1	Message	Date
Eric Sandeen	0e6360274f	btrfs: add missing break in btrfs_print_leaf() I don't think that BTRFS_DEV_EXTENT_KEY is supposed to fall through to BTRFS_DEV_STATS_KEY ... Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:20 -05:00
Eric Sandeen	1c697d4acc	btrfs: annotate intentional switch case fallthroughs This keeps static checkers happy. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:19 -05:00
Eric Sandeen	aa43a17c21	btrfs: handle null fs_info in btrfs_panic() At least backref_tree_panic() can apparently pass in a null fs_info, so handle that in __btrfs_panic to get the message out on the console. The btrfs_panic macro also uses fs_info, but that's largely pointless; it's testing to see if BTRFS_MOUNT_PANIC_ON_FATAL_ERROR is not set. But if it were set, __btrfs_panic() would have, well, paniced and we wouldn't be here, testing it! So just BUG() at this point. And since we only use fs_info once now, just use it directly. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:18 -05:00
Eric Sandeen	5a01604783	btrfs: remove unused fs_info from btrfs_decode_error() Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:17 -05:00
Eric Sandeen	d1d3cd27a3	btrfs: list_entry can't return NULL No need to test the result, we can't get a null pointer from list_entry() Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:15 -05:00
Eric Sandeen	b4c6f7b75c	btrfs: remove unused fd in btrfs_ioctl_send() All we do is set it to NULL and test it :) Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:14 -05:00
Josef Bacik	96f1bb5777	Btrfs: do not overcommit if we don't have enough space for global rsv Because of how little we allocate chunks now we can get really tight on metadata space before we will allocate a new chunk. This resulted in being unable to add device extents when allocating a new metadata chunk as we did not have enough space. This is because we were allowed to overcommit too much metadata without actually making sure we had enough space to make allocations. The idea behind overcommit is that we are allowed to say "sure you can have that reservation" when most of the free space is occupied by reservations, not actual allocations. But in this case where a majority of the total space is in use by actual allocations we can screw ourselves by not being able to make real allocations when it matters. So make sure we have enough real space for our global reserve, and if not then don't allow overcommitting. Thanks, Reported-and-tested-by: Jim Schutt <jaschut@sandia.gov> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:13 -05:00
Josef Bacik	0f5d42b287	Btrfs: remove extent mapping if we fail to add chunk I got a double free error when unmounting a file system that failed to add a chunk during its operation. This is because we will kfree the mapping that we created but leave the extent_map in the em_tree for chunks. So to fix this just remove the extent_map when we error out so we don't run into this problem. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:11 -05:00
Josef Bacik	0448748849	Btrfs: fix chunk allocation error handling If we error out allocating a dev extent we will have already created the block group and such which will cause problems since the allocator may have tried to allocate out of the block group that no longer exists. This will cause BUG_ON()'s in the bio submission path. This also makes a failure to allocate a dev extent a non-abort error, we will just clean up the dev extents we did allocate and exit. Now if we fail to delete the dev extents we will abort since we can't have half of the dev extents hanging around, but this will make us much less likely to abort. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:10 -05:00
Miao Xie	87533c4751	Btrfs: use bit operation for ->fs_state There is no lock to protect fs_info->fs_state, it will introduce some problems, such as the value may be covered by the other task when several tasks modify it. For example: Task0 - CPU0 Task1 - CPU1 mov %fs_state rax or $0x1 rax mov %fs_state rax or $0x2 rax mov rax %fs_state mov rax %fs_state The expected value is 3, but in fact, it is 2. Though this problem doesn't happen now (because there is only one flag currently), the code is error prone, if we add other flags, the above problem will happen to a certainty. Now we use bit operation for it to fix the above problem. In this way, we can make the code more robust and be easy to add new flags. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:09 -05:00
Miao Xie	de98ced9e7	Btrfs: use seqlock to protect fs_info->avail_{data, metadata, system}_alloc_bits There is no lock to protect fs_info->avail_{data, metadata, system}_alloc_bits, it may introduce some problem, such as the wrong profile information, so we add a seqlock to protect them. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:08 -05:00
Miao Xie	df0af1a57f	Btrfs: use the inode own lock to protect its delalloc_bytes We need not use a global lock to protect the delalloc_bytes of the inode, just use its own lock. In this way, we can reduce the lock contention and ->delalloc_lock will just protect delalloc inode list. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:06 -05:00
Miao Xie	963d678b0f	Btrfs: use percpu counter for fs_info->delalloc_bytes fs_info->delalloc_bytes is accessed very frequently, so use percpu counter instead of the u64 variant for it to reduce the lock contention. This patch also fixed the problem that we access the variant without the lock protection.At worst, we would not flush the delalloc inodes, and just return ENOSPC error when we still have some free space in the fs. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:05 -05:00
Miao Xie	e2d845211e	Btrfs: use percpu counter for dirty metadata count ->dirty_metadata_bytes is accessed very frequently, so use percpu counter instead of the u64 variant to reduce the contention of the lock. This patch also fixed the problem that we access it without lock protection in __btrfs_btree_balance_dirty(), which may cause we skip the dirty pages flush. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:04 -05:00
Miao Xie	c018daecea	Btrfs: protect fs_info->alloc_start fs_info->alloc_start is a 64bits variant, can be accessed by multi-task, but it is not protected strictly, it can be changed while we are accessing it. On 32bit machine, we will get wrong value because we access it by two instructions.(In fact, it is also possible that the same problem happens on the 64bit machine, because the compiler may split the 64bit operation into two 32bit operation.) For example: Assuming -> alloc_start is 0x0000 0000 0001 0000 at the beginning, then we remount and set ->alloc_start to 0x0000 0100 0000 0000. Task0 Task1 load high 32 bits set high 32 bits set low 32 bits load low 32 bits Task1 will get 0. This patch fixes this problem by using two locks to protect it fs_info->chunk_mutex sb->s_umount On the read side, we just need get one of these two locks, and on the write side, we must lock all of them. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:02 -05:00
Miao Xie	8c6a3ee6db	Btrfs: add a comment for fs_info->max_inline Though ->max_inline is a 64bit variant, and may be accessed by multi-task, but it is just suggestive number, so we needn't add anything to protect fs_info->max_inline, just add a comment to explain wny we don't use a lock to protect it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:01 -05:00
Filipe Brandenburger	55e301fd57	Btrfs: move fs/btrfs/ioctl.h to include/uapi/linux/btrfs.h The header file will then be installed under /usr/include/linux so that userspace applications can refer to Btrfs ioctls by name and use the same structs used internally in the kernel. Signed-off-by: Filipe Brandenburger <filbranden@google.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:37:28 -05:00
Kusanagi Kouichi	82b22ac8f6	Btrfs: Check CAP_DAC_READ_SEARCH for BTRFS_IOC_INO_PATHS CAP_DAC_READ_SEARCH overrides read and search permission check on file and directory. It seems fit for BTRFS_IOC_INO_PATHS. Signed-off-by: Kusanagi Kouichi <slash@ac.auone-net.jp> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:37:27 -05:00
Josef Bacik	fe5fafbebd	Revert "Btrfs: fix permissions of empty files not affected by umask" This reverts commit `2794ed013b`. Wasn't supposed to get used in btrfs_mknod, it was supposed to be in btrfs_create, which was done in commit `9185aa587b`. Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:37:26 -05:00
Miao Xie	5b947f1ba9	Btrfs: don't traverse the ordered operation list repeatedly btrfs_run_ordered_operations() needn't traverse the ordered operation list repeatedly, it is because the transaction commiter will invoke it again when there is no other writer in this transaction, it can ensure that no one can add new objects into the ordered operation list. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:37:24 -05:00
Miao Xie	63607cc86a	Btrfs: traverse and flush the delalloc inodes once btrfs_start_delalloc_inodes() needn't traverse and flush the delalloc inodes repeatedly. It is because we can regard the data that the users write after we start delalloc inodes flush as the one which is after the delalloc inodes flush is done, and we can flush it next time. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:37:23 -05:00
Miao Xie	eebc608406	Btrfs: check the return value of btrfs_run_ordered_operations() We forget to check the return value of btrfs_run_ordered_operations() when flushing all the pending stuffs, fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:37:22 -05:00
Miao Xie	3edb2a68cb	Btrfs: check the return value of btrfs_start_delalloc_inodes() We forget to check the return value of btrfs_start_delalloc_inodes(), fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:37:21 -05:00
Miao Xie	e6ec716f0d	Btrfs: make raid attr array more readable The current code of raid attr arry is hard to understand and it is easy to introduce some problem if we modify the array. So I changed it and made it more readable. Cc: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:37:19 -05:00
Liu Bo	a1897fddd2	Btrfs: record first logical byte in memory This'd save us a rbtree search which may become expensive in large filesystem. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:37:18 -05:00
Liu Bo	39f9d028c9	Btrfs: save us a read_lock This does not change the logic of code, but can save us a read_lock. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:37:17 -05:00
Liu Bo	51fab69347	Btrfs: use token to avoid times mapping extent buffer The API in tree log code has done sort of changes, and it proves that we can benifit from using token, so do the same thing here. function_graph tracer's timer shows that it costs nearly half time of before(39.788us -> 22.391us). Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:37:15 -05:00
Liu Bo	dcfac4156f	Btrfs: kill unused argument of btrfs_pin_extent_for_log_replay Argument 'trans' is not used any more. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:37:14 -05:00
Liu Bo	c53d613e52	Btrfs: kill unused argument of update_block_group Argument 'trans' is not used any more. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:37:13 -05:00
Liu Bo	f6373bf3dc	Btrfs: kill unused arguments of cache_block_group Argument 'trans' and 'root' are not used any more. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:37:11 -05:00
Liu Bo	17b85495cf	Btrfs: remove deprecated comments commit `d53ba47484` (Btrfs: use commit root when loading free space cache) has remove the deadlock check, and the related comments can be removed as well. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:37:10 -05:00
Josef Bacik	c6b305a89b	Btrfs: don't re-enter when allocating a chunk If we start running low on metadata space we will try to allocate a chunk, which could then try to allocate a chunk to add the device entry. The thing is we allocate a chunk before we try really hard to make the allocation, so we should be able to find space for the device entry. Add a flag to the trans handle so we know we're currently allocating a chunk so we can just bail out if we try to allocate another chunk. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:37:09 -05:00
Josef Bacik	2ab28f322f	Btrfs: wait on ordered extents at the last possible moment Since we don't actually copy the extent information from the source tree in the fast case we don't need to wait for ordered io to be completed in order to fsync, we just need to wait for the io to be completed. So when we're logging our file just attach all of the ordered extents to the log, and then when the log syncs just wait for IO_DONE on the ordered extents and then write the super. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:37:04 -05:00
Miao Xie	dfd79829b7	Btrfs: fix trivial error in btrfs_ioctl_resize() This patch fixes the following problem: - improper return value - unnecessary read-only check Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:36:44 -05:00
Miao Xie	4eee4fa4f8	Btrfs: use wrapper page_offset Use wrapper page_offset to get byte-offset into filesystem object for page. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:36:43 -05:00
Miao Xie	da633a4217	Btrfs: flush all dirty inodes if writeback can not start We may try to flush some dirty pages when there is no enough space to reserve. But it is possible that this operation fails, in order to get enough space to reserve successfully, we will sync all the delalloc file. This operation is safe, we needn't worry about the case that the filesystem goes from r/w to r/o. because the filesystem should guarantee all the dirty pages have been written into the disk after it becomes readonly, so the sync operation will do nothing if the filesystem is already readonly. Though it may waste lots of time, as a corner case, we needn't care. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:36:42 -05:00
Miao Xie	093486c453	Btrfs: make delayed ref lock logic more readable Locking and unlocking delayed ref mutex are in the different functions, and the name of lock functions is not uniform, so the readability is not so good, this patch optimizes the lock logic and makes it more readable. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:36:41 -05:00
Miao Xie	0e8c36a9fd	Btrfs: fix lots of orphan inodes when the space is not enough We're running into having 50-100 orphans left over with xfstests 83 because of ENOSPC when trying to start the transaction for the inode update. But in fact, it makes no sense in updating the inode for the new size while we're deleting the stupid thing. This patch fixes this problem. Reported-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:36:39 -05:00
Miao Xie	4ea41ce07d	Btrfs: cleanup similar code in delayed inode The delayed item commit code in several functions is similar, so cleanup it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:36:38 -05:00
Miao Xie	7892b5afe4	Btrfs: use common work instead of delayed work Since we do not want to delay the async transaction commit, we should use common work, not delayed work. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2013-02-20 09:36:37 -05:00
Miao Xie	7b5a1c5310	Btrfs: cleanup unnecessary clear when freeing a transaction or a trans handle We clear the transaction object and the trans handle when they are about to be freed, it is unnecessary, cleanup it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2013-02-20 09:36:35 -05:00
Miao Xie	78a6184a3f	Btrfs: use slabs for delayed reference allocation The delayed reference allocation is in the fast path of the IO, so use slabs to improve the speed of the allocation. And besides that, it can do check for leaked objects when the module is removed. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2013-02-20 09:36:34 -05:00
David Sterba	6f60cbd3ae	btrfs: access superblock via pagecache in scan_one_device btrfs_scan_one_device is calling set_blocksize() which can race with a concurrent process making dirty page cache pages. It can end up dropping dirty page cache pages on the floor, which isn't very nice when someone is just running btrfs dev scan to find filesystems on the box. Now that udev is registering btrfs devices as it discovers them, we can actually end up racing with our own mkfs program too. When this happens, we drop some of the important blocks written by mkfs. This commit changes scan_one_device to read the super out of the page cache instead of trying to use bread. This way we don't have to care about the blocksize of the device. This also drops the invalidate_bdev() call. It wasn't very polite to invalidate during the scan either. mkfs is putting the super into the page cache, there's no reason to invalidate at this point. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-15 16:57:47 -05:00
Arne Jansen	2a745b14bc	Btrfs: fix crash in log replay with qgroups enabled When replaying a log tree with qgroups enabled, tree_mod_log_rewind does a sanity-check of the number of items against the maximum possible number. It calculates that number with the nodesize of fs_root. Unfortunately fs_root is not yet set at this stage. So instead use the nodesize from tree_root, which is already initialized. Signed-off-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-14 20:47:41 -05:00
Linus Torvalds	8d19514fad	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "We've got corner cases for updating i_size that ceph was hitting, error handling for quotas when we run out of space, a very subtle snapshot deletion race, a crash while removing devices, and one deadlock between subvolume creation and the sb_internal code (thanks lockdep)." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: move d_instantiate outside the transaction during mksubvol Btrfs: fix EDQUOT handling in btrfs_delalloc_reserve_metadata Btrfs: fix possible stale data exposure Btrfs: fix missing i_size update Btrfs: fix race between snapshot deletion and getting inode Btrfs: fix missing release of the space/qgroup reservation in start_transaction() Btrfs: fix wrong sync_writers decrement in btrfs_file_aio_write() Btrfs: do not merge logged extents if we've removed them from the tree btrfs: don't try to notify udev about missing devices	2013-02-08 12:06:46 +11:00
Chris Mason	1a65e24b0b	Btrfs: move d_instantiate outside the transaction during mksubvol Dave Sterba triggered a lockdep complaint about lock ordering between the sb_internal lock and the cleaner semaphore. btrfs_lookup_dentry() checks for orphans if we're looking up the inode for a subvolume, and subvolume creation is triggering the lookup with a transaction running. This commit moves the d_instantiate after the transaction closes. Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-06 12:11:10 -05:00
Jan Schmidt	eb6b88d92c	Btrfs: fix EDQUOT handling in btrfs_delalloc_reserve_metadata When btrfs_qgroup_reserve returned a failure, we were missing a counter operation for BTRFS_I(inode)->outstanding_extents++, leading to warning messages about outstanding extents and space_info->bytes_may_use != 0. Additionally, the error handling code didn't take into account that we dropped the inode lock which might require more cleanup. Luckily, all the cleanup code we need is already there and can be shared with reserve_metadata_bytes, which is exactly what this patch does. Reported-by: Lev Vainblat <lev@zadarastorage.com> Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-06 09:24:40 -05:00
Chris Mason	24f8ebe918	Merge git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next.git for-chris into for-linus	2013-02-05 19:24:44 -05:00
Josef Bacik	59fe4f4197	Btrfs: fix possible stale data exposure We specifically do not update the disk i_size if there are ordered extents outstanding for any area between the current disk_i_size and our ordered extent so that we do not expose stale data. The problem is the check we have only checks if the ordered extent starts at or after the current disk_i_size, which doesn't take into account an ordered extent that starts before the current disk_i_size and ends past the disk_i_size. Fix this by checking if the extent ends past the disk_i_size. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-05 16:09:16 -05:00
Josef Bacik	5d1f40202b	Btrfs: fix missing i_size update If we have an ordered extent before the ordered extent we are currently completing that is after the current disk_i_size we will put our i_size update into that ordered extent so that we do not expose stale data. The problem is that if our disk i_size is updated past the previous ordered extent we won't update the i_size with the pending i_size update. So check the pending i_size update and if its above the current disk i_size we need to go ahead and try to update. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-05 16:09:14 -05:00
Liu Bo	6f1c36055f	Btrfs: fix race between snapshot deletion and getting inode While running snapshot testscript created by Mitch and David, the race between autodefrag and snapshot deletion can lead to corruption of dead_root list so that we can get crash on btrfs_clean_old_snapshots(). And besides autodefrag, scrub also does the same thing, ie. read root first and get inode. Here is the story(take autodefrag as an example): (1) when we delete a snapshot or subvolume, it will set its root's refs to zero and do a iput() on its own inode, and if this inode happens to be the only active in-meory one in root's inode rbtree, it will add itself to the global dead_roots list for later cleanup. (2) after (1), the autodefrag thread may read another inode for defrag and the inode is just in the deleted snapshot/subvolume, but all of these are without checking if the root is still valid(refs > 0). So the end up result is adding the deleted snapshot/subvolume's root to the global dead_roots list AGAIN. Fortunately, we already have a srcu lock to avoid the race, ie. subvol_srcu. So all we need to do is to take the lock to protect 'read root and get inode', since we synchronize to wait for the rcu grace period before adding something to the global dead_roots list. Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-05 16:09:13 -05:00
Miao Xie	843fcf3573	Btrfs: fix missing release of the space/qgroup reservation in start_transaction() When we fail to start a transaction, we need to release the reserved free space and qgroup space, fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-05 16:09:11 -05:00
Miao Xie	0a3404dcff	Btrfs: fix wrong sync_writers decrement in btrfs_file_aio_write() If the checks at the beginning of btrfs_file_aio_write() fail, we needn't decrease ->sync_writers, because we have not increased it. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-05 16:09:10 -05:00
Josef Bacik	222c81dc38	Btrfs: do not merge logged extents if we've removed them from the tree You can run into this problem where if somebody is fsyncing and writing out the existing extents you will have removed the extent map from the em tree, but it's still valid for the current fsync so we go ahead and write it. The problem is we unconditionally try to merge it back into the em tree, but if we've removed it from the em tree that will cause use after free problems. Fix this to only merge if we are still a part of the tree. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-05 16:09:03 -05:00
Chris Mason	0e4e026366	Merge branch 'for-linus' into raid56-experimental Conflicts: fs/btrfs/volumes.c Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-05 10:04:03 -05:00
Chris Mason	1f0905ec15	Btrfs: remove conflicting check for minimum number of devices in raid56 The device removal code was incorrectly checking against two different limits for raid5 and raid6. Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-05 10:01:42 -05:00
Tomasz Torcz	10e78e3a8a	Btrfs: select XOR_BLOCKS in Kconfig The Btrfs raid56 uses the generic xor helpers. Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-05 09:55:30 -05:00
Linus Torvalds	fe547d7714	Merge branch 'fix-max-write' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm Pull dlm fix from David Teigland: "Thanks to Jana who reported the problem and was able to test this fix so quickly." This fixes an incorrect size check that triggered for CONFIG_COMPAT whether the code was actually doing compat or not. The incorrect write size check broke userland (clvmd) when maximum resource name lengths are used. * 'fix-max-write' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm: dlm: check the write size from user	2013-02-05 20:50:11 +11:00
Vyacheslav Dubeyko	a9bae18954	nilfs2: fix fix very long mount time issue There exists a situation when GC can work in background alone without any other filesystem activity during significant time. The nilfs_clean_segments() method calls nilfs_segctor_construct() that updates superblocks in the case of NILFS_SC_SUPER_ROOT and THE_NILFS_DISCONTINUED flags are set. But when GC is working alone the nilfs_clean_segments() is called with unset THE_NILFS_DISCONTINUED flag. As a result, the update of superblocks doesn't occurred all this time and in the case of SPOR superblocks keep very old values of last super root placement. SYMPTOMS: Trying to mount a NILFS2 volume after SPOR in such environment ends with very long mounting time (it can achieve about several hours in some cases). REPRODUCING PATH: 1. It needs to use external USB HDD, disable automount and doesn't make any additional filesystem activity on the NILFS2 volume. 2. Generate temporary file with size about 100 - 500 GB (for example, dd if=/dev/zero of=<file_name> bs=1073741824 count=200). The size of file defines duration of GC working. 3. Then it needs to delete file. 4. Start GC manually by means of command "nilfs-clean -p 0". When you start GC by means of such way then, at the end, superblocks is updated by once. So, for simulation of SPOR, it needs to wait sometime (15 - 40 minutes) and simply switch off USB HDD manually. 5. Switch on USB HDD again and try to mount NILFS2 volume. As a result, NILFS2 volume will mount during very long time. REPRODUCIBILITY: 100% FIX: This patch adds checking that superblocks need to update and set THE_NILFS_DISCONTINUED flag before nilfs_clean_segments() call. Reported-by: Sergey Alexandrov <splavgm@gmail.com> Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com> Tested-by: Vyacheslav Dubeyko <slava@dubeyko.com> Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Tested-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-02-05 20:38:46 +11:00
David Teigland	d4b0bcf32b	dlm: check the write size from user Return EINVAL from write if the size is larger than allowed. Do this before allocating kernel memory for the bogus size, which could lead to OOM. Reported-by: Sasha Levin <levinsasha928@gmail.com> Tested-by: Jana Saout <jana@saout.de> Signed-off-by: David Teigland <teigland@redhat.com>	2013-02-04 15:31:22 -06:00
Chris Mason	bb721703aa	Btrfs: reduce CPU contention while waiting for delayed extent operations We batch up operations to the extent allocation tree, which allows us to deal with the recursive nature of using the extent allocation tree to allocate extents to the extent allocation tree. It also provides a mechanism to sort and collect extent operations, which makes it much more efficient to record extents that are close together. The delayed extent operations must all be finished before the running transaction commits, so we have code to make sure and run a few of the batched operations when closing our transaction handles. This creates a great deal of contention for the locks in the delayed extent operation tree, and also contention for the lock on the extent allocation tree itself. All the extra contention just slows down the operations and doesn't get things done any faster. This commit changes things to use a wait queue instead. As procs want to run the delayed operations, one of them races in and gets permission to hit the tree, and the others step back and wait for progress to be made. Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-01 14:24:25 -05:00
Chris Mason	242e18c7c1	Btrfs: reduce lock contention on extent buffer locks The extent buffers have a refs_lock which we use to make coordinate freeing the extent buffer with operations on the radix tree. On tree roots and other extent buffers that very cache hot, this can be highly contended. These are also the extent buffers that are basically pinned in memory. This commit adds code to cmpxchg our way through the ref modifications, and as long as the result of the reference change is still pinned in ram, we skip the expensive spinlock. Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-01 14:24:25 -05:00
Chris Mason	8de972b4fa	Btrfs: fix cluster alignment for mount -o ssd With the new raid56 code, we want to make sure we're properly aligning our allocation clusters with -o ssd Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-01 14:24:24 -05:00
Chris Mason	6ac0f4884e	Btrfs: add a plugging callback to raid56 writes Buffered writes and DIRECT_IO writes will often break up big contiguous changes to the file into sub-stripe writes. This adds a plugging callback to gather those smaller writes full stripe writes. Example on flash: fio job to do 64K writes in batches of 3 (which makes a full stripe): With plugging: 450MB/s Without plugging: 220MB/s Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-01 14:24:24 -05:00
Chris Mason	4ae10b3a13	Btrfs: Add a stripe cache to raid56 The stripe cache allows us to avoid extra read/modify/write cycles by caching the pages we read off the disk. Pages are cached when: * They are read in during a read/modify/write cycle * They are written during a read/modify/write cycle * They are involved in a parity rebuild Pages are not cached if we're doing a full stripe write. We're assuming that a full stripe write won't be followed by another partial stripe write any time soon. This provides a substantial boost in performance for workloads that synchronously modify adjacent offsets in the file, and for the parity rebuild use case in general. The size of the stripe cache isn't tunable (yet) and is set at 1024 entries. Example on flash: dd if=/dev/zero of=/mnt/xxx bs=4K oflag=direct Without the stripe cache -- 2.1MB/s With the stripe cache 21MB/s Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-01 14:24:23 -05:00
David Woodhouse	53b381b3ab	Btrfs: RAID5 and RAID6 This builds on David Woodhouse's original Btrfs raid5/6 implementation. The code has changed quite a bit, blame Chris Mason for any bugs. Read/modify/write is done after the higher levels of the filesystem have prepared a given bio. This means the higher layers are not responsible for building full stripes, and they don't need to query for the topology of the extents that may get allocated during delayed allocation runs. It also means different files can easily share the same stripe. But, it does expose us to incorrect parity if we crash or lose power while doing a read/modify/write cycle. This will be addressed in a later commit. Scrub is unable to repair crc errors on raid5/6 chunks. Discard does not work on raid5/6 (yet) The stripe size is fixed at 64KiB per disk. This will be tunable in a later commit. Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-01 14:24:23 -05:00
David Woodhouse	64a167011b	Btrfs: add rw argument to merge_bio_hook() We'll want to merge writes so they can fill a full RAID[56] stripe, but not necessarily reads. Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-01 11:49:47 -05:00
Eric Sandeen	3c91160808	btrfs: don't try to notify udev about missing devices If we remove a missing device, bdev is null, and if we send that off to btrfs_kobject_uevent we'll panic. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-01 11:47:37 -05:00
Linus Torvalds	bf6c8a8148	NFS client bugfixe for Linux 3.8 - Error reporting in nfs_xdev_mount incorrectly maps all errors to ENOMEM - Fix an NFSv4 refcounting issue - Fix a mount failure when the server reboots during NFSv4 trunking discovery - NFSv4.1 mounts may need to run the lease recovery thread. - Don't silently fail setattr() requests on mountpoints - Fix a SUNRPC socket/transport livelock and priority queue issue - We must handle NFS4ERR_DELAY when resetting the NFSv4.1 session. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.13 (GNU/Linux) iQIcBAABAgAGBQJRCpS4AAoJEGcL54qWCgDyqucP/2CTv5leu+X5/0PXBOykAIHg 8oEsTEz7/4IvIxTXzuHDYirMnm/mulfGF6NrdPqfvxpAHqRfVBfLFocfNLMVhQci 97RmBfEEGM22AToYUubML5bIxr0QllV4s9Vmyh/zGDan52y7zNNlZX+v6aLjZbJB Fbolihpcch6lhQEUNAzK0B0ddimDl9lazx/WTmMOD/JrwOqzA4FJC+YxBe88nfzQ c6sYyEptBaSirbCOlueqGpv8skB1CLpFJXguXToPXFxpWed6uoGrIwLO7MLUdFpJ Xw+j8cuv/wjyYJGVKjhW7kXtwK8T7+u4bT2L883R01XYXr8XfkkLON0dgG1X/unk 80mLzCO1+qRdoDSQ4b/V4B0nScPRCJuoZpftjCi2uhKewNcxQPMZ2V5/D7pO3uyE NdhxByB8D86JfNrIcBcRaxfuiQsurQBDvsDNmWPBZSOmH/dmHqTQGLcIe6N94A0B c7KFtXrN2MzOl8S68dUpbhftObq9X0oK2oxFFLWRQoqjrFtiDLU5JilV0bEH2CMo gJX7CRrPoJ2wmNToKdPJRWBYDmtMMIThq3vIpCj1FFzX17r5grJpqCmp/TViu/ER r8rmJzni+nmHaO1NLJMTCHzLJ8soiKyBq8PZKAbipTJxy+TXI/69jnWzpIQ/pbM/ JN+0tiCcqmUmXsO+hrCP =lWFV -----END PGP SIGNATURE----- Merge tag 'nfs-for-3.8-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs Pull NFS client bugfixes from Trond Myklebust: - Error reporting in nfs_xdev_mount incorrectly maps all errors to ENOMEM - Fix an NFSv4 refcounting issue - Fix a mount failure when the server reboots during NFSv4 trunking discovery - NFSv4.1 mounts may need to run the lease recovery thread. - Don't silently fail setattr() requests on mountpoints - Fix a SUNRPC socket/transport livelock and priority queue issue - We must handle NFS4ERR_DELAY when resetting the NFSv4.1 session. * tag 'nfs-for-3.8-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFSv4.1: Handle NFS4ERR_DELAY when resetting the NFSv4.1 session SUNRPC: When changing the queue priority, ensure that we change the owner NFS: Don't silently fail setattr() requests on mountpoints NFSv4.1: Ensure that nfs41_walk_client_list() does start lease recovery NFSv4: Fix NFSv4 trunking discovery NFSv4: Fix NFSv4 reference counting for trunked sessions NFS: Fix error reporting in nfs_xdev_mount	2013-02-01 08:43:52 +11:00
Trond Myklebust	c489ee290b	NFSv4.1: Handle NFS4ERR_DELAY when resetting the NFSv4.1 session NFS4ERR_DELAY is a legal reply when we call DESTROY_SESSION. It usually means that the server is busy handling an unfinished RPC request. Just sleep for a second and then retry. We also need to be able to handle the NFS4ERR_BACK_CHAN_BUSY return value. If the NFS server has outstanding callbacks, we just want to similarly sleep & retry. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@vger.kernel.org	2013-01-30 17:45:15 -05:00
Trond Myklebust	ab22541782	NFS: Don't silently fail setattr() requests on mountpoints Ensure that any setattr and getattr requests for junctions and/or mountpoints are sent to the server. Ever since commit `0ec26fd069` (vfs: automount should ignore LOOKUP_FOLLOW), we have silently dropped any setattr requests to a server-side mountpoint. For referrals, we have silently dropped both getattr and setattr requests. This patch restores the original behaviour for setattr on mountpoints, and tries to do the same for referrals, provided that we have a filehandle... Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@vger.kernel.org	2013-01-30 17:41:04 -05:00
Linus Torvalds	f96736e1ba	xfs: bugfixes for 3.8-rc6 - fix return value when filesystem probe finds no XFS magic, a regression introduced in `9802182`. - fix stack switch in __xfs_bmapi_allocate by moving the check for stack switch up into xfs_bmapi_write. - fix oops in _xfs_buf_find by validating that the requested block is within the filesystem bounds. - limit speculative preallocation near ENOSPC. - fix an unmount hang in xfs_wait_buftarg by freeing the xfs_buf_log_item in xfs_buf_item_unlock. - fix a possible use after free with AIO. - fix xfs_swap_extents after removal of xfs_flushinval_pages, a regression introduced in `fb59581404`. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) iQIcBAABAgAGBQJRBvmgAAoJENaLyazVq6ZOOacP/RilCPsi41NkJqRx1Rs5aRGE UvinrHfAL/tBupS2JVo1niIilBNJG/cI+lLcpV/P5omLBJfpEu0trzZUSxU7S1Vc a2M8J0qhmKfBcl70fuCALAxPY52895y+44gufxaH0O5HQDN6tB8n4MMqYGPmS8hz Ul/q3MO601hVyBHaoYa7BNGS3YG0TCdFGtWcC5tQaR3v7upTLR2ouZrGQ8CV0BBa Ek1xdxLh4D0fRybSL7lUw64W957iyldoLsEg+zQrE9NSfTE8DSqUG+NPWB0wjPce ICtmO6TbE5c6q1ScOL3YCC2cmYvjR9mlAHnPy73SqWSIsTqUsVzdibNo+tUJJZ5r RZf3u6Uri6uKC6Hl4XEtg4LVnnquKosTXfoiHmn+eh0dhYL7sZG0Ya5we5pH5Tmi P6B2DlfUA1fj4Ne4Asx2d7mwOJaZcLHDZoeCs/Haz2Z6kGVEm7ImyAb1h76uNOZo l0NFhXJGcOQLyjPtQjl81SGjQmntiIN0Poia3528zjxxGXlBNwAwalkOtdJnk5iN IaYRKtvIcrdjFvunasiKZIsV/O9w3/mguXlrSqBDgUsKPUc/cq5vLfsa70jYGc2j M6ldJRRqTvSjkVXc/7SXv4GLt/qbUWa92ESzhZQXEABIjZJUnOZpZuLmSVdRUyZk +SaGvbMphE0U/4ps3pO/ =wViN -----END PGP SIGNATURE----- Merge tag 'for-linus-v3.8-rc6' of git://oss.sgi.com/xfs/xfs Pull xfs bugfixes from Ben Myers: "Here are fixes for returning EFSCORRUPTED on probe of a non-xfs filesystem, the stack switch in xfs_bmapi_allocate, a crash in _xfs_buf_find, speculative preallocation as the filesystem nears ENOSPC, an unmount hang, a race with AIO, and a regression with xfs_fsr: - fix return value when filesystem probe finds no XFS magic, a regression introduced in `9802182`. - fix stack switch in __xfs_bmapi_allocate by moving the check for stack switch up into xfs_bmapi_write. - fix oops in _xfs_buf_find by validating that the requested block is within the filesystem bounds. - limit speculative preallocation near ENOSPC. - fix an unmount hang in xfs_wait_buftarg by freeing the xfs_buf_log_item in xfs_buf_item_unlock. - fix a possible use after free with AIO. - fix xfs_swap_extents after removal of xfs_flushinval_pages, a regression introduced in commit fb59581404a." * tag 'for-linus-v3.8-rc6' of git://oss.sgi.com/xfs/xfs: xfs: Fix xfs_swap_extents() after removal of xfs_flushinval_pages() xfs: Fix possible use-after-free with AIO xfs: fix shutdown hang on invalid inode during create xfs: limit speculative prealloc near ENOSPC thresholds xfs: fix _xfs_buf_find oops on blocks beyond the filesystem end xfs: pull up stack_switch check into xfs_bmapi_write xfs: Do not return EFSCORRUPTED when filesystem probe finds no XFS magic	2013-01-30 11:59:37 +11:00
Torsten Kaiser	65e3aa77f1	xfs: Fix xfs_swap_extents() after removal of xfs_flushinval_pages() Commit `fb59581404` removed xfs_flushinval_pages() and changed its callers to use filemap_write_and_wait() and truncate_pagecache_range() directly. But in xfs_swap_extents() this change accidental switched the argument for 'tip' to 'ip'. This patch switches it back to 'tip' Signed-off-by: Torsten Kaiser <just.for.lkml@googlemail.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-01-28 16:05:10 -06:00
Jan Kara	4b05d09c18	xfs: Fix possible use-after-free with AIO Running AIO is pinning inode in memory using file reference. Once AIO is completed using aio_complete(), file reference is put and inode can be freed from memory. So we have to be sure that calling aio_complete() is the last thing we do with the inode. CC: xfs@oss.sgi.com CC: Ben Myers <bpm@sgi.com> CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-01-28 12:51:22 -06:00
Dave Chinner	9f87832a82	xfs: fix shutdown hang on invalid inode during create When the new inode verify in xfs_iread() fails, the create transaction is aborted and a shutdown occurs. The subsequent unmount then hangs in xfs_wait_buftarg() on a buffer that has an elevated hold count. Debug showed that it was an AGI buffer getting stuck: [ 22.576147] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck [ 22.976213] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck [ 23.376206] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck [ 23.776325] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck The trace of this buffer leading up to the shutdown (trimmed for brevity) looks like: xfs_buf_init: bno 0x2 nblks 0x1 hold 1 caller xfs_buf_get_map xfs_buf_get: bno 0x2 len 0x200 hold 1 caller xfs_buf_read_map xfs_buf_read: bno 0x2 len 0x200 hold 1 caller xfs_trans_read_buf_map xfs_buf_iorequest: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read xfs_buf_hold: bno 0x2 nblks 0x1 hold 1 caller xfs_buf_iorequest xfs_buf_rele: bno 0x2 nblks 0x1 hold 2 caller xfs_buf_iorequest xfs_buf_iowait: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read xfs_buf_ioerror: bno 0x2 len 0x200 hold 1 caller xfs_buf_bio_end_io xfs_buf_iodone: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_ioend xfs_buf_iowait_done: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read xfs_buf_hold: bno 0x2 nblks 0x1 hold 1 caller xfs_buf_item_init xfs_trans_read_buf: bno 0x2 len 0x200 hold 2 recur 0 refcount 1 xfs_trans_brelse: bno 0x2 len 0x200 hold 2 recur 0 refcount 1 xfs_buf_item_relse: bno 0x2 nblks 0x1 hold 2 caller xfs_trans_brelse xfs_buf_rele: bno 0x2 nblks 0x1 hold 2 caller xfs_buf_item_relse xfs_buf_unlock: bno 0x2 nblks 0x1 hold 1 caller xfs_trans_brelse xfs_buf_rele: bno 0x2 nblks 0x1 hold 1 caller xfs_trans_brelse xfs_buf_trylock: bno 0x2 nblks 0x1 hold 2 caller _xfs_buf_find xfs_buf_find: bno 0x2 len 0x200 hold 2 caller xfs_buf_get_map xfs_buf_get: bno 0x2 len 0x200 hold 2 caller xfs_buf_read_map xfs_buf_read: bno 0x2 len 0x200 hold 2 caller xfs_trans_read_buf_map xfs_buf_hold: bno 0x2 nblks 0x1 hold 2 caller xfs_buf_item_init xfs_trans_read_buf: bno 0x2 len 0x200 hold 3 recur 0 refcount 1 xfs_trans_log_buf: bno 0x2 len 0x200 hold 3 recur 0 refcount 1 xfs_buf_item_unlock: bno 0x2 len 0x200 hold 3 flags DIRTY liflags ABORTED xfs_buf_unlock: bno 0x2 nblks 0x1 hold 3 caller xfs_buf_item_unlock xfs_buf_rele: bno 0x2 nblks 0x1 hold 3 caller xfs_buf_item_unlock And that is the AGI buffer from cold cache read into memory to transaction abort. You can see at transaction abort the bli is dirty and only has a single reference. The item is not pinned, and it's not in the AIL. Hence the only reference to it is this transaction. The problem is that the xfs_buf_item_unlock() call is dropping the last reference to the xfs_buf_log_item attached to the buffer (which holds a reference to the buffer), but it is not freeing the xfs_buf_log_item. Hence nothing will ever release the buffer, and the unmount hangs waiting for this reference to go away. The fix is simple - xfs_buf_item_unlock needs to detect the last reference going away in this case and free the xfs_buf_log_item to release the reference it holds on the buffer. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-01-28 12:51:12 -06:00
Dave Chinner	f2a459565b	xfs: limit speculative prealloc near ENOSPC thresholds There is a window on small filesytsems where specualtive preallocation can be larger than that ENOSPC throttling thresholds, resulting in specualtive preallocation trying to reserve more space than there is space available. This causes immediate ENOSPC to be triggered, prealloc to be turned off and flushing to occur. One the next write (i.e. next 4k page), we do exactly the same thing, and so effective drive into synchronous 4k writes by triggering ENOSPC flushing on every page while in the window between the prealloc size and the ENOSPC prealloc throttle threshold. Fix this by checking to see if the prealloc size would consume all free space, and throttle it appropriately to avoid premature ENOSPC... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-01-28 12:50:50 -06:00
Dave Chinner	eb178619f9	xfs: fix _xfs_buf_find oops on blocks beyond the filesystem end When _xfs_buf_find is passed an out of range address, it will fail to find a relevant struct xfs_perag and oops with a null dereference. This can happen when trying to walk a filesystem with a metadata inode that has a partially corrupted extent map (i.e. the block number returned is corrupt, but is otherwise intact) and we try to read from the corrupted block address. In this case, just fail the lookup. If it is readahead being issued, it will simply not be done, but if it is real read that fails we will get an error being reported. Ideally this case should result in an EFSCORRUPTED error being reported, but we cannot return an error through xfs_buf_read() or xfs_buf_get() so this lookup failure may result in ENOMEM or EIO errors being reported instead. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-01-28 12:49:21 -06:00
Brian Foster	d26978dd86	xfs: pull up stack_switch check into xfs_bmapi_write The stack_switch check currently occurs in __xfs_bmapi_allocate, which means the stack switch only occurs when xfs_bmapi_allocate() is called in a loop. Pull the check up before the loop in xfs_bmapi_write() such that the first iteration of the loop has consistent behavior. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-01-28 12:48:55 -06:00
Eric Sandeen	1bee12b8c4	xfs: Do not return EFSCORRUPTED when filesystem probe finds no XFS magic `9802182` changed the return value from EWRONGFS (aka EINVAL) to EFSCORRUPTED which doesn't seem to be handled properly by the root filesystem probe. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Tested-by: Sergei Trofimovich <slyfox@gentoo.org> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-01-28 12:48:21 -06:00
David Teigland	d4e0bfec9b	GFS2: fix skip unlock condition The recent commit `fb6791d100` included the wrong logic. The lvbptr check was incorrectly added after the patch was tested. Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2013-01-28 09:49:15 +00:00
Trond Myklebust	65436ec0c8	NFSv4.1: Ensure that nfs41_walk_client_list() does start lease recovery We do need to start the lease recovery thread prior to waiting for the client initialisation to complete in NFSv4.1. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Ben Greear <greearb@candelatech.com> Cc: stable@vger.kernel.org [>=3.7]	2013-01-27 15:51:41 -05:00
Trond Myklebust	202c312dba	NFSv4: Fix NFSv4 trunking discovery If walking the list in nfs4[01]_walk_client_list fails, then the most likely explanation is that the server dropped the clientid before we actually managed to confirm it. As long as our nfs_client is the very last one in the list to be tested, the caller can be assured that this is the case when the final return value is NFS4ERR_STALE_CLIENTID. Reported-by: Ben Greear <greearb@candelatech.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: stable@vger.kernel.org [>=3.7] Tested-by: Ben Greear <greearb@candelatech.com>	2013-01-27 15:51:28 -05:00
Trond Myklebust	4ae19c2dd7	NFSv4: Fix NFSv4 reference counting for trunked sessions The reference counting in nfs4_init_client assumes wongly that it is safe for nfs4_discover_server_trunking() to return a pointer to a nfs_client prior to bumping the reference count. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Ben Greear <greearb@candelatech.com> Cc: stable@vger.kernel.org [>=3.7]	2013-01-27 15:51:15 -05:00
Trond Myklebust	dee972b967	NFS: Fix error reporting in nfs_xdev_mount Currently, nfs_xdev_mount converts all errors from clone_server() to ENOMEM, which can then leak to userspace (for instance to 'mount'). Fix that. Also ensure that if nfs_fs_mount_common() returns an error, we don't dprintk(0)... The regression originated in commit `3d176e3fe4` (NFS: Use nfs_fs_mount_common() for xdev mounts) Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@vger.kernel.org [>= 3.5]	2013-01-27 15:51:15 -05:00
Linus Torvalds	d7df025eb4	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "It turns out that we had two crc bugs when running fsx-linux in a loop. Many thanks to Josef, Miao Xie, and Dave Sterba for nailing it all down. Miao also has a new OOM fix in this v2 pull as well. Ilya fixed a regression Liu Bo found in the balance ioctls for pausing and resuming a running balance across drives. Josef's orphan truncate patch fixes an obscure corruption we'd see during xfstests. Arne's patches address problems with subvolume quotas. If the user destroys quota groups incorrectly the FS will refuse to mount. The rest are smaller fixes and plugs for memory leaks." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (30 commits) Btrfs: fix repeated delalloc work allocation Btrfs: fix wrong max device number for single profile Btrfs: fix missed transaction->aborted check Btrfs: Add ACCESS_ONCE() to transaction->abort accesses Btrfs: put csums on the right ordered extent Btrfs: use right range to find checksum for compressed extents Btrfs: fix panic when recovering tree log Btrfs: do not allow logged extents to be merged or removed Btrfs: fix a regression in balance usage filter Btrfs: prevent qgroup destroy when there are still relations Btrfs: ignore orphan qgroup relations Btrfs: reorder locks and sanity checks in btrfs_ioctl_defrag Btrfs: fix unlock order in btrfs_ioctl_rm_dev Btrfs: fix unlock order in btrfs_ioctl_resize Btrfs: fix "mutually exclusive op is running" error code Btrfs: bring back balance pause/resume logic btrfs: update timestamps on truncate() btrfs: fix btrfs_cont_expand() freeing IS_ERR em Btrfs: fix a bug when llseek for delalloc bytes behind prealloc extents Btrfs: fix off-by-one in lseek ...	2013-01-25 10:55:21 -08:00
Linus Torvalds	66e2d3e8c2	Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6 Pull cifs fixes from Steve French: "Two small cifs fixes" * 'for-next' of git://git.samba.org/sfrench/cifs-2.6: fs/cifs/cifs_dfs_ref.c: fix potential memory leakage cifs: fix srcip_matches() for ipv6	2013-01-24 19:15:43 -08:00
Miao Xie	1eafa6c737	Btrfs: fix repeated delalloc work allocation btrfs_start_delalloc_inodes() locks the delalloc_inodes list, fetches the first inode, unlocks the list, triggers btrfs_alloc_delalloc_work/ btrfs_queue_worker for this inode, and then it locks the list, checks the head of the list again. But because we don't delete the first inode that it deals with before, it will fetch the same inode. As a result, this function allocates a huge amount of btrfs_delalloc_work structures, and OOM happens. Fix this problem by splice this delalloc list. Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-01-24 12:51:27 -05:00
Miao Xie	c9f01bfe0c	Btrfs: fix wrong max device number for single profile The max device number of single profile is 1, not 0 (0 means 'as many as possible'). Fix it. Cc: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-01-24 12:51:26 -05:00
Miao Xie	2cba30f172	Btrfs: fix missed transaction->aborted check First, though the current transaction->aborted check can stop the commit early and avoid unnecessary operations, it is too early, and some transaction handles don't end, those handles may set transaction->aborted after the check. Second, when we commit the transaction, we will wake up some worker threads to flush the space cache and inode cache. Those threads also allocate some transaction handles and may set transaction->aborted if some serious error happens. So we need more check for ->aborted when committing the transaction. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-01-24 12:51:25 -05:00
Miao Xie	8d25a086eb	Btrfs: Add ACCESS_ONCE() to transaction->abort accesses We may access and update transaction->aborted on the different CPUs without lock, so we need ACCESS_ONCE() wrapper to prevent the compiler from creating unsolicited accesses and make sure we can get the right value. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-01-24 12:51:23 -05:00
Josef Bacik	e58dd74bcc	Btrfs: put csums on the right ordered extent I noticed a WARN_ON going off when adding csums because we were going over the amount of csum bytes that should have been allowed for an ordered extent. This is a leftover from when we used to hold the csums privately for direct io, but now we use the normal ordered sum stuff so we need to make sure and check if we've moved on to another extent so that the csums are added to the right extent. Without this we could end up with csums for bytenrs that don't have extents to cover them yet. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-01-24 12:51:22 -05:00
Liu Bo	192000dda2	Btrfs: use right range to find checksum for compressed extents For compressed extents, the range of checksum is covered by disk length, and the disk length is different with ram length, so we need to use disk length instead to get us the right checksum. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-01-24 12:51:17 -05:00
Josef Bacik	b0175117b9	Btrfs: fix panic when recovering tree log A user reported a BUG_ON(ret) that occured during tree log replay. Ret was -EAGAIN, so what I think happened is that we removed an extent that covered a bitmap entry and an extent entry. We remove the part from the bitmap and return -EAGAIN and then search for the next piece we want to remove, which happens to be an entire extent entry, so we just free the sucker and return. The problem is ret is still set to -EAGAIN so we trip the BUG_ON(). The user used btrfs-zero-log so I'm not 100% sure this is what happened so I've added a WARN_ON() to catch the other possibility. Thanks, Reported-by: Jan Steffens <jan.steffens@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-01-24 12:49:49 -05:00
Josef Bacik	201a903894	Btrfs: do not allow logged extents to be merged or removed We drop the extent map tree lock while we're logging extents, so somebody could come in and merge another extent into this one and screw up our logging, or they could even remove us from the list which would keep us from logging the extent or freeing our ref on it, so we need to make sure to not clear LOGGING until after the extent is logged, and then we can merge it to adjacent extents. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-01-24 12:49:48 -05:00
Cong Ding	10b8c7dff5	fs/cifs/cifs_dfs_ref.c: fix potential memory leakage When it goes to error through line 144, the memory allocated to devname is not freed, and the caller doesn't free it either in line 250. So we free the memroy of devname in function cifs_compose_mount_options() when it goes to error. Signed-off-by: Cong Ding <dinggnu@gmail.com> CC: stable <stable@kernel.org> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>	2013-01-22 23:58:16 -06:00
Linus Torvalds	262060ea46	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse fixes from Miklos Szeredi: "This contain a bugfix for CUSE and miscellaneous small fixes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: remove unused variable in fuse_try_move_page() fuse: make fuse_file_fallocate() static fuse: Move CUSE Kconfig entry from fs/Kconfig into fs/fuse/Kconfig cuse: fix uninitialized variable warnings cuse: do not register multiple devices with identical names cuse: use mutex as registration lock instead of spinlocks	2013-01-22 11:53:19 -08:00
Linus Torvalds	05c2cf35c3	f2fs fixes for v3.8-rc5 o Support swap file and link generic_file_remap_pages o Enhance the bio streaming flow and free section control o Major bug fix on recovery routine o Minor bug/warning fixes and code cleanups -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) iQIcBAABAgAGBQJQ/fpIAAoJEEAUqH6CSFDS6BAP/jz/JIPoSu4NunWgVH68tzA4 xx4lsmJQJQG+1241uA9v4MMkcTzxE0QVNTOX3BpUXBhCMc00aPOPWlWjULridLI9 0vRE6LpG+NntkyKF+M5T1ydGuQoDlEvqoGs6c5p3yaI6PxzbzmBmipsmqXUA8fyu 280+OWJoAcALEMJiQ8JsHcDmvM9wdQ+BV/j3BNCm4dqBUA4dYPfDzRKUJYfwqiig qmVRseJJaekxrQ2lHG/K/WPAXa8aRcV6khP9tv/BPGRMt+I/fli/J4sWEFT6c73B +qYmhrf/RLNJ1O13dePlo3URwMu083PL8QN355GAKUJbMaX/UPjEnq0DLbBOkCS2 KBQI5O1eiFauEE6YU7p7GuvnLeVkukcXSQNVnRrnWzTUA9CMThZZ2mAgb2lz+iEP oZWNirRwnxdcTQRXjPyTCtpPTCgJi4GO1WS5s6HeLP8G8Muo4PzDzcEwMX0Mw5ih s4n4wpQ+Zp3h53cc/DxdFAK15uM3XYtUfb92kJwqaEG5VmBy6KvliXDRnNXg7WMI imCb08c0Fr0M8ZHOd+UCveICcndFj25jkjx8w7PoE1KBbGJkKf+cjpZ3OhOAliux sGtH3EZLV6jx1MjV79OpDXQGoVsvktesCbRcaxc7BbqljXYVNiap8QbzjPnmAt6z KKN0GU32eolGgK4Zd/iF =vJQc -----END PGP SIGNATURE----- Merge tag 'f2fs-for-3.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs fixes from Jaegeuk Kim: o Support swap file and link generic_file_remap_pages o Enhance the bio streaming flow and free section control o Major bug fix on recovery routine o Minor bug/warning fixes and code cleanups * tag 'f2fs-for-3.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (22 commits) f2fs: use _safe() version of list_for_each f2fs: add comments of start_bidx_of_node f2fs: avoid issuing small bios due to several dirty node pages f2fs: support swapfile f2fs: add remap_pages as generic_file_remap_pages f2fs: add __init to functions in init_f2fs_fs f2fs: fix the debugfs entry creation path f2fs: add global mutex_lock to protect f2fs_stat_list f2fs: remove the blk_plug usage in f2fs_write_data_pages f2fs: avoid redundant time update for parent directory in f2fs_delete_entry f2fs: remove redundant call to set_blocksize in f2fs_fill_super f2fs: move f2fs_balance_fs to punch_hole f2fs: add f2fs_balance_fs in several interfaces f2fs: revisit the f2fs_gc flow f2fs: check return value during recovery f2fs: avoid null dereference in f2fs_acl_from_disk f2fs: initialize newly allocated dnode structure f2fs: update f2fs partition info about SIT/NAT layout f2fs: update f2fs document to reflect SIT/NAT layout correctly f2fs: remove unneeded INIT_LIST_HEAD at few places ...	2013-01-22 10:33:17 -08:00
Dan Carpenter	d8b79b2f94	f2fs: use _safe() version of list_for_each This is calling list_del() inside a loop which is a problem when we try move to the next item on the list. I've converted it to use the _safe version. And also, as a cleanup, I've converted it to use list_for_each_entry instead of list_for_each. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>	2013-01-22 10:49:00 +09:00
Jaegeuk Kim	9af45ef5ab	f2fs: add comments of start_bidx_of_node The caller of start_bidx_of_node() should give proper node offsets which point only direct node blocks. Otherwise, it is a caller's bug. This patch adds comments to make it clear. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>	2013-01-22 10:48:59 +09:00
Jaegeuk Kim	a7fdffbd3e	f2fs: avoid issuing small bios due to several dirty node pages If some small bios of dirty node pages are supposed to be issued during the sequential data writes, there-in well-produced consecutive data bios are able to be split by the small node bios, resulting in performance degradation. So, let's collect a number of dirty node pages until reaching a threshold. And, by default, I set the threshold as 2MB, a segment size. This improves sequential write performance on i5, 512GB SSD (830 w/ SATA2) as follows. Before: 231 MB/s -> After: 255 MB/s Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com> Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com>	2013-01-22 10:48:59 +09:00

1 2 3 4 5 ...

30276 Commits