Commit Graph

2942 Commits

Author SHA1 Message Date
Josef Bacik
de0022b9da Btrfs: do not async metadata csumming in certain situations
There are a coule scenarios where farming metadata csumming off to an async
thread doesn't help.  The first is if our processor supports crc32c, in
which case the csumming will be fast and so the overhead of the async model
is not worth the cost.  The other case is for our tree log.  We will be
making that stuff dirty and writing it out and waiting for it immediately.
Even with software crc32c this gives me a ~15% increase in speed with O_SYNC
workloads.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-09 09:15:40 -04:00
Zach Brown
221b831835 btrfs: fix min csum item size warnings in 32bit
commit 7ca4be45a0 limited csum items to
PAGE_CACHE_SIZE.  It used min() with incompatible types in 32bit which
generates warnings:

fs/btrfs/file-item.c: In function ‘btrfs_csum_file_blocks’:
fs/btrfs/file-item.c:717: warning: comparison of distinct pointer types lacks a cast

This uses min_t(u32,) to fix the warnings.  u32 seemed reasonable
because btrfs_root->leafsize is u32 and PAGE_CACHE_SIZE is unsigned
long.

Signed-off-by: Zach Brown <zab@zabbo.net>
2012-10-09 09:15:39 -04:00
Josef Bacik
67b0fd63d5 Btrfs: run delayed refs first when out of space
Running delayed refs is faster than running delalloc, so lets do that first
to try and reclaim space.  This makes my fs_mark test about 20% faster.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-09 09:15:39 -04:00
Miao Xie
354aa0fb6d Btrfs: fix orphan transaction on the freezed filesystem
With the following debug patch:

 static int btrfs_freeze(struct super_block *sb)
 {
+ 	struct btrfs_fs_info *fs_info = btrfs_sb(sb);
+	struct btrfs_transaction *trans;
+
+	spin_lock(&fs_info->trans_lock);
+	trans = fs_info->running_transaction;
+	if (trans) {
+		printk("Transid %llu, use_count %d, num_writer %d\n",
+			trans->transid, atomic_read(&trans->use_count),
+			atomic_read(&trans->num_writers));
+	}
+	spin_unlock(&fs_info->trans_lock);
 	return 0;
 }

I found there was a orphan transaction after the freeze operation was done.

It is because the transaction may not be committed when the transaction handle
end even though it is the last handle of the current transaction. This design
avoid committing the transaction frequently, but also introduce the above
problem.

So I add btrfs_attach_transaction() which can catch the current transaction
and commit it. If there is no transaction, it will return ENOENT, and do not
anything.

This function also can be used to instead of btrfs_join_transaction_freeze()
because it don't increase the writer counter and don't start a new transaction,
so it also can fix the deadlock between sync and freeze.

Besides that, it is used to instead of btrfs_join_transaction() in
transaction_kthread(), because if there is no transaction, the transaction
kthread needn't anything.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-09 09:15:39 -04:00
Miao Xie
a698d0755a Btrfs: add a type field for the transaction handle
This patch add a type field into the transaction handle structure,
in this way, we needn't implement various end-transaction functions
and can make the code more simple and readable.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-09 09:15:38 -04:00
Miao Xie
e8830e606f Btrfs: fix memory leak in start_transaction()
This patch fixes memory leak of the transaction handle which happened
when starting transaction failed on a freezed fs.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-09 09:15:38 -04:00
Mark Fasheh
d24bec3ae5 btrfs: extended inode ref iteration
The iterate_irefs in backref.c is used to build path components from inode
refs. This patch adds code to iterate extended refs as well.

I had modify the callback function signature to abstract out some of the
differences between ref structures. iref_to_path() also needed similar
changes.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
2012-10-09 09:15:01 -04:00
Mark Fasheh
f186373fef btrfs: extended inode refs
This patch adds basic support for extended inode refs. This includes support
for link and unlink of the refs, which basically gets us support for rename
as well.

Inode creation does not need changing - extended refs are only added after
the ref array is full.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
2012-10-09 09:14:45 -04:00
Konstantin Khlebnikov
0b173bc4da mm: kill vma flag VM_CAN_NONLINEAR
Move actual pte filling for non-linear file mappings into the new special
vma operation: ->remap_pages().

Filesystems must implement this method to get non-linear mapping support,
if it uses filemap_fault() then generic_file_remap_pages() can be used.

Now device drivers can implement this method and obtain nonlinear vma support.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>	#arch/tile
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Venkatesh Pallipadi <venki@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 16:22:17 +09:00
Jan Schmidt
5a1d7843ca btrfs: improved readablity for add_inode_ref
Moved part of the code into a sub function and replaced most of the gotos
by ifs, hoping that it will be easier to read now.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
2012-10-08 20:09:02 -04:00
Josef Bacik
0aa4a17d82 Btrfs: handle not finding the extent exactly when logging changed extents
I started hitting warnings when running xfstest 68 in a loop because there
were EM's that were not lined up properly with the physical extents.  This
is ok, if we do something like punch a hole or write to a preallocated space
or something like that we can have an EM that doesn't cover the entire
physical extent.  So fix the tree logging stuff to cope with this case so we
don't just commit the transaction.  With this patch I no longer see the
warnings from the tree logging code.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-08 20:09:02 -04:00
David Sterba
005d6427ac btrfs: move transaction aborts to the point of failure
Call btrfs_abort_transaction as early as possible when an error
condition is detected, that way the line number reported is useful
and we're not clueless anymore which error path led to the abort.

Signed-off-by: David Sterba <dsterba@suse.cz>
2012-10-08 20:09:02 -04:00
Miao Xie
8732d44f80 Btrfs: fix the missing error information in create_pending_snapshot()
The macro btrfs_abort_transaction() can get the line number of the code
where the problem happens, so we should invoke it in the place that the
error occurs, or we will lose the line number.

Reported-by: David Sterba <dave@jikos.cz>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-08 20:07:33 -04:00
Liu Bo
aa42ffd918 Btrfs: fix off-by-one in file clone
Btrfs uses inclusive range end for lock_extent(), unlock_extent() and
related functions, so we made off-by-one errors in file clone.

This fixes it and also fixes some style problems.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
2012-10-08 20:07:32 -04:00
David Sterba
7e97b8daf6 btrfs: allow setting NOCOW for a zero sized file via ioctl
Hi,

the patch si simple, but it has user visible impact and I'm not quite sure how
to resolve it.

In short, $subj says it, chattr -C supports it and we want to use it.

The conditions that acutally allow to change the NOCOW flag are clear. What if
I try to set the flag on a file that is not empty? Options:

1) whole ioctl will fail, EINVAL
2.1) ioctl will succeed, the NOCOW flag will be silently removed, but the file
     will stay COW-ed and checksummed
2.2) ioctl will succeed, flag will not be removed and a syslog message will
     warn that the COW flag has not been changed
2.2.1) dtto, no syslog message

Man page of chattr states that

 "If it is set on a file which already has data blocks, it is undefined when
 the blocks assigned to the file will be fully stable."

Yes, it's undefined and with current implementation it'll never happen. So from
this end, the user cannot expect anything. I'm trying to find a reasonable
behaviour, so that a command like 'chattr -R -aijS +C' to tweak a broad set of
flags in a deep directory does not fail unnecessarily and does not pollute the
log.

My personal preference is 2.2.1, but my dev's oppinion is skewed, not counting
the fact that I know the code and otherwise would look there before consulting
the documentation.

The patch implements 2.2.1.

david

-------------8<-------------------
From: David Sterba <dsterba@suse.cz>

It's safe to turn off checksums for a zero sized file.

http://thread.gmane.org/gmane.comp.file-systems.btrfs/18030

"We cannot switch on NODATASUM for a file that already has extents that
are checksummed. The invariant here is that either all the extents or
none are checksummed.

Theoretically it's possible to add/remove all checksums from a given
file, but it's a potentially longtime operation, the file has to be in
some intermediate state where the checksums partially exist but have to
be ignored (for the csum->nocsum) until the file is fully converted,
this brings more special cases to extent handling, it has to survive
power failure and remain consistent, and probably needs to be restarted
after next mount."

Signed-off-by: David Sterba <dsterba@suse.cz>
2012-10-04 09:40:00 -04:00
Josef Bacik
c3308f84c1 Btrfs: fix punch hole when no extent exists
I saw the warning in btrfs_drop_extent_cache where our end is less than our
start while running xfstests 68 in a loop.  This is because we
unconditionally do drop_end = min(end, extent_end) in
__btrfs_drop_extents(), even though we may not have found an extent in the
range we were looking to drop.  So keep track of wether or not we found
something, and if we didn't just use our end.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-04 09:40:00 -04:00
Josef Bacik
926ced123b Btrfs: don't do anything in our ->freeze_fs and ->unfreeze_fs
We do not need to do anything special to freeze or unfreeze, it's all taken
care of by the generic work, and what we currently have is wrong anyway
since we shouldn't be returnning to userspace with mutexes held anyway.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-04 09:40:00 -04:00
Josef Bacik
892951a92e Btrfs: remove unused write cache pages hook
The btree inode has it's own write cache pages so we can remove this write
cache pages hook as it's not used.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-04 09:39:59 -04:00
Josef Bacik
b5bae2612a Btrfs: fix race when getting the eb out of page->private
We can race when checking wether PagePrivate is set on a page and we
actually have an eb saved in the pages private pointer.  We could have
easily written out this page and released it in the time that we did the
pagevec lookup and actually got around to looking at this page.  So use
mapping->private_lock to ensure we get a consistent view of the
page->private pointer.  This is inline with the alloc and releasepage paths
which use private_lock when manipulating page->private.  Thanks,

Reported-by: David Sterba <dave@jikos.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-04 09:39:59 -04:00
Josef Bacik
ff44c6e36d Btrfs: do not hold the write_lock on the extent tree while logging
Dave Sterba pointed out a sleeping while atomic bug while doing fsync.  This
is because I'm an idiot and didn't realize that rwlock's were spin locks, so
we've been holding this thing while doing allocations and such which is not
good.  This patch fixes this by dropping the write lock before we do
anything heavy and re-acquire it when it is done.  We also need to take a
ref on the em's in case their corresponding pages are evicted and mark them
as being logged so that releasepage does not remove them and doesn't remove
them from our local list.  Thanks,

Reported-by: Dave Sterba <dave@jikos.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-04 09:39:58 -04:00
Josef Bacik
98114659e0 Btrfs: fix race with freeze and free space inodes
So we start our freeze, somebody comes in and does an fsync() on a file
where we have to commit a transaction for whatever reason, and we will
deadlock because the freeze is waiting on FS_FREEZE people to stop writing
to the file system, but the transaction is waiting for its free space inodes
to be written out, which are in turn waiting on sb_start_intwrite while
trying to write the file extents.  To fix this we'll just skip the
sb_start_intwrite() if we TRANS_JOIN_NOLOCK since we're being waited on by a
transaction commit so we're safe wrt to freeze and this will keep us from
deadlocking.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-04 09:39:58 -04:00
Liu Bo
6bbe3a9c80 Btrfs: kill obsolete arguments in btrfs_wait_ordered_extents
nocow_only is now an obsolete argument.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
2012-10-04 09:39:57 -04:00
Liu Bo
2e90cf858f Btrfs: cleanup fs_info->hashers
fs_info->hashers is now an obsolete one.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
2012-10-04 09:39:57 -04:00
Liu Bo
ab26e9d6c8 Btrfs: cleanup for duplicated code in find_free_extent
There is already an 'add free space' phrase in front of this one, we
needn't to redo it.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
2012-10-04 09:39:57 -04:00
Josef Bacik
60376ce4a8 Btrfs: fix race in sync and freeze again
I screwed this up, there is a race between checking if there is a running
transaction and actually starting a transaction in sync where we could race
with a freezer and get ourselves into trouble.  To fix this we need to make
a new join type to only do the try lock on the freeze stuff.  If it fails
we'll return EPERM and just return from sync.  This fixes a hang Liu Bo
reported when running xfstest 68 in a loop.  Thanks,

Reported-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-04 09:39:56 -04:00
David Sterba
b3ae244e71 btrfs: return EPERM upon rmdir on a subvolume
A subvolume cannot be deleted via rmdir, but the error code ENOTEMPTY
is confusing. Return EPERM instead, as this is not permitted.

Signed-off-by: David Sterba <dsterba@suse.cz>
2012-10-04 09:39:56 -04:00
Wei Yongjun
ebb3dad435 Btrfs: using for_each_set_bit_from to simplify the code
Using for_each_set_bit_from() to simplify the code.

spatch with a semantic match is used to found this.
(http://coccinelle.lip6.fr/)

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
2012-10-04 09:39:55 -04:00
Anand Jain
1bcea35597 Btrfs: write_buf is now callable outside send.c
Developing service cmds needs it.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
2012-10-04 09:39:55 -04:00
Tsutomu Itoh
b4f359ab06 Btrfs: remove unnecessary code in btree_get_extent()
Unnecessary lookup_extent_mapping() is removed because an error is
returned to the caller.
This patch was made based on the advice from Stefan Behrens, thanks.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
2012-10-04 09:39:54 -04:00
Tsutomu Itoh
0433f20d43 Btrfs: cleanup of error processing in btree_get_extent()
This patch simplifies a little complex error processing in
btree_get_extent().

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
2012-10-04 09:39:54 -04:00
Linus Torvalds
aab174f0df Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs update from Al Viro:

 - big one - consolidation of descriptor-related logics; almost all of
   that is moved to fs/file.c

   (BTW, I'm seriously tempted to rename the result to fd.c.  As it is,
   we have a situation when file_table.c is about handling of struct
   file and file.c is about handling of descriptor tables; the reasons
   are historical - file_table.c used to be about a static array of
   struct file we used to have way back).

   A lot of stray ends got cleaned up and converted to saner primitives,
   disgusting mess in android/binder.c is still disgusting, but at least
   doesn't poke so much in descriptor table guts anymore.  A bunch of
   relatively minor races got fixed in process, plus an ext4 struct file
   leak.

 - related thing - fget_light() partially unuglified; see fdget() in
   there (and yes, it generates the code as good as we used to have).

 - also related - bits of Cyrill's procfs stuff that got entangled into
   that work; _not_ all of it, just the initial move to fs/proc/fd.c and
   switch of fdinfo to seq_file.

 - Alex's fs/coredump.c spiltoff - the same story, had been easier to
   take that commit than mess with conflicts.  The rest is a separate
   pile, this was just a mechanical code movement.

 - a few misc patches all over the place.  Not all for this cycle,
   there'll be more (and quite a few currently sit in akpm's tree)."

Fix up trivial conflicts in the android binder driver, and some fairly
simple conflicts due to two different changes to the sock_alloc_file()
interface ("take descriptor handling from sock_alloc_file() to callers"
vs "net: Providing protocol type via system.sockprotoname xattr of
/proc/PID/fd entries" adding a dentry name to the socket)

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits)
  MAX_LFS_FILESIZE should be a loff_t
  compat: fs: Generic compat_sys_sendfile implementation
  fs: push rcu_barrier() from deactivate_locked_super() to filesystems
  btrfs: reada_extent doesn't need kref for refcount
  coredump: move core dump functionality into its own file
  coredump: prevent double-free on an error path in core dumper
  usb/gadget: fix misannotations
  fcntl: fix misannotations
  ceph: don't abuse d_delete() on failure exits
  hypfs: ->d_parent is never NULL or negative
  vfs: delete surplus inode NULL check
  switch simple cases of fget_light to fdget
  new helpers: fdget()/fdput()
  switch o2hb_region_dev_write() to fget_light()
  proc_map_files_readdir(): don't bother with grabbing files
  make get_file() return its argument
  vhost_set_vring(): turn pollstart/pollstop into bool
  switch prctl_set_mm_exe_file() to fget_light()
  switch xfs_find_handle() to fget_light()
  switch xfs_swapext() to fget_light()
  ...
2012-10-02 20:25:04 -07:00
Kirill A. Shutemov
8c0a853770 fs: push rcu_barrier() from deactivate_locked_super() to filesystems
There's no reason to call rcu_barrier() on every
deactivate_locked_super().  We only need to make sure that all delayed rcu
free inodes are flushed before we destroy related cache.

Removing rcu_barrier() from deactivate_locked_super() affects some fast
paths.  E.g.  on my machine exit_group() of a last process in IPC
namespace takes 0.07538s.  rcu_barrier() takes 0.05188s of that time.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-02 21:35:55 -04:00
Al Viro
99621b44aa btrfs: reada_extent doesn't need kref for refcount
All increments and decrements are under the same spinlock - have to be,
since they need to protect the radix_tree it's found in.  Just use
int, no need to wank with kref...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-02 21:35:55 -04:00
Linus Torvalds
437589a74b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull user namespace changes from Eric Biederman:
 "This is a mostly modest set of changes to enable basic user namespace
  support.  This allows the code to code to compile with user namespaces
  enabled and removes the assumption there is only the initial user
  namespace.  Everything is converted except for the most complex of the
  filesystems: autofs4, 9p, afs, ceph, cifs, coda, fuse, gfs2, ncpfs,
  nfs, ocfs2 and xfs as those patches need a bit more review.

  The strategy is to push kuid_t and kgid_t values are far down into
  subsystems and filesystems as reasonable.  Leaving the make_kuid and
  from_kuid operations to happen at the edge of userspace, as the values
  come off the disk, and as the values come in from the network.
  Letting compile type incompatible compile errors (present when user
  namespaces are enabled) guide me to find the issues.

  The most tricky areas have been the places where we had an implicit
  union of uid and gid values and were storing them in an unsigned int.
  Those places were converted into explicit unions.  I made certain to
  handle those places with simple trivial patches.

  Out of that work I discovered we have generic interfaces for storing
  quota by projid.  I had never heard of the project identifiers before.
  Adding full user namespace support for project identifiers accounts
  for most of the code size growth in my git tree.

  Ultimately there will be work to relax privlige checks from
  "capable(FOO)" to "ns_capable(user_ns, FOO)" where it is safe allowing
  root in a user names to do those things that today we only forbid to
  non-root users because it will confuse suid root applications.

  While I was pushing kuid_t and kgid_t changes deep into the audit code
  I made a few other cleanups.  I capitalized on the fact we process
  netlink messages in the context of the message sender.  I removed
  usage of NETLINK_CRED, and started directly using current->tty.

  Some of these patches have also made it into maintainer trees, with no
  problems from identical code from different trees showing up in
  linux-next.

  After reading through all of this code I feel like I might be able to
  win a game of kernel trivial pursuit."

Fix up some fairly trivial conflicts in netfilter uid/git logging code.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (107 commits)
  userns: Convert the ufs filesystem to use kuid/kgid where appropriate
  userns: Convert the udf filesystem to use kuid/kgid where appropriate
  userns: Convert ubifs to use kuid/kgid
  userns: Convert squashfs to use kuid/kgid where appropriate
  userns: Convert reiserfs to use kuid and kgid where appropriate
  userns: Convert jfs to use kuid/kgid where appropriate
  userns: Convert jffs2 to use kuid and kgid where appropriate
  userns: Convert hpfs to use kuid and kgid where appropriate
  userns: Convert btrfs to use kuid/kgid where appropriate
  userns: Convert bfs to use kuid/kgid where appropriate
  userns: Convert affs to use kuid/kgid wherwe appropriate
  userns: On alpha modify linux_to_osf_stat to use convert from kuids and kgids
  userns: On ia64 deal with current_uid and current_gid being kuid and kgid
  userns: On ppc convert current_uid from a kuid before printing.
  userns: Convert s390 getting uid and gid system calls to use kuid and kgid
  userns: Convert s390 hypfs to use kuid and kgid where appropriate
  userns: Convert binder ipc to use kuids
  userns: Teach security_path_chown to take kuids and kgids
  userns: Add user namespace support to IMA
  userns: Convert EVM to deal with kuids and kgids in it's hmac computation
  ...
2012-10-02 11:11:09 -07:00
Miao Xie
90abccf2c6 Revert "Btrfs: do not do filemap_write_and_wait_range in fsync"
This reverts commit 0885ef5b56

After applying the above patch, the performance slowed down because the dirty
page flush can only be done by one task, so revert it.

The following is the test result of sysbench:
	Before		After
	24MB/s		39MB/s

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-01 15:19:22 -04:00
Josef Bacik
698d0082c4 Btrfs: remove bytes argument from do_chunk_alloc
Everybody is just making stuff up, and it's just used to see if we really do
need to alloc a chunk, and since we do this when we already know we really
do it's just a waste of space.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01 15:19:21 -04:00
Josef Bacik
ea658badc4 Btrfs: delay block group item insertion
So we have lots of places where we try to preallocate chunks in order to
make sure we have enough space as we make our allocations.  This has
historically meant that we're constantly tweaking when we should allocate a
new chunk, and historically we have gotten this horribly wrong so we way
over allocate either metadata or data.  To try and keep this from happening
we are going to make it so that the block group item insertion is done out
of band at the end of a transaction.  This will allow us to create chunks
even if we are trying to make an allocation for the extent tree.  With this
patch my enospc tests run faster (didn't expect this) and more efficiently
use the disk space (this is what I wanted).  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01 15:19:21 -04:00
Kent Overstreet
be3940c0a9 btrfs: Kill some bi_idx references
For immutable bio vecs, I've been auditing and removing bi_idx
references. These were harmless, but removing them will make auditing
easier.

scrub_bio_end_io_worker() was open coding a bio_reset() - but this
doesn't appear to have been needed for anything as right after it does a
bio_put(), and perusing the code it doesn't appear anything else was
holding a reference to the bio.

The other use end_bio_extent_readpage() was just for a pr_debug() -
changed it to something that might be a bit more useful.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Chris Mason <chris.mason@oracle.com>
CC: Stefan Behrens <sbehrens@giantdisaster.de>
2012-10-01 15:19:21 -04:00
Miao Xie
962197babe Btrfs: fix unnecessary warning when the fragments make the space alloc fail
When we wrote some data by compress mode into a btrfs filesystem which was full
of the fragments, the kernel will report:
	BTRFS warning (device xxx): Aborting unused transaction.

The reason is:
We can not find a long enough free space to store the compressed data because
of the fragmentary free space, and the compressed data can not be splited,
so the kernel outputed the above message.

In fact, btrfs can deal with this problem very well: it fall back to
uncompressed IO, split the uncompressed data into small ones, and then
store them into to the fragmentary free space. So we shouldn't output the
above warning message.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-01 15:19:20 -04:00
Josef Bacik
69ffb54347 Btrfs: create a pinned em when writing to a prealloc range in DIO
Wade Cline reported a problem where he was getting garbage and warnings when
writing to a preallocated range via O_DIRECT.  This is because we weren't
creating our normal pinned extent_map for the range we were writing to,
which was causing all sorts of issues.  This patch fixes the problem and
makes his testcase much happier.  Thanks,

Reported-by: Wade Cline <clinew@linux.vnet.ibm.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01 15:19:20 -04:00
Josef Bacik
6df7881a84 Btrfs: move the sb_end_intwrite until after the throttle logic
Sage reported the following lockdep backtrace

=====================================
[ BUG: bad unlock balance detected! ]
3.6.0-rc2-ceph-00171-gc7ed62d #1 Not tainted
-------------------------------------
btrfs-cleaner/7607 is trying to release lock (sb_internal) at:
[<ffffffffa00422ae>] btrfs_commit_transaction+0xa6e/0xb20 [btrfs]
but there are no more locks to release!

other info that might help us debug this:
1 lock held by btrfs-cleaner/7607:
 #0:  (&fs_info->cleaner_mutex){+.+...}, at: [<ffffffffa003b405>] cleaner_kthread+0x95/0x120 [btrfs]

stack backtrace:
Pid: 7607, comm: btrfs-cleaner Not tainted 3.6.0-rc2-ceph-00171-gc7ed62d #1
Call Trace:
 [<ffffffffa00422ae>] ? btrfs_commit_transaction+0xa6e/0xb20 [btrfs]
 [<ffffffff810afa9e>] print_unlock_inbalance_bug+0xfe/0x110
 [<ffffffff810b289e>] lock_release_non_nested+0x1ee/0x310
 [<ffffffff81172f9b>] ? kmem_cache_free+0x7b/0x160
 [<ffffffffa004106c>] ? put_transaction+0x8c/0x130 [btrfs]
 [<ffffffffa00422ae>] ? btrfs_commit_transaction+0xa6e/0xb20 [btrfs]
 [<ffffffff810b2a95>] lock_release+0xd5/0x220
 [<ffffffff81173071>] ? kmem_cache_free+0x151/0x160
 [<ffffffff8117d9ed>] __sb_end_write+0x7d/0x90
 [<ffffffffa00422ae>] btrfs_commit_transaction+0xa6e/0xb20 [btrfs]
 [<ffffffff81079850>] ? __init_waitqueue_head+0x60/0x60
 [<ffffffff81634c6b>] ? _raw_spin_unlock+0x2b/0x40
 [<ffffffffa0042758>] __btrfs_end_transaction+0x368/0x3c0 [btrfs]
 [<ffffffffa0042808>] btrfs_end_transaction_throttle+0x18/0x20 [btrfs]
 [<ffffffffa00318f0>] btrfs_drop_snapshot+0x410/0x600 [btrfs]
 [<ffffffff8132babd>] ? do_raw_spin_unlock+0x5d/0xb0
 [<ffffffffa00430ef>] btrfs_clean_old_snapshots+0xaf/0x150 [btrfs]
 [<ffffffffa003b405>] ? cleaner_kthread+0x95/0x120 [btrfs]
 [<ffffffffa003b419>] cleaner_kthread+0xa9/0x120 [btrfs]
 [<ffffffffa003b370>] ? btrfs_destroy_delayed_refs.isra.102+0x220/0x220 [btrfs]
 [<ffffffff810791ee>] kthread+0xae/0xc0
 [<ffffffff810b379d>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffff8163e744>] kernel_thread_helper+0x4/0x10
 [<ffffffff81635430>] ? retint_restore_args+0x13/0x13
 [<ffffffff81079140>] ? flush_kthread_work+0x1a0/0x1a0
 [<ffffffff8163e740>] ? gs_change+0x13/0x13

This is because the throttle stuff can commit the transaction, which expects to
be the one stopping the intwrite stuff, but we've already done it in the
__btrfs_end_transaction.  Moving the sb_end_intewrite after this logic makes the
lockdep go away.  Thanks,

Tested-by: Sage Weil <sage@inktank.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01 15:19:19 -04:00
Liu Bo
425d17a290 Btrfs: use larger limit for translation of logical to inode
This is the change of the kernel side.

Translation of logical to inode used to have an upper limit 4k on
inode container's size, but the limit is not large enough for a data
with a great many of refs, so when resolving logical address,
we can end up with
"ioctl ret=0, bytes_left=0, bytes_missing=19944, cnt=510, missed=2493"

This changes to regard 64k as the upper limit and use vmalloc instead of
kmalloc to get memory more easily.

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
2012-10-01 15:19:19 -04:00
Liu Bo
df031f0752 Btrfs: use helper for logical resolve
We already have a helper, iterate_inodes_from_logical(), for logical resolve,
so just use it.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
2012-10-01 15:19:18 -04:00
Liu Bo
69917e4312 Btrfs: fix a bug in parsing return value in logical resolve
In logical resolve, we parse extent_from_logical()'s 'ret' as a kind of flag.

It is possible to lose our errors because
(-EXXXX & BTRFS_EXTENT_FLAG_TREE_BLOCK) is true.

I'm not sure if it is on purpose, it just looks too hacky if it is.
I'd rather use a real flag and a 'ret' to catch errors.

Acked-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Liu Bo <liub.liubo@gmail.com>
2012-10-01 15:19:18 -04:00
liubo
0647d6bd16 Btrfs: cleanup for unused ref cache stuff
As ref cache has been removed from btrfs, there is no user on
its lock and its check.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
2012-10-01 15:19:17 -04:00
Miao Xie
8407aa4643 Btrfs: fix corrupted metadata in the snapshot
When we delete a inode, we will remove all the delayed items including delayed
inode update, and then truncate all the relative metadata. If there is lots of
metadata, we will end the current transaction, and start a new transaction to
truncate the left metadata. In this way, we will leave a inode item that its
link counter is > 0, and also may leave some directory index items in fs/file tree
after the current transaction ends. In other words, the metadata in this fs/file tree
is inconsistent. If we create a snapshot for this tree now, we will find a inode with
corrupted metadata in the new snapshot, and we won't continue to drop the left metadata,
because its link counter is not 0.

We fix this problem by updating the inode item before the current transaction ends.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-01 15:19:17 -04:00
David Sterba
837e197283 btrfs: polish names of kmem caches
Usecase:

  watch 'grep btrfs < /proc/slabinfo'

easy to watch all caches in one go.

Signed-off-by: David Sterba <dsterba@suse.cz>
2012-10-01 15:19:16 -04:00
Josef Bacik
a80c8dcf7e Btrfs: fix our overcommit math
I noticed I was seeing large lags when running my torrent test in a vm on my
laptop.  While trying to make it lag less I noticed that our overcommit math
was taking into account the number of bytes we wanted to reclaim, not the
number of bytes we actually wanted to allocate, which means we wouldn't
overcommit as often.  This patch fixes the overcommit math and makes
shrink_delalloc() use that logic so that it will stop looping faster.  We
still have pretty high spikes of latency, but the test now takes 3 minutes
less time (about 5% faster).  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01 15:19:16 -04:00
Josef Bacik
dea31f5233 Btrfs: wait on async pages when shrinking delalloc
Mitch reported a problem where you could get an ENOSPC error when untarring
a kernel git tree onto a 16gb file system with compress-force=zlib.  This is
because compression is a huge pain, it will return from ->writepages()
without having actually created any ordered extents.  To get around this we
check to see if the async submit counter is up, and if it is wait until it
drops to 0 before doing our normal ordered wait dance.  With this patch I
can now untar a kernel git tree onto a 16gb file system without getting
ENOSPC errors.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01 15:19:15 -04:00
Liu Bo
9e8a4a8b0b Btrfs: use flag EXTENT_DEFRAG for snapshot-aware defrag
We're going to use this flag EXTENT_DEFRAG to indicate which range
belongs to defragment so that we can implement snapshow-aware defrag:

We set the EXTENT_DEFRAG flag when dirtying the extents that need
defragmented, so later on writeback thread can differentiate between
normal writeback and writeback started by defragmentation.

Original-Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
2012-10-01 15:19:15 -04:00
Tsutomu Itoh
3d6b5c3b5c Btrfs: check return value of ulist_alloc() properly
ulist_alloc() has the possibility of returning NULL.
So, it is necessary to check the return value.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
2012-10-01 15:19:14 -04:00
Tsutomu Itoh
f54fb859da Btrfs: fix error handling in delete_block_group_cache()
btrfs_iget() never return NULL.
So, NULL check is unnecessary.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
2012-10-01 15:19:14 -04:00
Miao Xie
903889f462 Btrfs: fix wrong size for the reservation when doing, file pre-allocation.
When we ran fsstress(a program in xfstests), the filesystem hung up when it
is full. It was because the space reserved in btrfs_fallocate() was wrong,
btrfs_fallocate() just used the size of the pre-allocation to reserve the
space, didn't took the block size aligning into account, so the size of
the reserved space was less than the allocated space, it caused the over
reserve problem and made the filesystem hung up when invoking cow_file_range().
Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-01 15:19:14 -04:00
Miao Xie
69ce977a17 Btrfs: output more information when aborting a unused transaction handle
Though we dump the stack information when aborting a unused transaction
handle, we don't know the correct place where we decide to abort the
transaction handle if one function has several place where the transaction
abort function is invoked and jumps to the same place after this call.
And beside that we also don't know the reason why we jump to abort
the current handle. So I modify the transaction abort function and make
it output the function name, line and error information.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-01 15:19:13 -04:00
Miao Xie
2ecb79239b Btrfs: fix unprotected ->log_batch
We forget to protect ->log_batch when syncing a file, this patch fix
this problem by atomic operation. And ->log_batch is used to check
if there are parallel sync operations or not, so it is unnecessary to
reset it to 0 after the sync operation of the current log tree complete.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-01 15:19:12 -04:00
Miao Xie
48c03c4bcf Btrfs: fix wrong size for the reservation of the, snapshot creation
We should insert/update 6 items(root ref, root backref, dir item, dir index,
root item and parent inode) when creating a snapshot, not 5 items, fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-01 15:19:12 -04:00
Miao Xie
42874b3db7 Btrfs: fix the snapshot that should not exist
The snapshot should be the image of the fs tree before it was created,
so the metadata of the snapshot should not exist in the its tree. But now, we
found the directory item and directory name index is in both the snapshot tree
and the fs tree. It introduces some problems and makes the users feel strange:

 # mkfs.btrfs /dev/sda1
 # mount /dev/sda1 /mnt
 # mkdir /mnt/1
 # cd /mnt/1
 # btrfs subvolume snapshot /mnt snap0
 # ls -a /mnt/1/snap0/1
 .	..	[no other file/dir]

 # ll /mnt/1/snap0/
 total 0
 drwxr-xr-x 1 root root 10 Ju1 24 12:11 1
			^^^
			There is no file/dir in it, but it's size is 10

 # cd /mnt/1/snap0/1/snap0
 [Enter a unexisted directory successfully...]

There is nothing in the directory 1 in snap0, but btrfs told the length of
this directory is 10. Beside that, we can enter an unexisted directory, it is
very strange to the users.

 # btrfs subvolume snapshot /mnt/1/snap0 /mnt/snap1
 # ll /mnt/1/snap0/1/
 total 0
 [None]
 # ll /mnt/snap1/1/
 total 0
 drwxr-xr-x 1 root root 0 Ju1 24 12:14 snap0

And the source of snap1 did have any directory in Directory 1, but snap1 have
a snap0, it is different between the source and the snapshot.

So I think we should insert directory item and directory name index and update
the parent inode as the last step of snapshot creation, and do not leave the
useless metadata in the file tree.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-01 15:19:12 -04:00
Miao Xie
66d8f3dd1c Btrfs: add a new "type" field into the block reservation structure
Sometimes we need choose the method of the reservation according to the type
of the block reservation, such as the reservation for the delayed inode update.
Now we identify the type just by comparing the address of the reservation
variants, it is very ugly if it is a temporary one because we need compare it
with all the common reservation variants. So we add a new "type" field to keep
the type the reservation variants.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-01 15:19:11 -04:00
Miao Xie
6352b91da1 Btrfs: use a slab for ordered extents allocation
The ordered extent allocation is in the fast path of the IO, so use a slab
to improve the speed of the allocation.

 "Size of the struct is 280, so this will fall into the size-512 bucket,
  giving 8 objects per page, while own slab will pack 14 objects into a page.

  Another benefit I see is to check for leaked objects when the module is
  removed (and the cache destroy takes place)."
						-- David Sterba

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-01 15:19:11 -04:00
Miao Xie
b9a8cc5bef Btrfs: fix file extent discount problem in the, snapshot
If a snapshot is created while we are writing some data into the file,
the i_size of the corresponding file in the snapshot will be wrong, it will
be beyond the end of the last file extent. And btrfsck will report:
  root 256 inode 257 errors 100

Steps to reproduce:
 # mkfs.btrfs <partition>
 # mount <partition> <mnt>
 # cd <mnt>
 # dd if=/dev/zero of=tmpfile bs=4M count=1024 &
 # for ((i=0; i<4; i++))
 > do
 > btrfs sub snap . $i
 > done

This because the algorithm of disk_i_size update is wrong. Though there are
some ordered extents behind the current one which we use to update disk_i_size,
it doesn't mean those extents will be dealt with in the same transaction. So
We shouldn't use the offset of those extents to update disk_i_size. Or we will
get the wrong i_size in the snapshot.

We fix this problem by recording the max real i_size. If we find there is a
ordered extent which is in front of the current one and doesn't complete, we
will record the end of the current one into that ordered extent. Surely, if
the current extent holds the end of other extent(it must be greater than
the current one because it is behind the current one), we will record the
number that the current extent holds. In this way, we can exclude the ordered
extents that may not be dealth with in the same transaction, and be easy to
know the real disk_i_size.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-01 15:19:10 -04:00
Miao Xie
361048f586 Btrfs: fix full backref problem when inserting shared block reference
If we create several snapshots at the same time, the following BUG_ON() will be
triggered.

	kernel BUG at fs/btrfs/extent-tree.c:6047!

Steps to reproduce:
 # mkfs.btrfs <partition>
 # mount <partition> <mnt>
 # cd <mnt>
 # for ((i=0;i<2400;i++)); do touch long_name_to_make_tree_more_deep$i; done
 # for ((i=0; i<4; i++))
 > do
 > mkdir $i
 > for ((j=0; j<200; j++))
 > do
 > btrfs sub snap . $i/$j
 > done &
 > done

The reason is:
Before transaction commit, some operations changed the fs tree and new tree
blocks were allocated because of COW. We used the implicit non-shared back
reference for those newly allocated tree blocks because they were not shared by
two or more trees.

And then we created the first snapshot for the fs tree, according to the back
reference rules, we also used implicit back refs for the child tree blocks of
the root node of the fs tree, now those child nodes/leaves were shared by two
trees.

Then We didn't deal with the delayed references, and continued to change the fs
tree(created the second snapshot and inserted the dir item of the new snapshot
into the fs tree). According to the rules of the back reference, we added full
back refs for those tree blocks whose parents have be shared by two trees.
Now some newly allocated tree blocks had two types of the references.

As we know, the delayed reference system handles these delayed references from
back to front, and the full delayed reference is inserted after the implicit
ones. So when we dealt with the back references of those newly allocated tree
blocks, the full references was dealt with at first. And if the first reference
is a shared back reference and the tree block that the reference points to is
newly allocated, It would be considered as a tree block which is shared by two
or more trees when it is allocated and should be a full back reference not a
implicit one, the flag of its reference also should be set to FULL_BACKREF.
But in fact, it was a non-shared tree block with a implicit reference at
beginning, so it was not compulsory to set the flags to FULL_BACKREF. So BUG_ON
was triggered.

We have several methods to fix this bug:
1. deal with delayed references after the snapshot is created and before we
   change the source tree of the snapshot. This is the easiest and safest way.
2. modify the sort method of the delayed reference tree, make the full delayed
   references be inserted before the implicit ones. It is also very easy, but
   I don't know if it will introduce some problems or not.
3. modify select_delayed_ref() and make it select the implicit delayed reference
   at first. This way is not so good because it may wastes CPU time if we have
   lots of delayed references.
4. set the flags to FULL_BACKREF, this method is a little complex comparing with
   the 1st way.

I chose the 1st way to fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-01 15:19:10 -04:00
Miao Xie
6fa9700e73 Btrfs: fix error path in create_pending_snapshot()
This patch fixes the following problem:
- If we failed to deal with the delayed dir items, we should abort transaction,
  just as its comment said. Fix it.
- If root reference or root back reference insertion failed, we should
  abort transaction. Fix it.
- Fix the double free problem of pending->inherit.
- Do not restore the trans->rsv if we doesn't change it.
- make the error path more clearly.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-01 15:19:09 -04:00
Wei Yongjun
cf93dccea6 Btrfs: fix possible memory leak in scrub_setup_recheck_block()
bbio has been malloced in btrfs_map_block() and should be
freed before leaving from the error handling cases.

spatch with a semantic match is used to found this problem.
(http://coccinelle.lip6.fr/)

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
2012-10-01 15:19:09 -04:00
Josef Bacik
7014cdb493 Btrfs: btrfs_drop_extent_cache should never fail
I noticed this when I was doing the fsync stuff, we allocate split extents if we
drop an extent range that is in the middle of an existing extent.  This BUG()'s
if we fail to allocate memory, but the fact is this is just a cache, we will
just regenerate the cache if we need it, the important part is that we free the
range we are given.  This can be done without allocations, so if we fail to
allocate splits just skip the splitting stage and free our em and look for more
extents to drop.  This also makes btrfs_drop_extent_cache a void since nobody
was checking the return value anyway.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01 15:19:09 -04:00
Sage Weil
ac14aed665 Btrfs: do not take cleanup_work_sem in btrfs_run_delayed_iputs()
Josef has suggested that this is not necessary.  Removing it also avoids
this lockdep splat (after the new sb_internal locking stuff was added):

[  604.090449] ======================================================
[  604.114819] [ INFO: possible circular locking dependency detected ]
[  604.139262] 3.6.0-rc2-ceph-00144-g463b030 #1 Not tainted
[  604.162193] -------------------------------------------------------
[  604.186139] btrfs-cleaner/6669 is trying to acquire lock:
[  604.209555]  (sb_internal#2){.+.+..}, at: [<ffffffffa0042b84>] start_transaction+0x124/0x430 [btrfs]
[  604.257100]
[  604.257100] but task is already holding lock:
[  604.300366]  (&fs_info->cleanup_work_sem){.+.+..}, at: [<ffffffffa0048002>] btrfs_run_delayed_iputs+0x72/0x130 [btrfs]
[  604.352989]
[  604.352989] which lock already depends on the new lock.
[  604.352989]
[  604.427104]
[  604.427104] the existing dependency chain (in reverse order) is:
[  604.478493]
[  604.478493] -> #1 (&fs_info->cleanup_work_sem){.+.+..}:
[  604.529313]        [<ffffffff810b2c82>] lock_acquire+0xa2/0x140
[  604.559621]        [<ffffffff81632b69>] down_read+0x39/0x4e
[  604.589382]        [<ffffffffa004db98>] btrfs_lookup_dentry+0x218/0x550 [btrfs]
[  604.596161] btrfs: unlinked 1 orphans
[  604.675002]        [<ffffffffa006aadd>] create_subvol+0x62d/0x690 [btrfs]
[  604.708859]        [<ffffffffa006d666>] btrfs_mksubvol.isra.52+0x346/0x3a0 [btrfs]
[  604.772466]        [<ffffffffa006d7f2>] btrfs_ioctl_snap_create_transid+0x132/0x190 [btrfs]
[  604.842245]        [<ffffffffa006d8ae>] btrfs_ioctl_snap_create+0x5e/0x80 [btrfs]
[  604.912852]        [<ffffffffa00708ae>] btrfs_ioctl+0x138e/0x1990 [btrfs]
[  604.951888]        [<ffffffff8118e9b8>] do_vfs_ioctl+0x98/0x560
[  604.989961]        [<ffffffff8118ef11>] sys_ioctl+0x91/0xa0
[  605.026628]        [<ffffffff8163d569>] system_call_fastpath+0x16/0x1b
[  605.064404]
[  605.064404] -> #0 (sb_internal#2){.+.+..}:
[  605.126832]        [<ffffffff810b25e8>] __lock_acquire+0x1ac8/0x1b90
[  605.163671]        [<ffffffff810b2c82>] lock_acquire+0xa2/0x140
[  605.200228]        [<ffffffff8117dac6>] __sb_start_write+0xc6/0x1b0
[  605.236818]        [<ffffffffa0042b84>] start_transaction+0x124/0x430 [btrfs]
[  605.274029]        [<ffffffffa00431a3>] btrfs_start_transaction+0x13/0x20 [btrfs]
[  605.340520]        [<ffffffffa004ccfa>] btrfs_evict_inode+0x19a/0x330 [btrfs]
[  605.378720]        [<ffffffff811972c8>] evict+0xb8/0x1c0
[  605.416057]        [<ffffffff811974d5>] iput+0x105/0x210
[  605.452373]        [<ffffffffa0048082>] btrfs_run_delayed_iputs+0xf2/0x130 [btrfs]
[  605.521627]        [<ffffffffa003b5e1>] cleaner_kthread+0xa1/0x120 [btrfs]
[  605.560520]        [<ffffffff810791ee>] kthread+0xae/0xc0
[  605.598094]        [<ffffffff8163e744>] kernel_thread_helper+0x4/0x10
[  605.636499]
[  605.636499] other info that might help us debug this:
[  605.636499]
[  605.736504]  Possible unsafe locking scenario:
[  605.736504]
[  605.801931]        CPU0                    CPU1
[  605.835126]        ----                    ----
[  605.867093]   lock(&fs_info->cleanup_work_sem);
[  605.898594]                                lock(sb_internal#2);
[  605.931954]                                lock(&fs_info->cleanup_work_sem);
[  605.965359]   lock(sb_internal#2);
[  605.994758]
[  605.994758]  *** DEADLOCK ***
[  605.994758]
[  606.075281] 2 locks held by btrfs-cleaner/6669:
[  606.104528]  #0:  (&fs_info->cleaner_mutex){+.+...}, at: [<ffffffffa003b5d5>] cleaner_kthread+0x95/0x120 [btrfs]
[  606.165626]  #1:  (&fs_info->cleanup_work_sem){.+.+..}, at: [<ffffffffa0048002>] btrfs_run_delayed_iputs+0x72/0x130 [btrfs]
[  606.231297]
[  606.231297] stack backtrace:
[  606.287723] Pid: 6669, comm: btrfs-cleaner Not tainted 3.6.0-rc2-ceph-00144-g463b030 #1
[  606.347823] Call Trace:
[  606.376184]  [<ffffffff8162a77c>] print_circular_bug+0x1fb/0x20c
[  606.409243]  [<ffffffff810b25e8>] __lock_acquire+0x1ac8/0x1b90
[  606.441343]  [<ffffffffa0042b84>] ? start_transaction+0x124/0x430 [btrfs]
[  606.474583]  [<ffffffff810b2c82>] lock_acquire+0xa2/0x140
[  606.505934]  [<ffffffffa0042b84>] ? start_transaction+0x124/0x430 [btrfs]
[  606.539429]  [<ffffffff8132babd>] ? do_raw_spin_unlock+0x5d/0xb0
[  606.571719]  [<ffffffff8117dac6>] __sb_start_write+0xc6/0x1b0
[  606.603498]  [<ffffffffa0042b84>] ? start_transaction+0x124/0x430 [btrfs]
[  606.637405]  [<ffffffffa0042b84>] ? start_transaction+0x124/0x430 [btrfs]
[  606.670165]  [<ffffffff81172e75>] ? kmem_cache_alloc+0xb5/0x160
[  606.702144]  [<ffffffffa0042b84>] start_transaction+0x124/0x430 [btrfs]
[  606.735562]  [<ffffffffa00256a6>] ? block_rsv_add_bytes+0x56/0x80 [btrfs]
[  606.769861]  [<ffffffffa00431a3>] btrfs_start_transaction+0x13/0x20 [btrfs]
[  606.804575]  [<ffffffffa004ccfa>] btrfs_evict_inode+0x19a/0x330 [btrfs]
[  606.838756]  [<ffffffff81634c6b>] ? _raw_spin_unlock+0x2b/0x40
[  606.872010]  [<ffffffff811972c8>] evict+0xb8/0x1c0
[  606.903800]  [<ffffffff811974d5>] iput+0x105/0x210
[  606.935416]  [<ffffffffa0048082>] btrfs_run_delayed_iputs+0xf2/0x130 [btrfs]
[  606.970510]  [<ffffffffa003b5d5>] ? cleaner_kthread+0x95/0x120 [btrfs]
[  607.005648]  [<ffffffffa003b5e1>] cleaner_kthread+0xa1/0x120 [btrfs]
[  607.040724]  [<ffffffffa003b540>] ? btrfs_destroy_delayed_refs.isra.102+0x220/0x220 [btrfs]
[  607.104740]  [<ffffffff810791ee>] kthread+0xae/0xc0
[  607.137119]  [<ffffffff810b379d>] ? trace_hardirqs_on+0xd/0x10
[  607.169797]  [<ffffffff8163e744>] kernel_thread_helper+0x4/0x10
[  607.202472]  [<ffffffff81635430>] ? retint_restore_args+0x13/0x13
[  607.235884]  [<ffffffff81079140>] ? flush_kthread_work+0x1a0/0x1a0
[  607.268731]  [<ffffffff8163e740>] ? gs_change+0x13/0x13

Signed-off-by: Sage Weil <sage@inktank.com>
2012-10-01 15:19:08 -04:00
Sage Weil
e209db7ace Btrfs: set journal_info in async trans commit worker
We expect current->journal_info to point to the trans handle we are
committing.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-10-01 15:19:08 -04:00
Sage Weil
6fc4e35485 Btrfs: pass lockdep rwsem metadata to async commit transaction
The freeze rwsem is taken by sb_start_intwrite() and dropped during the
commit_ or end_transaction().  In the async case, that happens in a worker
thread.  Tell lockdep the calling thread is releasing ownership of the
rwsem and the async thread is picking it up.

XFS plays the same trick in fs/xfs/xfs_aops.c.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-10-01 15:19:07 -04:00
Josef Bacik
2aaa665581 Btrfs: add hole punching
This patch adds hole punching via fallocate.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01 15:19:07 -04:00
Josef Bacik
2671485d39 Btrfs: remove unused hint byte argument for btrfs_drop_extents
I audited all users of btrfs_drop_extents and found that nobody actually uses
the hint_byte argument.  I'm sure it was used for something at some point but
it's not used now, and the way the pinning works the disk bytenr would never be
immediately useful anyway so lets just remove it.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01 15:19:06 -04:00
Liu Bo
d279440511 Btrfs: check if an inode has no checksum when logging it
This is based on Josef's "Btrfs: turbo charge fsync".

If an inode is a BTRFS_INODE_NODATASUM one, we don't need to look for csum
items any more.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
2012-10-01 15:19:06 -04:00
Liu Bo
46d8bc3424 Btrfs: fix a bug in checking whether a inode is already in log
This is based on Josef's "Btrfs: turbo charge fsync".

The current btrfs checks if an inode is in log by comparing
root's last_log_commit to inode's last_sub_trans[2].

But the problem is that this root->last_log_commit is shared among
inodes.

Say we have N inodes to be logged, after the first inode,
root's last_log_commit is updated and the N-1 remained files will
be skipped.

This fixes the bug by keeping a local copy of root's last_log_commit
inside each inode and this local copy will be maintained itself.

[1]: we regard each log transaction as a subset of btrfs's transaction,
i.e. sub_trans

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
2012-10-01 15:19:06 -04:00
Miao Xie
321f0e7022 Btrfs: fix wrong orphan count of the fs/file tree
If we add a new orphan item, we should increase the atomic counter,
not decrease it. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-01 15:19:05 -04:00
Liu Bo
4e2f84e63d Btrfs: improve fsync by filtering extents that we want
This is based on Josef's "Btrfs: turbo charge fsync".

The above Josef's patch performs very good in random sync write test,
because we won't have too much extents to merge.

However, it does not performs good on the test:
dd if=/dev/zero of=foobar bs=4k count=12500 oflag=sync

The reason is when we do sequencial sync write, we need to merge the
current extent just with the previous one, so that we can get accumulated
extents to log:

A(4k) --> AA(8k) --> AAA(12k) --> AAAA(16k) ...

So we'll have to flush more and more checksum into log tree, which is the
bottleneck according to my tests.

But we can avoid this by telling fsync the real extents that are needed
to be logged.

With this, I did the above dd sync write test (size=50m),

         w/o (orig)   w/ (josef's)   w/ (this)
SATA      104KB/s       109KB/s       121KB/s
ramdisk   1.5MB/s       1.5MB/s       10.7MB/s (613%)

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
2012-10-01 15:19:05 -04:00
Josef Bacik
ca7e70f590 Btrfs: do not needlessly restart the transaction for enospc
We will stop and restart a transaction every time we move to a different leaf
when truncating a file.  This is for enospc reasons, but really we could
probably get away with doing this a little better by actually working until we
hit an ENOSPC.  So add a ->failfast flag to the block_rsv and set it when we do
truncates which will fail as soon as the block rsv runs out of space, and then
at that point we can stop and restart the transaction and refill the block rsv
and carry on.  This will make rm'ing of a file with lots of extents a bit
faster.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01 15:19:04 -04:00
Liu Bo
06d3d22b45 Btrfs: cleanup extents after we finish logging inode
This is based on Josef's "Btrfs: turbo charge fsync".

We should cleanup those extents after we've finished logging inode,
otherwise we may do redundant work on them.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
2012-10-01 15:19:04 -04:00
Josef Bacik
0fa83cdb1d Btrfs: only warn if we hit an error when doing the tree logging
I hit this a couple times while working on my fsync patch (all my bugs, not
normal operation), but with my new stuff we could have new errors from cases
I have not encountered, so instead of BUG()'ing we should be WARN()'ing so
that we are notified there is a problem but the user doesn't lose their
data.  We can easily commit the transaction in the case that the tree
logging fails and still be fine, so let's try and be as nice to the user as
possible.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01 15:19:03 -04:00
Josef Bacik
5dc562c541 Btrfs: turbo charge fsync
At least for the vm workload.  Currently on fsync we will

1) Truncate all items in the log tree for the given inode if they exist

and

2) Copy all items for a given inode into the log

The problem with this is that for things like VMs you can have lots of
extents from the fragmented writing behavior, and worst yet you may have
only modified a few extents, not the entire thing.  This patch fixes this
problem by tracking which transid modified our extent, and then when we do
the tree logging we find all of the extents we've modified in our current
transaction, sort them and commit them.  We also only truncate up to the
xattrs of the inode and copy that stuff in normally, and then just drop any
extents in the range we have that exist in the log already.  Here are some
numbers of a 50 meg fio job that does random writes and fsync()s after every
write

		Original	Patched
SATA drive	82KB/s		140KB/s
Fusion drive	431KB/s		2532KB/s

So around 2-6 times faster depending on your hardware.  There are a few
corner cases, for example if you truncate at all we have to do it the old
way since there is no way to be sure what is in the log is ok.  This
probably could be done smarter, but if you write-fsync-truncate-write-fsync
you deserve what you get.  All this work is in RAM of course so if your
inode gets evicted from cache and you read it in and fsync it we'll do it
the slow way if we are still in the same transaction that we last modified
the inode in.

The biggest cool part of this is that it requires no changes to the recovery
code, so if you fsync with this patch and crash and load an old kernel, it
will run the recovery and be a-ok.  I have tested this pretty thoroughly
with an fsync tester and everything comes back fine, as well as xfstests.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01 15:19:03 -04:00
Josef Bacik
224ecce517 Btrfs: fix possible corruption when fsyncing written prealloced extents
While working on my fsync patch my fsync tester kept hitting mismatching
md5sums when I would randomly write to a prealloc'ed region, syncfs() and
then write to the prealloced region some more and then fsync() and then
immediately reboot.  This is because the tree logging code will skip writing
csums for file extents who's generation is less than the current running
transaction.  When we mark extents as written we haven't been updating their
generation so they were always being skipped.  This wouldn't happen if you
were to preallocate and then write in the same transaction, but if you for
example prealloced a VM you could definitely run into this problem.  This
patch makes my fsync tester happy again.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01 15:19:02 -04:00
Josef Bacik
54338b5cc4 Btrfs: do not allocate chunks as agressively
Swinging this pendulum back the other way.  We've been allocating chunks up
to 2% of the disk no matter how much we actually have allocated.  So instead
fix this calculation to only allocate chunks if we have more than 80% of the
space available allocated.  Please test this as it will likely cause all
sorts of ENOSPC problems to pop up suddenly.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01 15:19:02 -04:00
Josef Bacik
7c735313bd Btrfs: update last trans if we don't update the inode
There is a completely impossible situation to hit where you can preallocate
a file, fsync it, write into the preallocated region, have the transaction
commit twice and then fsync and then immediately lose power and lose all of
the contents of the write.  This patch fixes this just so I feel better
about the situation and because it is lightweight, we just update the
last_trans when we finish an ordered IO and we don't update the inode
itself.  This way we are completely safe and I feel better.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01 15:19:02 -04:00
Jan Schmidt
995e01b7af Btrfs: fix gcc warnings for 32bit compiles
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-10-01 15:19:01 -04:00
Chris Mason
74dd17fbe3 Btrfs: fix btrfs send for inline items and compression
The btrfs send code was assuming the offset of the file item into the
extent translated to bytes on disk.  If we're compressed, this isn't
true, and so it was off into extents owned by other files.

It was also improperly handling inline extents.  This solves a crash
where we may have gone past the end of the file extent item by not
testing early enough for an inline extent.  It also solves problems
where we have a whole between the end of the inline item and the start
of the full extent.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-10-01 15:19:00 -04:00
Alexander Block
6d85ed05e1 Btrfs: don't treat top/root directory inode as deleted/reused
We can't do the deleted/reused logic for top/root inodes as it would
create a stream that tries to delete and recreate the root dir.

Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
2012-10-01 15:19:00 -04:00
Alexander Block
2981e225f7 Btrfs: ignore non-FS inodes for send/receive
We have to ignore inode/space cache objects in send/receive.

Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
2012-10-01 15:18:59 -04:00
Alexander Block
2f28f4787c Btrfs: pass root instead of parent_root to iterate_inode_ref
We need to pass the root that we determined earlier to iterate_inode_ref.

Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
2012-10-01 15:18:58 -04:00
Alexander Block
d8347fa444 Btrfs: use <= instead of < in is_extent_unchanged
Used the wrong compare operator here.

Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
2012-10-01 15:18:58 -04:00
Alexander Block
3954096d4b Btrfs: fix check for changed extent in is_extent_unchanged
The previous check was working fine, but this check should be
easier to read. Also, we could theoritically have some exotic
bugs with the previous checks.

Signed-off-by: Alexander Block <ablock84@googlemail.com>
2012-10-01 15:18:57 -04:00
Alexander Block
5dc67d0ba9 Btrfs: free nce and nce_head on error in name_cache_insert
Both were leaked in case of error.

Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
2012-10-01 15:18:56 -04:00
Alexander Block
3e126f32f8 Btrfs: remove unused tmp_path from iterate_dir_item
A leftover from older code and unused now.

Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
2012-10-01 15:18:55 -04:00
Alexander Block
e938c8ad54 Btrfs: code cleanups for send/receive
Doing some code cleanups as suggested by Arne.
Changes do not change any logic.

Signed-off-by: Alexander Block <ablock84@googlemail.com>
2012-10-01 15:18:55 -04:00
Alexander Block
766702ef49 Btrfs: add/fix comments/documentation for send/receive
As the subject already said, add/fix comments.

Signed-off-by: Alexander Block <ablock84@googlemail.com>
2012-10-01 15:18:54 -04:00
Alexander Block
e479d9bb5f Btrfs: update send_progress at correct places
Updating send_progress in process_recorded_refs was not correct.
It got updated too early in the cur_inode_new_gen case.

Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
Reported-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
2012-10-01 15:18:53 -04:00
Alexander Block
34d73f54e2 Btrfs: make aux field of ulist 64 bit
Btrfs send/receive uses the aux field to store inode numbers. On
32 bit machines this may become a problem.

Also fix all users of ulist_add and ulist_add_merged.

Reported-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
2012-10-01 15:18:53 -04:00
Alexander Block
7e0926fe5f Btrfs: fix use of radix_tree for name_cache in send/receive
We can't easily use the index of the radix tree for inums as the
radix tree uses 32bit indexes on 32bit kernels. For 32bit kernels,
we now use the lower 32bit of the inum as index and an additional
list to store multiple entries per radix tree entry.

Reported-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
2012-10-01 15:18:52 -04:00
Alexander Block
17589bd96e Btrfs: fix memory leak for name_cache in send/receive
When everything is done, name_cache_free is called which however
forgot to call kfree on the cache entries.

Signed-off-by: Alexander Block <ablock84@googlemail.com>
2012-10-01 15:18:51 -04:00
Alexander Block
adbe7fb6c4 Btrfs: don't break in the final loop of find_extent_clone
If we break, we may miss the clone from send_root which we prefer
over all other clones.

Commit is a result of Arne's review.

Reported-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
2012-10-01 15:18:50 -04:00
Alexander Block
52f9e53ede Btrfs: use normal return path for root == send_root case
Don't have a seperate return path for the mentioned case. Now
we do the same "take lowest inode/offset" logic for all found clones.

Commit is a result of Arne's review.

Signed-off-by: Alexander Block <ablock84@googlemail.com>
2012-10-01 15:18:50 -04:00
Alexander Block
35075bb046 Btrfs: use kmalloc instead of stack for backref_ctx
Make sure to never get in trouble due to the backref_ctx
which was on the stack before.

Commit is a result of Arne's review.

Signed-off-by: Alexander Block <ablock84@googlemail.com>
2012-10-01 15:18:49 -04:00
Alexander Block
ee849c0472 Btrfs: rename backref_ctx::found_in_send_root to found_itself
The new name should be easier to understand/read.

Commit is a result of Arne's review.

Signed-off-by: Alexander Block <ablock84@googlemail.com>
2012-10-01 15:18:48 -04:00
Alexander Block
d27aed5e24 Btrfs: remove unused use_list from send/receive code
use_list is a leftover and unused.

Signed-off-by: Alexander Block <ablock84@googlemail.com>
2012-10-01 15:18:48 -04:00
Alexander Block
ccf1626b49 Btrfs: add correct parent to check_dirs when dir got moved
We only added the parent for the new position of a moved dir.
We also need to add the old parent of the moved dir.

Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
2012-10-01 15:18:47 -04:00
Alexander Block
9ea3ef516d Btrfs: remove unused code with #if 0
fs_path_remove is not used at the moment due to a previous patch.
Remove it for now (with #if 0) to avoid compile warnings.

Signed-off-by: Alexander Block <ablock84@googlemail.com>
2012-10-01 15:18:46 -04:00
Alexander Block
b9291affaa Btrfs: add missing check for dir != tmp_dir to is_first_ref
We missed that check which resultet in all refs with the same name
being reported as first_ref.

Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
2012-10-01 15:18:45 -04:00
Alexander Block
1f4692da95 Btrfs: fix cur_ino < parent_ino case for send/receive
When the current inodes inum is smaller then the inum of the
parent directory strange things were happending due to wrong
path resolution and other bugs. Fix this with a new approach
for the problem.

Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
2012-10-01 15:18:45 -04:00
Alexander Block
85a7b33b96 Btrfs: add rdev to get_inode_info in send/receive
We need rdev in the next commit.

Signed-off-by: Alexander Block <ablock84@googlemail.com>
2012-10-01 15:18:44 -04:00
Linus Torvalds
99dbb1632f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull the trivial tree from Jiri Kosina:
 "Tiny usual fixes all over the place"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits)
  doc: fix old config name of kprobetrace
  fs/fs-writeback.c: cleanup riteback_sb_inodes kerneldoc
  btrfs: fix the commment for the action flags in delayed-ref.h
  btrfs: fix trivial typo for the comment of BTRFS_FREE_INO_OBJECTID
  vfs: fix kerneldoc for generic_fh_to_parent()
  treewide: fix comment/printk/variable typos
  ipr: fix small coding style issues
  doc: fix broken utf8 encoding
  nfs: comment fix
  platform/x86: fix asus_laptop.wled_type module parameter
  mfd: printk/comment fixes
  doc: getdelays.c: remember to close() socket on error in create_nl_socket()
  doc: aliasing-test: close fd on write error
  mmc: fix comment typos
  dma: fix comments
  spi: fix comment/printk typos in spi
  Coccinelle: fix typo in memdup_user.cocci
  tmiofb: missing NULL pointer checks
  tools: perf: Fix typo in tools/perf
  tools/testing: fix comment / output typos
  ...
2012-10-01 09:06:36 -07:00
Al Viro
2903ff019b switch simple cases of fget_light to fdget
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26 22:20:08 -04:00
Al Viro
8319aa9127 switch btrfs_ioctl_clone() to fget_light()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26 21:10:09 -04:00
Al Viro
ecd188159e switch btrfs_ioctl_snap_create_transid() to fget_light()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26 21:10:07 -04:00
Eric W. Biederman
2f2f43d3c7 userns: Convert btrfs to use kuid/kgid where appropriate
Cc: Chris Mason <chris.mason@fusionio.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-09-21 03:13:31 -07:00
Wang Sheng-Hui
44a075bde9 btrfs: fix the commment for the action flags in delayed-ref.h
The action field has been merged into struct btrfs_delayed_ref_node,
and no struct btrfs_delayed_ref is available now.

Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2012-09-21 08:07:12 +02:00
Eric W. Biederman
5f3a4a28ec userns: Pass a userns parameter into posix_acl_to_xattr and posix_acl_from_xattr
- Pass the user namespace the uid and gid values in the xattr are stored
   in into posix_acl_from_xattr.

 - Pass the user namespace kuid and kgid values should be converted into
   when storing uid and gid values in an xattr in posix_acl_to_xattr.

- Modify all callers of posix_acl_from_xattr and posix_acl_to_xattr to
  pass in &init_user_ns.

In the short term this change is not strictly needed but it makes the
code clearer.  In the longer term this change is necessary to be able to
mount filesystems outside of the initial user namespace that natively
store posix acls in the linux xattr format.

Cc: Theodore Tso <tytso@mit.edu>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2012-09-18 01:01:35 -07:00
Linus Torvalds
6167f81fd1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull a btrfs revert from Chris Mason:
 "My for-linus branch has one revert in the new quota code.

  We're building up more fixes at etc for the next merge window, but I'm
  keeping them out unless they are bigger regressions or have a huge
  impact."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Revert "Btrfs: fix some error codes in btrfs_qgroup_inherit()"
2012-09-16 12:58:44 -07:00
Chris Mason
f3a87f1b0c Revert "Btrfs: fix some error codes in btrfs_qgroup_inherit()"
This reverts commit 5986802c2f.

Both paths are not error paths but regular cases where non-qgroup
subvols are involved.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-09-14 20:06:30 -04:00
Wang Sheng-Hui
527a136138 btrfs: fix trivial typo for the comment of BTRFS_FREE_INO_OBJECTID
It should be storing, not sotring.

Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2012-09-06 11:11:14 +02:00
Liu Bo
8bad3c0244 btrfs: fix comment typo in btrfs_finish_ordered_io
Fix typo errors in comments of btrfs_finish_ordered_io.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2012-09-01 08:27:57 -07:00
Linus Torvalds
318e151019 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "I've split out the big send/receive update from my last pull request
  and now have just the fixes in my for-linus branch.  The send/recv
  branch will wander over to linux-next shortly though.

  The largest patches in this pull are Josef's patches to fix DIO
  locking problems and his patch to fix a crash during balance.  They
  are both well tested.

  The rest are smaller fixes that we've had queued.  The last rc came
  out while I was hacking new and exciting ways to recover from a
  misplaced rm -rf on my dev box, so these missed rc3."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (25 commits)
  Btrfs: fix that repair code is spuriously executed for transid failures
  Btrfs: fix ordered extent leak when failing to start a transaction
  Btrfs: fix a dio write regression
  Btrfs: fix deadlock with freeze and sync V2
  Btrfs: revert checksum error statistic which can cause a BUG()
  Btrfs: remove superblock writing after fatal error
  Btrfs: allow delayed refs to be merged
  Btrfs: fix enospc problems when deleting a subvol
  Btrfs: fix wrong mtime and ctime when creating snapshots
  Btrfs: fix race in run_clustered_refs
  Btrfs: don't run __tree_mod_log_free_eb on leaves
  Btrfs: increase the size of the free space cache
  Btrfs: barrier before waitqueue_active
  Btrfs: fix deadlock in wait_for_more_refs
  btrfs: fix second lock in btrfs_delete_delayed_items()
  Btrfs: don't allocate a seperate csums array for direct reads
  Btrfs: do not strdup non existent strings
  Btrfs: do not use missing devices when showing devname
  Btrfs: fix that error value is changed by mistake
  Btrfs: lock extents as we map them in DIO
  ...
2012-08-29 11:36:22 -07:00
Stefan Behrens
256dd1bb37 Btrfs: fix that repair code is spuriously executed for transid failures
If verify_parent_transid() fails for all mirrors, the current code
calls repair_io_failure() anyway which means:
- that the disk block is rewritten without repairing anything and
- that a kernel log message is printed which misleadingly claims
  that a read error was corrected.

This is an example:
parent transid verify failed on 615015833600 wanted 110423 found 110424
parent transid verify failed on 615015833600 wanted 110423 found 110424
btrfs read error corrected: ino 1 off 615015833600 (dev /dev/...)

It is wrong to ignore the results from verify_parent_transid() and to
call repair_eb_io_failure() when the verification of the transids failed.
This commit fixes the issue.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-08-28 16:53:43 -04:00
Liu Bo
d280e5be94 Btrfs: fix ordered extent leak when failing to start a transaction
We cannot just return error before freeing ordered extent and releasing reserved
space when we fail to start a transacion.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-08-28 16:53:42 -04:00
Liu Bo
24c03fa5cf Btrfs: fix a dio write regression
This bug is introduced by commit 3b8bde746f6f9bd36a9f05f5f3b6e334318176a9
(Btrfs: lock extents as we map them in DIO).

In dio write, we should unlock the section which we didn't do IO on in case that
we fall back to buffered write.  But we need to not only unlock the section
but also cleanup reserved space for the section.

This bug was found while running xfstests 133, with this 133 no longer complains.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-08-28 16:53:41 -04:00
Josef Bacik
bd7de2c9a4 Btrfs: fix deadlock with freeze and sync V2
We can deadlock with freeze right now because we unconditionally start a
transaction in our ->sync_fs() call.  To fix this just check and see if we
have a running transaction to commit.  This saves us from the deadlock
because at this point we'll have the umount sem for the sb so we're safe
from freezes coming in after we've done our check.  With this patch the
freeze xfstests no longer deadlocks.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-08-28 16:53:40 -04:00
Stefan Behrens
5ee0844d64 Btrfs: revert checksum error statistic which can cause a BUG()
Commit 442a4f6308 added btrfs device
statistic counters for detected IO and checksum errors to Linux 3.5.
The statistic part that counts checksum errors in
end_bio_extent_readpage() can cause a BUG() in a subfunction:
"kernel BUG at fs/btrfs/volumes.c:3762!"
That part is reverted with the current patch.
However, the counting of checksum errors in the scrub context remains
active, and the counting of detected IO errors (read, write or flush
errors) in all contexts remains active.

Cc: stable <stable@vger.kernel.org> # 3.5
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-08-28 16:53:39 -04:00
Stefan Behrens
68ce9682a4 Btrfs: remove superblock writing after fatal error
With commit acce952b0, btrfs was changed to flag the filesystem with
BTRFS_SUPER_FLAG_ERROR and switch to read-only mode after a fatal
error happened like a write I/O errors of all mirrors.
In such situations, on unmount, the superblock is written in
btrfs_error_commit_super(). This is done with the intention to be able
to evaluate the error flag on the next mount. A warning is printed
in this case during the next mount and the log tree is ignored.

The issue is that it is possible that the superblock points to a root
that was not written (due to write I/O errors).
The result is that the filesystem cannot be mounted. btrfsck also does
not start and all the other btrfs-progs tools fail to start as well.
However, mount -o recovery is working well and does the right things
to recover the filesystem (i.e., don't use the log root, clear the
free space cache and use the next mountable root that is stored in the
root backup array).

This patch removes the writing of the superblock when
BTRFS_SUPER_FLAG_ERROR is set, and removes the handling of the error
flag in the mount function.

These lines can be used to reproduce the issue (using /dev/sdm):
SCRATCH_DEV=/dev/sdm
SCRATCH_MNT=/mnt
echo 0 25165824 linear $SCRATCH_DEV 0 | dmsetup create foo
ls -alLF /dev/mapper/foo
mkfs.btrfs /dev/mapper/foo
mount /dev/mapper/foo $SCRATCH_MNT
echo bar > $SCRATCH_MNT/foo
sync
echo 0 25165824 error | dmsetup reload foo
dmsetup resume foo
ls -alF $SCRATCH_MNT
touch $SCRATCH_MNT/1
ls -alF $SCRATCH_MNT
sleep 35
echo 0 25165824 linear $SCRATCH_DEV 0 | dmsetup reload foo
dmsetup resume foo
sleep 1
umount $SCRATCH_MNT
btrfsck /dev/mapper/foo
dmsetup remove foo

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-08-28 16:53:38 -04:00
Josef Bacik
ae1e206b80 Btrfs: allow delayed refs to be merged
Daniel Blueman reported a bug with fio+balance on a ramdisk setup.
Basically what happens is the balance relocates a tree block which will drop
the implicit refs for all of its children and adds a full backref.  Once the
block is relocated we have to add the implicit refs back, so when we cow the
block again we add the implicit refs for its children back.  The problem
comes when the original drop ref doesn't get run before we add the implicit
refs back.  The delayed ref stuff will specifically prefer ADD operations
over DROP to keep us from freeing up an extent that will have references to
it, so we try to add the implicit ref before it is actually removed and we
panic.  This worked fine before because the add would have just canceled the
drop out and we would have been fine.  But the backref walking work needs to
be able to freeze the delayed ref stuff in time so we have this ever
increasing sequence number that gets attached to all new delayed ref updates
which makes us not merge refs and we run into this issue.

So to fix this we need to merge delayed refs.  So everytime we run a
clustered ref we need to try and merge all of its delayed refs.  The backref
walking stuff locks the delayed ref head before processing, so if we have it
locked we are safe to merge any refs inside of the sequence number.  If
there is no sequence number we can merge all refs.  Doing this not only
fixes our bug but keeps the delayed ref code from adding and removing
useless refs and batching together multiple refs into one search instead of
one search per delayed ref, which will really help our commit times.  I ran
this with Daniels test and 276 and I haven't seen any problems.  Thanks,

Reported-by: Daniel J Blueman <daniel@quora.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-08-28 16:53:38 -04:00
Josef Bacik
5a24e84c55 Btrfs: fix enospc problems when deleting a subvol
Subvol delete is a special kind of awful where we use the global reserve to
cover the ENOSPC requirements.  The problem is once we're done removing
everything we do a btrfs_update_inode(), which by default will try to do the
delayed update stuff which will use it's own reserve.  There will be no
space in this reserve and we'll return ENOSPC.  So instead use
btrfs_update_inode_fallback() which will just fallback to updating the inode
item in the case of enospc.  This is fine because the global reserve covers
the space requirements for this.  With this patch I can now delete a subvol
on a problem image Dave Sterba sent me.  Thanks,

Reported-by: David Sterba <dave@jikos.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-08-28 16:53:37 -04:00
Miao Xie
c0f62dedd0 Btrfs: fix wrong mtime and ctime when creating snapshots
When we created a new snapshot, the mtime and ctime of its parent directory
were not updated. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-08-28 16:53:36 -04:00
Arne Jansen
22cd2e7de7 Btrfs: fix race in run_clustered_refs
With commit

commit d1270cd91f
Author: Arne Jansen <sensille@gmx.net>
Date:   Tue Sep 13 15:16:43 2011 +0200

     Btrfs: put back delayed refs that are too new

I added a window where the delayed_ref's head->ref_mod code can diverge
from the sum of the remaining refs, because we release the head->mutex
in the middle. This leads to btrfs_lookup_extent_info returning wrong
numbers. This patch fixes this by adjusting the head's ref_mod with each
delayed ref we run.

Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-08-28 16:53:35 -04:00
Chris Mason
b12a3b1ea2 Btrfs: don't run __tree_mod_log_free_eb on leaves
When we split a leaf, we may end up inserting a new root on top of that
leaf.  The reflog code was incorrectly assuming the old root was always
a node.  This makes sure we skip over leaves.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-08-28 16:53:34 -04:00
Josef Bacik
6fc823b10f Btrfs: increase the size of the free space cache
Arne was complaining about the space cache having mismatching generation
numbers when debugging a deadlock.  This is because we can run out of space
in our preallocated range for our space cache if you have a pretty
fragmented amount of space in your pinned space.  So just increase the
amount of space we preallocate for space cache so we can be sure to have
enough space.  This will only really affect data ranges since their the only
chunks that end up larger than 256MB.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-08-28 16:53:34 -04:00
Josef Bacik
66657b318e Btrfs: barrier before waitqueue_active
We need a barrir before calling waitqueue_active otherwise we will miss
wakeups.  So in places that do atomic_dec(); then atomic_read() use
atomic_dec_return() which imply a memory barrier (see memory-barriers.txt)
and then add an explicit memory barrier everywhere else that need them.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-08-28 16:53:33 -04:00
Arne Jansen
1fa11e265f Btrfs: fix deadlock in wait_for_more_refs
Commit a168650c introduced a waiting mechanism to prevent busy waiting in
btrfs_run_delayed_refs. This can deadlock with btrfs_run_ordered_operations,
where a tree_mod_seq is held while waiting for the io to complete, while
the end_io calls btrfs_run_delayed_refs.
This whole mechanism is unnecessary. If not enough runnable refs are
available to satisfy count, just return as count is more like a guideline
than a strict requirement.
In case we have to run all refs, commit transaction makes sure that no
other threads are working in the transaction anymore, so we just assert
here that no refs are blocked.

Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-08-28 16:53:32 -04:00
Fengguang Wu
6209526531 btrfs: fix second lock in btrfs_delete_delayed_items()
Fix a real bug caught by coccinelle.

fs/btrfs/delayed-inode.c:1013:1-11: second lock on line 1013

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
2012-08-28 16:53:31 -04:00
Josef Bacik
c329861da4 Btrfs: don't allocate a seperate csums array for direct reads
We've been allocating a big array for csums instead of storing them in the
io_tree like we do for buffered reads because previously we were locking the
entire range, so we didn't have an extent state for each sector of the
range.  But now that we do the range locking as we map the buffers we can
limit the mapping lenght to sectorsize and use the private part of the
io_tree for our csums.  This allows us to avoid an extra memory allocation
for direct reads which could incur latency.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-08-28 16:53:30 -04:00
Josef Bacik
99f5944b84 Btrfs: do not strdup non existent strings
When we close devices we add back empty devices for some reason that escapes
me.  In the case of a missing dev we don't allocate an rcu_string for it's
name, so check to see if the device has a name and if it doesn't don't
bother strdup()'ing it.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-08-28 16:53:29 -04:00
Josef Bacik
aa9ddcd4b5 Btrfs: do not use missing devices when showing devname
If you do the following

mkfs.btrfs /dev/sdb /dev/sdc
rmmod btrfs
dd if=/dev/zero of=/dev/sdb bs=1M count=1
mount -o degraded /dev/sdc /mnt/btrfs-test

the box will panic trying to deref the name for the missing dev since it is
the lower numbered devid.  So fix show_devname to not use missing devices.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-08-28 16:53:29 -04:00
Stefan Behrens
3627bf4503 Btrfs: fix that error value is changed by mistake
In iterate_inodes_from_logical() the error result from
extent_from_logical() is patched by mistake. Typically ENOENT is
patched to EINVAL because (-ENOENT & BTRFS_EXTENT_FLAG_TREE_BLOCK)
evaluates to true.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
2012-08-28 16:53:28 -04:00
Josef Bacik
eb838e73dc Btrfs: lock extents as we map them in DIO
A deadlock in xfstests 113 was uncovered by commit

d187663ef2

This is because we would not return EIOCBQUEUED for short AIO reads, instead
we'd wait for the DIO to complete and then return the amount of data we
transferred, which would allow our stuff to unlock the remaning amount.  But
with this change this no longer happens, so if we have a short AIO read (for
example if we try to read past EOF), we could leave the section from EOF to
the end of where we tried to read locked.  Fixing this is tricky since there
is no clear way to know exactly how much data DIO truly submitted for IO, so
to make this less hard on ourselves and less combersome we need to lock the
extents as we try to map them, and then we unlock any areas we didn't
actually map.  This makes us completely safe from deadlocks and reliance on
a particular behavior of the DIO code.  This also lays the groundwork for
allowing us to use the normal csum storage method for reads which means we
can remove an allocation.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-08-28 16:53:27 -04:00
Dan Carpenter
dadd1105ca Btrfs: fix some endian bugs handling the root times
"trans->transid" is cpu endian but we want to store the data as little
endian.  "item->ctime.nsec" is only 32 bits, not 64.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
2012-08-28 16:53:26 -04:00
Dan Carpenter
55e591ffde Btrfs: unlock on error in btrfs_delalloc_reserve_metadata()
We should release this mutex before returning the error code.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
2012-08-28 16:53:25 -04:00
Dan Carpenter
57a5a88203 Btrfs: checking for NULL instead of IS_ERR
add_qgroup_rb() never returns NULL, only error pointers.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
2012-08-28 16:53:25 -04:00
Dan Carpenter
5986802c2f Btrfs: fix some error codes in btrfs_qgroup_inherit()
These are returning zero when it should be returning a negative error
code.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
2012-08-28 16:53:24 -04:00
Stefan Behrens
aa2ffd0616 Btrfs: fix a misplaced address operator in a condition
This should obviously not be "if (&flag)" but "if (flag)".

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
2012-08-28 16:53:23 -04:00
Linus Torvalds
15fc5deb1f Merge branch 'for-linus-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs merge fix from Chris Mason:
 "This fixes a merge error in rc1.  The calls to mnt_want_write should
  have been removed."

* 'for-linus-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: remove mnt_want_write call in btrfs_mksubvol
2012-08-12 21:28:41 +03:00
Alexander Block
e00da2067b Btrfs: remove mnt_want_write call in btrfs_mksubvol
We got a recursive lock in mksubvol because the caller already held
a lock. I think we got into this due to a merge error. Commit a874a63
removed the mnt_want_write call from btrfs_mksubvol and added a
replacement call to mnt_want_write_file in btrfs_ioctl_snap_create_transid.
Commit e7848683 however tried to move all calls to mnt_want_write above
i_mutex. So somewhere while merging this, it got mixed up. The
solution is to remove the mnt_want_write call completely from
mksubvol.

Reported-by: David Sterba <dave@jikos.cz>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-08-09 11:01:54 -04:00
Artem Bityutskiy
b257031408 btrfs: nuke pdflush from comments
The pdflush thread is long gone, so this patch removes references to pdflush
from btrfs comments.

Cc: Chris Mason <chris.mason@fusionio.com>
Cc: linux-btrfs@vger.kernel.org
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-08-04 12:15:35 +04:00
Artem Bityutskiy
34eaadaf22 btrfs: nuke write_super from comments
The '->write_super' superblock method is gone, and this patch removes all the
references to 'write_super' from btrfs.

Cc: Chris Mason <chris.mason@fusionio.com>
Cc: linux-btrfs@vger.kernel.org
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-08-04 12:15:35 +04:00
Linus Torvalds
a0e881b7c1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull second vfs pile from Al Viro:
 "The stuff in there: fsfreeze deadlock fixes by Jan (essentially, the
  deadlock reproduced by xfstests 068), symlink and hardlink restriction
  patches, plus assorted cleanups and fixes.

  Note that another fsfreeze deadlock (emergency thaw one) is *not*
  dealt with - the series by Fernando conflicts a lot with Jan's, breaks
  userland ABI (FIFREEZE semantics gets changed) and trades the deadlock
  for massive vfsmount leak; this is going to be handled next cycle.
  There probably will be another pull request, but that stuff won't be
  in it."

Fix up trivial conflicts due to unrelated changes next to each other in
drivers/{staging/gdm72xx/usb_boot.c, usb/gadget/storage_common.c}

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits)
  delousing target_core_file a bit
  Documentation: Correct s_umount state for freeze_fs/unfreeze_fs
  fs: Remove old freezing mechanism
  ext2: Implement freezing
  btrfs: Convert to new freezing mechanism
  nilfs2: Convert to new freezing mechanism
  ntfs: Convert to new freezing mechanism
  fuse: Convert to new freezing mechanism
  gfs2: Convert to new freezing mechanism
  ocfs2: Convert to new freezing mechanism
  xfs: Convert to new freezing code
  ext4: Convert to new freezing mechanism
  fs: Protect write paths by sb_start_write - sb_end_write
  fs: Skip atime update on frozen filesystem
  fs: Add freezing handling to mnt_want_write() / mnt_drop_write()
  fs: Improve filesystem freezing handling
  switch the protection of percpu_counter list to spinlock
  nfsd: Push mnt_want_write() outside of i_mutex
  btrfs: Push mnt_want_write() outside of i_mutex
  fat: Push mnt_want_write() outside of i_mutex
  ...
2012-08-01 10:26:23 -07:00
Jan Kara
b2b5ef5c8e btrfs: Convert to new freezing mechanism
We convert btrfs_file_aio_write() to use new freeze check.  We also add proper
freeze protection to btrfs_page_mkwrite(). We also add freeze protection to
the transaction mechanism to avoid starting transactions on frozen filesystem.
At minimum this is necessary to stop iput() of unlinked file to change frozen
filesystem during truncation.

Checks in cleaner_kthread() and transaction_kthread() can be safely removed
since btrfs_freeze() will lock the mutexes and thus block the threads (and they
shouldn't have anything to do anyway).

CC: linux-btrfs@vger.kernel.org
CC: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-31 09:45:52 +04:00
Joe Perches
533574c6bc btrfs: use printk_get_level and printk_skip_level, add __printf, fix fallout
Use the generic printk_get_level() to search a message for a kern_level.

Add __printf to verify format and arguments.  Fix a few messages that
had mismatches in format and arguments.  Add #ifdef CONFIG_PRINTK blocks
to shrink the object size a bit when not using printk.

[akpm@linux-foundation.org: whitespace tweak]
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 17:25:14 -07:00
Jan Kara
e7848683ae btrfs: Push mnt_want_write() outside of i_mutex
When mnt_want_write() starts to handle freezing it will get a full lock
semantics requiring proper lock ordering. So push mnt_want_write() call
consistently outside of i_mutex.

CC: Chris Mason <chris.mason@oracle.com>
CC: linux-btrfs@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-31 01:02:51 +04:00
Stephen Rothwell
a1857ebe75 Btrfs: using vmalloc and friends needs vmalloc.h
On powerpc, we don't get the implicit vmalloc.h include, and as a result
the build fails noisily:

  fs/btrfs/send.c: In function 'fs_path_free':
  fs/btrfs/send.c:185:4: error: implicit declaration of function 'vfree' [-Werror=implicit-function-declaration]
  fs/btrfs/send.c: In function 'fs_path_ensure_buf':
  fs/btrfs/send.c:215:4: error: implicit declaration of function 'vmalloc' [-Werror=implicit-function-declaration]
  fs/btrfs/send.c:215:12: warning: assignment makes pointer from integer without a cast [enabled by default]
  fs/btrfs/send.c:225:12: warning: assignment makes pointer from integer without a cast [enabled by default]
  fs/btrfs/send.c:233:13: warning: assignment makes pointer from integer without a cast [enabled by default]
  fs/btrfs/send.c: In function 'iterate_dir_item':
  fs/btrfs/send.c:900:10: warning: assignment makes pointer from integer without a cast [enabled by default]
  fs/btrfs/send.c:909:11: warning: assignment makes pointer from integer without a cast [enabled by default]
  fs/btrfs/send.c: In function 'btrfs_ioctl_send':
  fs/btrfs/send.c:4463:17: warning: assignment makes pointer from integer without a cast [enabled by default]
  fs/btrfs/send.c:4469:17: warning: assignment makes pointer from integer without a cast [enabled by default]
  fs/btrfs/send.c:4475:2: error: implicit declaration of function 'vzalloc' [-Werror=implicit-function-declaration]
  fs/btrfs/send.c:4475:20: warning: assignment makes pointer from integer without a cast [enabled by default]
  fs/btrfs/send.c:4483:21: warning: assignment makes pointer from integer without a cast [enabled by default]

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-26 18:08:30 -07:00
Linus Torvalds
e2aed8dfa5 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull large btrfs update from Chris Mason:
 "This pull request is very large, and the two main features in here
  have been under testing/devel for quite a while.

  We have subvolume quotas from the strato developers.  This enables
  full tracking of how many blocks are allocated to each subvolume (and
  all snapshots) and you can set limits on a per-subvolume basis.  You
  can also create quota groups and toss multiple subvolumes into a big
  group.  It's everything you need to be a web hosting company and give
  each user their own subvolume.

  The userland side of the quotas is being refreshed, they'll send out
  details on where to grab it soon.

  Next is the kernel side of btrfs send/receive from Alexander Block.
  This leverages the same infrastructure as the quota code to figure out
  relationships between blocks and their owners.  It can then compute
  the difference between two snapshots and sends the diffs in a neutral
  format into userland.

  The basic model:

        create a snapshot
        send that snapshot as the initial backup
        make changes
        create a second snapshot
        send the incremental as a backup
        delete the first snapshot
        (use the second snapshot for the next incremental)

  The receive portion is all in userland, and in the 'next' branch of my
  btrfs-progs repo.

  There's still some work to do in terms of optimizing the send side
  from kernel to userland.  The really important part is figuring out
  how two snapshots are different, and this is where we are
  concentrating right now.  The initial send of a dataset is a little
  slower than tar, but the incremental sends are dramatically faster
  than what rsync can do.

  On top of all of that, we have a nice queue of fixes, cleanups and
  optimizations."

Fix up trivial modify/del conflict in fs/btrfs/ioctl.c

Also fix up semantic conflict in fs/btrfs/send.c: the interface to
dentry_open() changed in commit 765927b2d5 ("switch dentry_open() to
struct path, make it grab references itself"), and since it now grabs
whatever references it needs, we should no longer do the mntget() on the
mnt (and we need to dput() the dentry reference we took).

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (65 commits)
  Btrfs: uninit variable fixes in send/receive
  Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive
  Btrfs: add btrfs_compare_trees function
  Btrfs: introduce subvol uuids and times
  Btrfs: make iref_to_path non static
  Btrfs: add a barrier before a waitqueue_active check
  Btrfs: call the ordered free operation without any locks held
  Btrfs: Check INCOMPAT flags on remount and add helper function
  Btrfs: add helper for tree enumeration
  btrfs: allow cross-subvolume file clone
  Btrfs: improve multi-thread buffer read
  Btrfs: make btrfs's allocation smoothly with preallocation
  Btrfs: lock the transition from dirty to writeback for an eb
  Btrfs: fix potential race in extent buffer freeing
  Btrfs: don't return true in releasepage unless we actually freed the eb
  Btrfs: suppress printk() if all device I/O stats are zero
  Btrfs: remove unwanted printk() for btrfs device I/O stats
  Btrfs: rewrite BTRFS_SETGET_FUNCS
  Btrfs: zero unused bytes in inode item
  Btrfs: kill free_space pointer from inode structure
  ...

Conflicts:
	fs/btrfs/ioctl.c
2012-07-26 14:48:55 -07:00
Chris Mason
b24baf6917 Btrfs: uninit variable fixes in send/receive
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-07-25 19:21:10 -04:00
Chris Mason
113c1cb530 Merge branch 'send-v2' of git://github.com/ablock84/linux-btrfs into for-linus
This is the kernel portion of btrfs send/receive

Conflicts:
	fs/btrfs/Makefile
	fs/btrfs/backref.h
	fs/btrfs/ctree.c
	fs/btrfs/ioctl.c
	fs/btrfs/ioctl.h

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-07-25 19:19:10 -04:00
Alexander Block
31db9f7c23 Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive
This patch introduces the BTRFS_IOC_SEND ioctl that is
required for send. It allows btrfs-progs to implement
full and incremental sends. Patches for btrfs-progs will
follow.

Signed-off-by: Alexander Block <ablock84@googlemail.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Reviewed-by: Arne Jansen <sensille@gmx.net>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Reviewed-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
2012-07-25 23:30:19 +02:00
Alexander Block
7069830a9e Btrfs: add btrfs_compare_trees function
This function is used to find the differences between
two trees. The tree compare skips whole subtrees if it
detects shared tree blocks and thus is pretty fast.

Signed-off-by: Alexander Block <ablock84@googlemail.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Reviewed-by: Arne Jansen <sensille@gmx.net>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Reviewed-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
2012-07-25 23:30:14 +02:00
Alexander Block
8ea05e3a42 Btrfs: introduce subvol uuids and times
This patch introduces uuids for subvolumes. Each
subvolume has it's own uuid. In case it was snapshotted,
it also contains parent_uuid. In case it was received,
it also contains received_uuid.

It also introduces subvolume ctime/otime/stime/rtime. The
first two are comparable to the times found in inodes. otime
is the origin/creation time and ctime is the change time.
stime/rtime are only valid on received subvolumes.
stime is the time of the subvolume when it was
sent. rtime is the time of the subvolume when it was
received.

Additionally to the times, we have a transid for each
time. They are updated at the same place as the times.

btrfs receive uses stransid and rtransid to find out
if a received subvolume changed in the meantime.

If an older kernel mounts a filesystem with the
extented fields, all fields become invalid. The next
mount with a new kernel will detect this and reset the
fields.

Signed-off-by: Alexander Block <ablock84@googlemail.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Reviewed-by: Arne Jansen <sensille@gmx.net>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Reviewed-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
2012-07-25 23:28:38 +02:00
Alexander Block
91cb916ca2 Btrfs: make iref_to_path non static
Make iref_to_path non static (needed in send) and rename
it to btrfs_iref_to_path

Signed-off-by: Alexander Block <ablock84@googlemail.com>
2012-07-25 23:25:06 +02:00
Chris Mason
cd1cfc4915 Btrfs: add a barrier before a waitqueue_active check
We were missing wakeups on the delayed ref waitqueue due
to races on waitqueue_active.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-07-25 16:15:08 -04:00
Chris Mason
e9fbcb4220 Btrfs: call the ordered free operation without any locks held
Each ordered operation has a free callback, and this was called with the
worker spinlock held.  Josef made the free callback also call iput,
which we can't do with the spinlock.

This drops the spinlock for the free operation and grabs it again before
moving through the rest of the list.  We'll circle back around to this
and find a cleaner way that doesn't bounce the lock around so much.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
cc: stable@kernel.org
2012-07-25 16:15:07 -04:00
Mitch Harder
2b0ce2c290 Btrfs: Check INCOMPAT flags on remount and add helper function
In support of the recently added capability to remount with lzo
compression, provide a helper function to check the compression
INCOMPAT flags when remounting with lzo compression, and set
the flags if necessary.

Also, implement the new helper function when defragmenting with
explicit lzo compression and when setting the default subvolume.

Signed-off-by: Mitch Harder <mitch.harder@sabayonlinux.org>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-07-25 16:14:31 -04:00
Chris Mason
b478b2baa3 Merge branch 'qgroup' of git://git.jan-o-sch.net/btrfs-unstable into for-linus
Conflicts:
	fs/btrfs/ioctl.c
	fs/btrfs/ioctl.h
	fs/btrfs/transaction.c
	fs/btrfs/transaction.h

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-07-25 16:11:38 -04:00
Arne Jansen
e679376911 Btrfs: add helper for tree enumeration
Often no exact match is wanted but just the next lower or
higher item. There's a lot of duplicated code throughout
btrfs to deal with the corner cases. This patch adds a
helper function that can facilitate searching.

Signed-off-by: Arne Jansen <sensille@gmx.net>
2012-07-25 17:33:18 +02:00
David Sterba
362a20c5e2 btrfs: allow cross-subvolume file clone
Lift the EXDEV condition and allow different root trees for files being
cloned, then pass source inode's root when searching for extents.
Cloning is not allowed to cross vfsmounts, ie. when two subvolumes from
one filesystem are mounted separately.

Signed-off-by: David Sterba <dsterba@suse.cz>
2012-07-25 17:33:09 +02:00
Linus Torvalds
d14b7a419a Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree from Jiri Kosina:
 "Trivial updates all over the place as usual."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (29 commits)
  Fix typo in include/linux/clk.h .
  pci: hotplug: Fix typo in pci
  iommu: Fix typo in iommu
  video: Fix typo in drivers/video
  Documentation: Add newline at end-of-file to files lacking one
  arm,unicore32: Remove obsolete "select MISC_DEVICES"
  module.c: spelling s/postition/position/g
  cpufreq: Fix typo in cpufreq driver
  trivial: typo in comment in mksysmap
  mach-omap2: Fix typo in debug message and comment
  scsi: aha152x: Fix sparse warning and make printing pointer address more portable.
  Change email address for Steve Glendinning
  Btrfs: fix typo in convert_extent_bit
  via: Remove bogus if check
  netprio_cgroup.c: fix comment typo
  backlight: fix memory leak on obscure error path
  Documentation: asus-laptop.txt references an obsolete Kconfig item
  Documentation: ManagementStyle: fixed typo
  mm/vmscan: cleanup comment error in balance_pgdat
  mm: cleanup on the comments of zone_reclaim_stat
  ...
2012-07-24 13:34:56 -07:00
Liu Bo
67c9684f48 Btrfs: improve multi-thread buffer read
While testing with my buffer read fio jobs[1], I find that btrfs does not
perform well enough.

Here is a scenario in fio jobs:

We have 4 threads, "t1 t2 t3 t4", starting to buffer read a same file,
and all of them will race on add_to_page_cache_lru(), and if one thread
successfully puts its page into the page cache, it takes the responsibility
to read the page's data.

And what's more, reading a page needs a period of time to finish, in which
other threads can slide in and process rest pages:

     t1          t2          t3          t4
   add Page1
   read Page1  add Page2
     |         read Page2  add Page3
     |            |        read Page3  add Page4
     |            |           |        read Page4
-----|------------|-----------|-----------|--------
     v            v           v           v
    bio          bio         bio         bio

Now we have four bios, each of which holds only one page since we need to
maintain consecutive pages in bio.  Thus, we can end up with far more bios
than we need.

Here we're going to
a) delay the real read-page section and
b) try to put more pages into page cache.

With that said, we can make each bio hold more pages and reduce the number
of bios we need.

Here is some numbers taken from fio results:
         w/o patch                 w patch
       -------------  --------  ---------------
READ:    745MB/s        +25%       934MB/s

[1]:
[global]
group_reporting
thread
numjobs=4
bs=32k
rw=read
ioengine=sync
directory=/mnt/btrfs/

[READ]
filename=foobar
size=2000M
invalidate=1

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:28:10 -04:00
Liu Bo
df57dbe6bf Btrfs: make btrfs's allocation smoothly with preallocation
For backref walking, we've introduce delayed ref's sequence.  However,
it changes our preallocation behavior.

The story is that when we preallocate an extent and then mark it written
piece by piece, the ideal case should be that we don't need to COW the
extent, which is why we use 'preallocate'.

But we may not make use of preallocation, since when we check for cross refs on
the extent, we may have two ref entries which have the same content except
the sequence value, and we recognize them as cross refs and do COW to allocate
another extent.

So we end up with several pieces of space instead of an whole extent.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:28:10 -04:00
Josef Bacik
51561ffec9 Btrfs: lock the transition from dirty to writeback for an eb
There is a small window where an eb can have no IO bits set on it, which
could potentially result in extent_buffer_under_io() returning false when we
want it to return true, which could result in not fun things happening.  So
in order to protect this case we need to hold the refs_lock when we make
this transition to make sure we get reliable results out of
extent_buffer_udner_io().  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:28:09 -04:00
Josef Bacik
594831c4b2 Btrfs: fix potential race in extent buffer freeing
This sounds sort of impossible but it is the only thing I can think of and
at the very least it is theoretically possible so here it goes.

If we are in try_release_extent_buffer we will check that the ref count on
the extent buffer is 1 and not under IO, and then go down and clear the tree
ref.  If between this check and clearing the tree ref somebody else comes in
and grabs a ref on the eb and the marks it dirty before
try_release_extent_buffer() does it's tree ref clear we can end up with a
dirty eb that will be freed while it is still dirty which will result in a
panic.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:28:09 -04:00
Josef Bacik
e64860aa05 Btrfs: don't return true in releasepage unless we actually freed the eb
I noticed while looking at an extent_buffer race that we will
unconditionally return 1 if we get down to release_extent_buffer after
clearing the tree ref.  However we can easily race in here and get a ref on
the eb and not actually free the eb.  So make release_extent_buffer return 1
if it free'd the eb and 0 if not so we can be a little kinder to the vm.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:28:08 -04:00
Stefan Behrens
a98cdb85b9 Btrfs: suppress printk() if all device I/O stats are zero
Code is added to suppress the I/O stats printing at mount time if all
statistic values are zero.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
2012-07-23 16:28:07 -04:00
Stefan Behrens
5021976d8d Btrfs: remove unwanted printk() for btrfs device I/O stats
People complained about the annoying kernel log message
"btrfs: no dev_stats entry found ... (OK on first mount after mkfs)"
everytime a filesystem is mounted for the first time after running
mkfs. Since the distribution of the btrfs-progs is not synchronized
to the kernel version, mkfs like it is now will be used also in the
future. Then this message is not useful to find errors, it is just
annoying. This commit removes the printk().

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
2012-07-23 16:28:07 -04:00
Li Zefan
18077bb413 Btrfs: rewrite BTRFS_SETGET_FUNCS
BTRFS_SETGET_FUNCS macro is used to generate btrfs_set_foo() and
btrfs_foo() functions, which read and write specific fields in the
extent buffer.

The total number of set/get functions is ~200, but in fact we only
need 8 functions: 2 for u8 field, 2 for u16, 2 for u32 and 2 for u64.

It results in redunction of ~37K bytes.

   text    data     bss     dec     hex filename
 629661   12489     216  642366   9cd3e fs/btrfs/btrfs.o.orig
 592637   12489     216  605342   93c9e fs/btrfs/btrfs.o

Signed-off-by: Li Zefan <lizefan@huawei.com>
2012-07-23 16:28:06 -04:00
Li Zefan
293f7e0740 Btrfs: zero unused bytes in inode item
The otime field is not zeroed, so users will see random otime in an old
filesystem with a new kernel which has otime support in the future.

The reserved bytes are also not zeroed, and we'll have compatibility
issue if we make use of those bytes.

Signed-off-by: Li Zefan <lizefan@huawei.com>
2012-07-23 16:28:05 -04:00
Li Zefan
b4d7c3c945 Btrfs: kill free_space pointer from inode structure
Inodes always allocate free space with BTRFS_BLOCK_GROUP_DATA type,
which means every inode has the same BTRFS_I(inode)->free_space pointer.

This shrinks struct btrfs_inode by 4 bytes (or 8 bytes on 64 bits).

Signed-off-by: Li Zefan <lizefan@huawei.com>
2012-07-23 16:28:05 -04:00
Anand Jain
d5b025d510 btrfs read error corrected message floods the console during recovery
Changing printk_in_rcu to printk_ratelimited_in_rcu will suffice

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:28:04 -04:00
Jan Schmidt
e6466e354a Btrfs: fix buffer leak in btrfs_next_old_leaf
When calling btrfs_next_old_leaf, we were leaking an extent buffer in the
rare case of using the deadlock avoidance code needed for the tree mod log.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:28:03 -04:00
Liu Bo
f6175efab1 Btrfs: do not count in readonly bytes
If a block group is ro, do not count its entries in when we dump space info.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:28:03 -04:00
Liu Bo
799ffc3c31 Btrfs: add ro notification to dump_space_info
Block group has ro attributes, make dump_space_info show it.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:28:02 -04:00
Liu Bo
cf7c1ef6e1 Btrfs: fix a bug of writting free space cache during balance
Here is the whole story:
1)
A free space cache consists of two parts:
o  free space cache inode, which is special becase it's stored in root tree.
o  free space info, which is stored as the above inode's file data.

But we only build up another new inode and does not flush its free space info
onto disk when we _clear and setup_ free space cache, and this ends up with
that the block group cache's cache_state remains DC_SETUP instead of DC_WRITTEN.

And holding DC_SETUP means that we will not truncate this free space cache inode,
which means the disk offset of its file extent will remain _unchanged_ at least
until next transaction finishes committing itself.

2)
We can set a block group readonly when we relocate the block group.

However,
if the readonly block group covers the disk offset where our free space cache
inode is going to write, it will force the free space cache inode into
cow_file_range() and it'll end up hitting a BUG_ON.

3)
Due to the above analysis, we fix this bug by adding the missing dirty flag.

4)
However, it's not over, there is still another case, nospace_cache.

With nospace_cache, we do not want to set dirty flag, instead we just truncate
free space cache inode and bail out with setting cache state DC_WRITTEN.

We can benifit from it since it saves us another 'pre-allocation' part which
usually costs a lot.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:28:02 -04:00
Liu Bo
0678938423 Btrfs: do not abort transaction in prealloc case
During disk balance, we prealloc new file extent for file data relocation,
but we may fail in 'no available space' case, and it leads to flipping btrfs
into readonly.

It is not necessary to bail out and abort transaction since we do have several
ways to rescue ourselves from ENOSPC case.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:28:01 -04:00
Liu Bo
83eea1f1ba Btrfs: kill root from btrfs_is_free_space_inode
Since root can be fetched via BTRFS_I macro directly, we can save an args
for btrfs_is_free_space_inode().

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:28:00 -04:00
Liu Bo
51a8cf9d2d Btrfs: fix btrfs_is_free_space_inode to recognize btree inode
For btree inode, its root is also 'tree root', so btree inode can be
misunderstood as a free space inode.

We should add one more check for btree inode.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:28:00 -04:00
Stefan Behrens
c0901581ad Btrfs: avoid I/O repair BUG() from btree_read_extent_buffer_pages()
From btree_read_extent_buffer_pages(), currently repair_io_failure()
can be called with mirror_num being zero when submit_one_bio() returned
an error before. This used to cause a BUG_ON(!mirror_num) in
repair_io_failure() and indeed this is not a case that needs the I/O
repair code to rewrite disk blocks.
This commit prevents calling repair_io_failure() in this case and thus
avoids the BUG_ON() and malfunction.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:27:59 -04:00
Josef Bacik
f4c738c2e7 Btrfs: rework shrink_delalloc
So shrink_delalloc has grown all sorts of cruft over the years thanks to
many reworkings of how we track enospc.  What happens now as we fill up the
disk is we will loop for freaking ever hoping to reclaim a arbitrary amount
of space of metadata, this was from when everybody flushed at the same time.
Now we only have people flushing one at a time.  So instead of trying to
reclaim a huge amount of space, just try to flush a decent chunk of space,
and stop looping as soon as we have enough free space to satisfy our
reservation.  This makes xfstests 224 go much faster.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:27:58 -04:00
Liu Bo
b9ca0664dc Btrfs: do not set subvolume flags in readonly mode
$ mkfs.btrfs /dev/sdb7
$ btrfstune -S1 /dev/sdb7
$ mount /dev/sdb7 /mnt/btrfs
mount: block device /dev/sdb7 is write-protected, mounting read-only
$ btrfs dev add /dev/sdb8 /mnt/btrfs/

Now we get a btrfs in which mnt flags has readonly but sb flags does
not.  So for those ioctls that only check sb flags with MS_RDONLY, it
is going to be a problem.
Setting subvolume flags is such an ioctl, we should use mnt_want_write_file()
to check RO flags.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
2012-07-23 16:27:58 -04:00
Liu Bo
e54bfa3104 Btrfs: use mnt_want_write_file instead of mnt_want_write
mnt_want_write_file is faster when file has been opened for write.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
2012-07-23 16:27:57 -04:00
Liu Bo
768e9dfe82 Btrfs: remove redundant r/o check for superblock
mnt_want_write() and mnt_want_write_file() will check sb->s_flags with
MS_RDONLY, and we don't need to do it ourselves.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
2012-07-23 16:27:56 -04:00
Liu Bo
a874a63e13 Btrfs: check write access to mount earlier while creating snapshots
Move check of write access to mount into upper functions so that we can
use mnt_want_write_file instead, which is faster than mnt_want_write.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
2012-07-23 16:27:56 -04:00
Liu Bo
287082b0bd Btrfs: fix typo in cow_file_range_async and async_cow_submit
It should be 10 * 1024 * 1024.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2012-07-23 16:27:55 -04:00
Josef Bacik
0e72110692 Btrfs: change how we indicate we're adding csums
There is weird logic I had to put in place to make sure that when we were
adding csums that we'd used the delalloc block rsv instead of the global
block rsv.  Part of this meant that we had to free up our transaction
reservation before we ran the delayed refs since csum deletion happens
during the delayed ref work.  The problem with this is that when we release
a reservation we will add it to the global reserve if it is not full in
order to keep us going along longer before we have to force a transaction
commit.  By releasing our reservation before we run delayed refs we don't
get the opportunity to drain down the global reserve for the work we did, so
we won't refill it as often.  This isn't a problem per-se, it just results
in us possibly committing transactions more and more often, and in rare
cases could cause those WARN_ON()'s to pop in use_block_rsv because we ran
out of space in our block rsv.

This also helps us by holding onto space while the delayed refs run so we
don't end up with as many people trying to do things at the same time, which
again will help us not force commits or hit the use_block_rsv warnings.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:27:55 -04:00
Tsutomu Itoh
b995929515 Btrfs: return error of btrfs_update_inode() to caller
We didn't check error of btrfs_update_inode(), but that error looks
easy to bubble back up.

Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:27:54 -04:00
Dan Carpenter
23291a044c Btrfs: fix error handling in __add_reloc_root()
We dereferenced "node" in the error message after freeing it.  Also
btrfs_panic() can return so we should return an error code instead of
continuing.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
2012-07-23 16:27:53 -04:00
Ilya Dryomov
44c44af2f4 Btrfs: do not ignore errors from btrfs_cleanup_fs_roots() when mounting
There used to be a BUG_ON(ret) there before EH patch (79787eaa) went in.
Bail out with EINVAL.

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-07-23 16:27:53 -04:00
Ilya Dryomov
fed425c742 Btrfs: do not return EINVAL instead of ENOMEM from open_ctree()
When bailing from open_ctree() err is returned, not ret.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-07-23 16:27:52 -04:00
Josef Bacik
02db0844be Btrfs: add DEVICE_READY ioctl
This will be used in conjunction with btrfs device ready <dev>.  This is
needed for initrd's to have a nice and lightweight way to tell if all of the
devices needed for a file system are in the cache currently.  This keeps
them from having to do mount+sleep loops waiting for devices to show up.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:27:42 -04:00
Josef Bacik
96c3f4331a Btrfs: flush delayed inodes if we're short on space
Those crazy gentoo guys have been complaining about ENOSPC errors on their
portage volumes.  This is because doing things like untar tends to create
lots of new files which will soak up all the reservation space in the
delayed inodes.  Usually this gets papered over by the fact that we will try
and commit the transaction, however if this happens in the wrong spot or we
choose not to commit the transaction you will be screwed.  So add the
ability to expclitly flush delayed inodes to free up space.  Please test
this out guys to make sure it works since as usual I cannot reproduce.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 15:41:40 -04:00
David Sterba
b27f7c0c15 btrfs: join DEV_STATS ioctls to one
Commit c11d2c236c (Btrfs: add ioctl to get and reset the device
stats) introduced two ioctls doing almost the same thing distinguished
by just the ioctl number which encodes "do reset after read". I have
suggested

http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg16604.html

to implement it via the ioctl args. This hasn't happen, and I think we
should use a more clean way to pass flags and should not waste ioctl
numbers.

CC: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: David Sterba <dsterba@suse.cz>
2012-07-23 15:41:40 -04:00
Andrew Mahone
a43a211133 btrfs: ignore unfragmented file checks in defrag when compression enabled - rebased
Rebased on btrfs-next and retested.

Inform should_defrag_range if BTRFS_DEFRAG_RANGE_COMPRESS is set. If so, skip
checks for adjacent extents and extent size when deciding whether to defrag,
as these can prevent an uncompressed and unfragmented file from being
compressed as requested.

Signed-off-by: Andrew Mahone <andrew.mahone@gmail.com>
2012-07-23 15:41:39 -04:00
Dan Carpenter
e4b50e14c8 Btrfs: small naming cleanup in join_transaction()
"root->fs_info" and "fs_info" are the same, but "fs_info" is prefered
because it is shorter and that's what is used in the rest of the
function.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
2012-07-23 15:41:39 -04:00
Alexander Block
2bc5565286 Btrfs: don't update atime on RO subvolumes
Before the update_time inode operation was indroduced, it was
not possible to prevent updates of atime on RO subvolumes. VFS
was only able to check for RO on the mount, but did not know
anything about btrfs subvolumes.

btrfs_update_time does now check if the root is RO and skip
updating of times.

Signed-off-by: Alexander Block <ablock84@googlemail.com>
2012-07-23 15:41:38 -04:00
Arnd Hannemann
063849eafd Btrfs: allow mount -o remount,compress=no
Btrfs allows to turn on compression on a mounted and used filesystem
by issuing mount -o remount,compress=lzo.
This patch allows to turn compression off again
while the filesystem is mounted. As suggested by David Sterba
if the compress-force option was set, it is implicitly cleared
if compression is turned off.

Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Arnd Hannemann <arnd@arndnet.de>
2012-07-23 15:41:38 -04:00
Josef Bacik
c5c3c5f31e Btrfs: remove ->dirty_inode
We do all of our inode updating when we change it, and now that we do
->update_time we don't need ->dirty_inode for atime updates anymore, so just
remove it.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2012-07-23 15:41:38 -04:00
Chris Mason
cbea5ac1ee Btrfs: reduce calls to wake_up on uncontended locks
The btrfs locks were unconditionally calling wake_up as the
locks were released.  This lead to extra thrashing on the waitqueue,
especially for locks that were dominated by readers.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-07-23 15:36:18 -04:00
Chris Mason
e39e64ac0c Btrfs: don't wait around for new log writers on an SSD
Waiting on spindles improves performance, but ssds want all the
IO as quickly as we can push it down.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-07-23 15:36:17 -04:00
Al Viro
11e62a8fab btrfs: switch btrfs_ioctl_balance() to mnt_want_write_file()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-23 00:01:43 +04:00
David Howells
9249e17fe0 VFS: Pass mount flags to sget()
Pass mount flags to sget() so that it can use them in initialising a new
superblock before the set function is called.  They could also be passed to the
compare function.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-14 16:38:34 +04:00
Al Viro
ebfc3b49a7 don't pass nameidata to ->create()
boolean "does it have to be exclusive?" flag is passed instead;
Local filesystem should just ignore it - the object is guaranteed
not to be there yet.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-14 16:34:47 +04:00
Al Viro
00cd8dd3bf stop passing nameidata to ->lookup()
Just the flags; only NFS cares even about that, but there are
legitimate uses for such argument.  And getting rid of that
completely would require splitting ->lookup() into a couple
of methods (at least), so let's leave that alone for now...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-14 16:34:32 +04:00
Al Viro
b3d9b7a3c7 vfs: switch i_dentry/d_alias to hlist
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-14 16:32:55 +04:00
Liu Bo
10983f2e8d Btrfs: fix typo in convert_extent_bit
It should be convert_extent_bit.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2012-07-12 11:27:34 +02:00
Arne Jansen
6f72c7e20d Btrfs: add qgroup inheritance
When creating a subvolume or snapshot, it is necessary
to initialize the qgroup account with a copy of some
other (tracking) qgroup. This patch adds parameters
to the ioctls to pass the information from which qgroup
to inherit.

Signed-off-by: Arne Jansen <sensille@gmx.net>
2012-07-12 10:54:40 +02:00
Arne Jansen
5d13a37bd5 Btrfs: add qgroup ioctls
Ioctls to control the qgroup feature like adding and
removing qgroups and assigning qgroups.

Signed-off-by: Arne Jansen <sensille@gmx.net>
2012-07-12 10:54:39 +02:00
Arne Jansen
c556723794 Btrfs: hooks to reserve qgroup space
Like block reserves, reserve a small piece of space on each
transaction start and for delalloc. These are the hooks that
can actually return EDQUOT to the user.
The amount of space reserved is tracked in the transaction
handle.

Signed-off-by: Arne Jansen <sensille@gmx.net>
2012-07-12 10:54:39 +02:00
Jan Schmidt
546adb0d81 Btrfs: hooks for qgroup to record delayed refs
Hooks into qgroup code to record refs and into transaction commit.
This is the main entry point for qgroup. Basically every change in
extent backrefs got accounted to the appropriate qgroups.

Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-07-12 10:54:38 +02:00
Arne Jansen
bcef60f249 Btrfs: quota tree support and startup
Init the quota tree along with the others on open_ctree
and close_ctree. Add the quota tree to the list of well
known trees in btrfs_read_fs_root_no_name.

Signed-off-by: Arne Jansen <sensille@gmx.net>
2012-07-12 10:54:38 +02:00
Jan Schmidt
edf39272db Btrfs: call the qgroup accounting functions
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-07-12 10:54:37 +02:00
Arne Jansen
bed92eae26 Btrfs: qgroup implementation and prototypes
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-07-12 10:54:21 +02:00
Arne Jansen
709c0486b9 Btrfs: Test code to change the order of delayed-ref processing
Normally delayed refs get processed in ascending bytenr order. This
correlates in most cases to the order added. To expose dependencies
on this order, we start to process the tree in the middle instead of
the beginning.
This code is only effective when SCRAMBLE_DELAYED_REFS is defined.

Signed-off-by: Arne Jansen <sensille@gmx.net>
2012-07-10 15:14:44 +02:00
Arne Jansen
416ac51da9 Btrfs: qgroup state and initialization
Add state to fs_info.

Signed-off-by: Arne Jansen <sensille@gmx.net>
2012-07-10 15:14:44 +02:00
Arne Jansen
20897f5c86 Btrfs: added helper to create new trees
This creates a brand new tree. Will be used to create
the quota tree.

Signed-off-by: Arne Jansen <sensille@gmx.net>
2012-07-10 15:14:43 +02:00
Arne Jansen
d13603ef6e Btrfs: check the root passed to btrfs_end_transaction
This patch only add a consistancy check to validate that the
same root is passed to start_transaction and end_transaction.
Subvolume quota depends on this.

Signed-off-by: Arne Jansen <sensille@gmx.net>
2012-07-10 15:14:43 +02:00
Arne Jansen
2f38b3e190 Btrfs: add helper for tree enumeration
Often no exact match is wanted but just the next lower or
higher item. There's a lot of duplicated code throughout
btrfs to deal with the corner cases. This patch adds a
helper function that can facilitate searching.

Signed-off-by: Arne Jansen <sensille@gmx.net>
2012-07-10 15:14:42 +02:00
Arne Jansen
630dc772ea Btrfs: qgroup on-disk format
Not all features are in use by the current version
and thus may change in the future.

Signed-off-by: Arne Jansen <sensille@gmx.net>
2012-07-10 15:14:42 +02:00
Jan Schmidt
097b8a7c9e Btrfs: join tree mod log code with the code holding back delayed refs
We've got two mechanisms both required for reliable backref resolving (tree
mod log and holding back delayed refs). You cannot make use of one without
the other. So instead of requiring the user of this mechanism to setup both
correctly, we join them into a single interface.

Additionally, we stop inserting non-blockers into fs_info->tree_mod_seq_list
as we did before, which was of no value.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-07-10 15:14:41 +02:00
Jan Schmidt
cf5388307a Btrfs: fix buffer leak in btrfs_next_old_leaf
When calling btrfs_next_old_leaf, we were leaking an extent buffer in the
rare case of using the deadlock avoidance code needed for the tree mod log.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-07-10 15:14:41 +02:00
Linus Torvalds
5eecb9cc90 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "I held off on my rc5 pull because I hit an oops during log recovery
  after a crash.  I wanted to make sure it wasn't a regression because
  we have some logging fixes in here.

  It turns out that a commit during the merge window just made it much
  more likely to trigger directory logging instead of full commits,
  which exposed an old bug.

  The new backref walking code got some additional fixes.  This should
  be the final set of them.

  Josef fixed up a corner where our O_DIRECT writes and buffered reads
  could expose old file contents (not stale, just not the most recent).
  He and Liu Bo fixed crashes during tree log recover as well.

  Ilya fixed errors while we resume disk balancing operations on
  readonly mounts."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: run delayed directory updates during log replay
  Btrfs: hold a ref on the inode during writepages
  Btrfs: fix tree log remove space corner case
  Btrfs: fix wrong check during log recovery
  Btrfs: use _IOR for BTRFS_IOC_SUBVOL_GETFLAGS
  Btrfs: resume balance on rw (re)mounts properly
  Btrfs: restore restriper state on all mounts
  Btrfs: fix dio write vs buffered read race
  Btrfs: don't count I/O statistic read errors for missing devices
  Btrfs: resolve tree mod log locking issue in btrfs_next_leaf
  Btrfs: fix tree mod log rewind of ADD operations
  Btrfs: leave critical region in btrfs_find_all_roots as soon as possible
  Btrfs: always put insert_ptr modifications into the tree mod log
  Btrfs: fix tree mod log for root replacements at leaf level
  Btrfs: support root level changes in __resolve_indirect_ref
  Btrfs: avoid waiting for delayed refs when we must not
2012-07-05 13:06:25 -07:00
Chris Mason
b6305567e7 Btrfs: run delayed directory updates during log replay
While we are resolving directory modifications in the
tree log, we are triggering delayed metadata updates to
the filesystem btrees.

This commit forces the delayed updates to run so the
replay code can find any modifications done.  It stops
us from crashing because the directory deleltion replay
expects items to be removed immediately from the tree.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
cc: stable@kernel.org
2012-07-02 15:39:19 -04:00
Josef Bacik
7fd1a3f73f Btrfs: hold a ref on the inode during writepages
We can race with unlink and not actually be able to do our igrab in
btrfs_add_ordered_extent.  This will result in all sorts of problems.
Instead of doing the complicated work to try and handle returning an error
properly from btrfs_add_ordered_extent, just hold a ref to the inode during
writepages.  If we cannot grab a ref we know we're freeing this inode anyway
and can just drop the dirty pages on the floor, because screw them we're
going to invalidate them anyway.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-02 15:39:18 -04:00
Josef Bacik
bdb7d303b3 Btrfs: fix tree log remove space corner case
The tree log stuff can have allocated space that we end up having split
across a bitmap and a real extent.  The free space code does not deal with
this, it assumes that if it finds an extent or bitmap entry that the entire
range must fall within the entry it finds.  This isn't necessarily the case,
so rework the remove function so it can handle this case properly.  This
fixed two panics the user hit, first in the case where the space was
initially in a bitmap and then in an extent entry, and then the reverse
case.  Thanks,

Reported-and-tested-by: Shaun Reich <sreich@kde.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-02 15:39:18 -04:00
Liu Bo
6bf02314d9 Btrfs: fix wrong check during log recovery
When we're evicting an inode during log recovery, we need to ensure that the inode
is not in orphan state any more, which means inode's run_time flags has _no_
BTRFS_INODE_HAS_ORPHAN_ITEM.  Thus, the BUG_ON was triggered because of a wrong
check for the flags.

Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-02 15:39:17 -04:00
Alexander Block
d3a94048c9 Btrfs: use _IOR for BTRFS_IOC_SUBVOL_GETFLAGS
We used the wrong ioctl macro for the getflags ioctl before.
As we don't have the set/getflags ioctls in the user space ioctl.h
at the moment, it's safe to fix it now.

Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
2012-07-02 15:39:17 -04:00
Ilya Dryomov
2b6ba629b5 Btrfs: resume balance on rw (re)mounts properly
This introduces btrfs_resume_balance_async(), which, given that
restriper state was recovered earlier by btrfs_recover_balance(),
resumes balance in btrfs-balance kthread.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-07-02 15:39:17 -04:00
Ilya Dryomov
68310a5e42 Btrfs: restore restriper state on all mounts
Fix a bug that triggered asserts in btrfs_balance() in both normal and
resume modes -- restriper state was not properly restored on read-only
mounts.  This factors out resuming code from btrfs_restore_balance(),
which is now also called earlier in the mount sequence to avoid the
problem of some early writes getting the old profile.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-07-02 15:39:16 -04:00
Josef Bacik
c3473e8300 Btrfs: fix dio write vs buffered read race
Miao pointed out there's a problem with mixing dio writes and buffered
reads.  If the read happens between us invalidating the page range and
actually locking the extent we can bring in pages into page cache.  Then
once the write finishes if somebody tries to read again it will just find
uptodate pages and we'll read stale data.  So we need to lock the extent and
check for uptodate bits in the range.  If there are uptodate bits we need to
unlock and invalidate again.  This will keep this race from happening since
we will hold the extent locked until we create the ordered extent, and then
teh read side always waits for ordered extents.  There was also a race in
how we updated i_size, previously we were relying on the generic DIO stuff
to adjust the i_size after the DIO had completed, but this happens outside
of the extent lock which means reads could come in and not see the updated
i_size.  So instead move this work into where we create the extents, and
then this way the update ordered i_size stuff works properly in the endio
handlers.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2012-07-02 15:36:23 -04:00
Stefan Behrens
597a60fade Btrfs: don't count I/O statistic read errors for missing devices
It is normal behaviour of the low level btrfs function btrfs_map_bio()
to complete a bio with -EIO if the device is missing, instead of just
preventing the bio creation in an earlier step.
This used to cause I/O statistic read error increments and annoying
printk_ratelimited messages. This commit fixes the issue.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Reported-by: Carey Underwood <cwillu@cwillu.com>
2012-07-02 15:36:23 -04:00
Jan Schmidt
d42244a0d3 Btrfs: resolve tree mod log locking issue in btrfs_next_leaf
With the tree mod log, we may end up with two roots (the current root and a
rewinded version of it) both pointing to two leaves, l1 and l2, of which l2
had already been cow-ed in the current transaction. If we don't rewind any
tree blocks, we cannot have two roots both pointing to an already cowed tree
block.

Now there is btrfs_next_leaf, which has a leaf locked and wants a lock on
the next (right) leaf. And there is push_leaf_left, which has a (cowed!)
leaf locked and wants a lock on the previous (left) leaf.

In order to solve this dead lock situation, we use try_lock in
btrfs_next_leaf (only in case it's called with a tree mod log time_seq
paramter) and if we fail to get a lock on the next leaf, we give up our lock
on the current leaf and retry from the very beginning.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-06-27 16:34:40 +02:00
Jan Schmidt
19956c7e94 Btrfs: fix tree mod log rewind of ADD operations
When a MOD_LOG_KEY_ADD operation is rewinded, we remove the key from the
tree block. If its not the last key, removal involves a move operation.
This move operation was explicitly done before this commit.

However, at insertion time, there's a move operation before the actual
addition to make room for the new key, which is recorded in the tree mod
log as well. This means, we must drop the move operation when rewinding the
add operation, because the next operation we'll be rewinding will be the
corresponding MOD_LOG_MOVE_KEYS operation.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-06-27 16:34:40 +02:00
Jan Schmidt
155725c9c0 Btrfs: leave critical region in btrfs_find_all_roots as soon as possible
When delayed refs exist, btrfs_find_all_roots used to hold the delayed ref
mutex way longer than actually required. We ought to drop it immediately
after we're done collecting all the delayed refs.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-06-27 16:34:39 +02:00
Jan Schmidt
c3e0696523 Btrfs: always put insert_ptr modifications into the tree mod log
Several callers of insert_ptr set the tree_mod_log parameter to 0 to avoid
addition to the tree mod log. In fact, we need all of those operations. This
commit simply removes the additional parameter and makes addition to the
tree mod log unconditional.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-06-27 16:34:39 +02:00
Jan Schmidt
28da9fb446 Btrfs: fix tree mod log for root replacements at leaf level
For the tree mod log, we don't log any operations at leaf level. If the root
is at the leaf level (i.e. the tree consists only of the root), then
__tree_mod_log_oldest_root will find a ROOT_REPLACE operation in the log
(because we always log that one no matter which level), but no other
operations.

With this patch __tree_mod_log_oldest_root exits cleanly instead of
BUGging in this situation. get_old_root checks if its really a root at leaf
level in case we don't have any operations and WARNs if this assumption
breaks.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-06-27 16:34:38 +02:00
Jan Schmidt
9345457f4a Btrfs: support root level changes in __resolve_indirect_ref
With the tree mod log, we can have a tree that's two levels high, but
btrfs_search_old_slot may still return a path with the tree root at level
one instead. __resolve_indirect_ref must care for this and accept parents in
a lower level than expected.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-06-27 16:34:38 +02:00
Jan Schmidt
8ca78f3eda Btrfs: avoid waiting for delayed refs when we must not
We track two conditions to decide if we should sleep while waiting for more
delayed refs, the number of delayed refs (num_refs) and the first entry in
the list of blockers (first_seq).

When we suspect staleness, we save num_refs and do one more cycle. If
nothing changes, we then save first_seq for later comparison and do
wait_event. We ought to save first_seq the very same moment we're saving
num_refs. Otherwise we cannot be sure that nothing has changed and we might
start waiting when we shouldn't, which could lead to starvation.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-06-27 16:34:35 +02:00
Linus Torvalds
8874e812fe Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "This is a small pull with btrfs fixes.  The biggest of the bunch is
  another fix for the new backref walking code.

  We're still hammering out one btrfs dio vs buffered reads problem, but
  that one will have to wait for the next rc."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: delay iput with async extents
  Btrfs: add a missing spin_lock
  Btrfs: don't assume to be on the correct extent in add_all_parents
  Btrfs: introduce btrfs_next_old_item
2012-06-21 13:41:07 -07:00
Josef Bacik
cb77fcd885 Btrfs: delay iput with async extents
There is some concern that these iput()'s could be the final iputs and could
induce lockups on people waiting on writeback.  This would happen in the
rare case that we don't create ordered extents because of an error, but it
is theoretically possible and we already have a mechanism to deal with this
so just make them delayed iputs to negate any worry.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-06-21 07:19:36 -04:00
Josef Bacik
e18fca7342 Btrfs: add a missing spin_lock
When fixing up the locking in the delayed ref destruction work I accidently
broke the locking myself ;(.  Add back a spin_lock that should be there and
we are now all set.  Thanks,
Btrfs: add a missing spin_lock

When fixing up the locking in the delayed ref destruction work I accidently
broke the locking myself ;(.  Add back a spin_lock that should be there and
we are now all set.  Thanks,

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-06-21 07:19:35 -04:00
Alexander Block
69bca40d41 Btrfs: don't assume to be on the correct extent in add_all_parents
add_all_parents did assume that path is already at a correct extent data
item, which may not be true in case of data extents that were partly
rewritten and splitted.

We need to check if we're on a matching extent for every item and only
for the ones after the first. The loop is changed to do this now.

This patch also fixes a bug introduced with commit 3b127fd8 "Btrfs:
remove obsolete btrfs_next_leaf call from __resolve_indirect_ref".
The removal of next_leaf did sometimes result in slot==nritems when
the above described case happens, and thus resulting in invalid values
(e.g. wanted_obejctid) in add_all_parents (leading to missed backrefs
or even crashes).

Signed-off-by: Alexander Block <ablock84@googlemail.com>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-06-21 07:19:34 -04:00
Alexander Block
1c8f52a5e9 Btrfs: introduce btrfs_next_old_item
We introduce btrfs_next_old_item that uses btrfs_next_old_leaf instead
of btrfs_next_leaf.

btrfs_next_item is also changed to simply call btrfs_next_old_item with
time_seq being 0.

Signed-off-by: Alexander Block <ablock84@googlemail.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-06-21 07:19:34 -04:00
Linus Torvalds
d865983292 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs compile warning fixes from Chris Mason.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: cast devid to unsigned long long for printk %llu
  Btrfs: init old_generation in get_old_root
2012-06-16 17:01:41 -07:00
Chris Mason
a8c4a33b98 Btrfs: cast devid to unsigned long long for printk %llu
Avoid warning in 32 bit machines

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-06-15 20:07:17 -04:00