2007-06-12 21:07:21 +08:00
|
|
|
/*
|
|
|
|
* Copyright (C) 2007 Oracle. All rights reserved.
|
|
|
|
*
|
|
|
|
* This program is free software; you can redistribute it and/or
|
|
|
|
* modify it under the terms of the GNU General Public
|
|
|
|
* License v2 as published by the Free Software Foundation.
|
|
|
|
*
|
|
|
|
* This program is distributed in the hope that it will be useful,
|
|
|
|
* but WITHOUT ANY WARRANTY; without even the implied warranty of
|
|
|
|
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
|
|
|
|
* General Public License for more details.
|
|
|
|
*
|
|
|
|
* You should have received a copy of the GNU General Public
|
|
|
|
* License along with this program; if not, write to the
|
|
|
|
* Free Software Foundation, Inc., 59 Temple Place - Suite 330,
|
|
|
|
* Boston, MA 021110-1307, USA.
|
|
|
|
*/
|
2007-07-11 22:00:37 +08:00
|
|
|
#include <linux/sched.h>
|
2007-12-22 05:27:24 +08:00
|
|
|
#include <linux/pagemap.h>
|
2008-04-29 03:29:52 +08:00
|
|
|
#include <linux/writeback.h>
|
2008-08-12 21:13:26 +08:00
|
|
|
#include <linux/blkdev.h>
|
2009-02-04 22:23:45 +08:00
|
|
|
#include <linux/sort.h>
|
2009-03-11 00:39:20 +08:00
|
|
|
#include <linux/rcupdate.h>
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
#include <linux/kthread.h>
|
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the followings.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 16:04:11 +08:00
|
|
|
#include <linux/slab.h>
|
2011-06-14 18:52:17 +08:00
|
|
|
#include <linux/ratelimit.h>
|
2008-11-20 23:22:27 +08:00
|
|
|
#include "compat.h"
|
2007-12-11 22:25:06 +08:00
|
|
|
#include "hash.h"
|
2007-02-26 23:40:21 +08:00
|
|
|
#include "ctree.h"
|
|
|
|
#include "disk-io.h"
|
|
|
|
#include "print-tree.h"
|
2007-03-17 04:20:31 +08:00
|
|
|
#include "transaction.h"
|
2008-03-25 03:01:56 +08:00
|
|
|
#include "volumes.h"
|
2013-01-30 07:40:14 +08:00
|
|
|
#include "raid56.h"
|
2008-06-26 04:01:30 +08:00
|
|
|
#include "locking.h"
|
2009-04-03 21:47:43 +08:00
|
|
|
#include "free-space-cache.h"
|
2012-09-13 18:51:36 +08:00
|
|
|
#include "math.h"
|
2007-02-26 23:40:21 +08:00
|
|
|
|
2011-09-12 18:22:57 +08:00
|
|
|
#undef SCRAMBLE_DELAYED_REFS
|
|
|
|
|
Btrfs: fix enospc error caused by wrong checks of the chunk
When we did sysbench test for inline files, enospc error happened easily though
there was lots of free disk space which could be allocated for new chunks.
Reproduce steps:
# mkfs.btrfs -b $((2 * 1024 * 1024 * 1024)) <test partition>
# mount <test partition> /mnt
# ulimit -n 102400
# cd /mnt
# sysbench --num-threads=1 --test=fileio --file-num=81920 \
> --file-total-size=80M --file-block-size=1K --file-io-mode=sync \
> --file-test-mode=seqwr prepare
# sysbench --num-threads=1 --test=fileio --file-num=81920 \
> --file-total-size=80M --file-block-size=1K --file-io-mode=sync \
> --file-test-mode=seqwr run
<soon later, BUG_ON() was triggered by enospc error>
The reason of this bug is:
Now, we can reserve space which is larger than the free space in the chunks if
we have enough free disk space which can be used for new chunks. By this way,
the space allocator should allocate a new chunk by force if there is no free
space in the free space cache. But there are two wrong checks which break this
operation.
One is
if (ret == -ENOSPC && num_bytes > min_alloc_size)
in btrfs_reserve_extent(), it is wrong, we should try to allocate a new chunk
even we fail to allocate free space by minimum allocable size.
The other is
if (space_info->force_alloc)
force = space_info->force_alloc;
in do_chunk_alloc(). It makes the allocator ignore CHUNK_ALLOC_FORCE If someone
sets ->force_alloc to CHUNK_ALLOC_LIMITED, and makes the enospc error happen.
Fix these two wrong checks. Especially the second one, we fix it by changing
the value of CHUNK_ALLOC_LIMITED and CHUNK_ALLOC_FORCE, and make
CHUNK_ALLOC_FORCE greater than CHUNK_ALLOC_LIMITED since CHUNK_ALLOC_FORCE has
higher priority. And if the value which is passed in by the caller is greater
than ->force_alloc, use the passed value.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-01-27 04:01:12 +08:00
|
|
|
/*
|
|
|
|
* control flags for do_chunk_alloc's force field
|
2011-04-16 04:05:44 +08:00
|
|
|
* CHUNK_ALLOC_NO_FORCE means to only allocate a chunk
|
|
|
|
* if we really need one.
|
|
|
|
*
|
|
|
|
* CHUNK_ALLOC_LIMITED means to only try and allocate one
|
|
|
|
* if we have very few chunks already allocated. This is
|
|
|
|
* used as part of the clustering code to help make sure
|
|
|
|
* we have a good pool of storage to cluster in, without
|
|
|
|
* filling the FS with empty chunks
|
|
|
|
*
|
Btrfs: fix enospc error caused by wrong checks of the chunk
When we did sysbench test for inline files, enospc error happened easily though
there was lots of free disk space which could be allocated for new chunks.
Reproduce steps:
# mkfs.btrfs -b $((2 * 1024 * 1024 * 1024)) <test partition>
# mount <test partition> /mnt
# ulimit -n 102400
# cd /mnt
# sysbench --num-threads=1 --test=fileio --file-num=81920 \
> --file-total-size=80M --file-block-size=1K --file-io-mode=sync \
> --file-test-mode=seqwr prepare
# sysbench --num-threads=1 --test=fileio --file-num=81920 \
> --file-total-size=80M --file-block-size=1K --file-io-mode=sync \
> --file-test-mode=seqwr run
<soon later, BUG_ON() was triggered by enospc error>
The reason of this bug is:
Now, we can reserve space which is larger than the free space in the chunks if
we have enough free disk space which can be used for new chunks. By this way,
the space allocator should allocate a new chunk by force if there is no free
space in the free space cache. But there are two wrong checks which break this
operation.
One is
if (ret == -ENOSPC && num_bytes > min_alloc_size)
in btrfs_reserve_extent(), it is wrong, we should try to allocate a new chunk
even we fail to allocate free space by minimum allocable size.
The other is
if (space_info->force_alloc)
force = space_info->force_alloc;
in do_chunk_alloc(). It makes the allocator ignore CHUNK_ALLOC_FORCE If someone
sets ->force_alloc to CHUNK_ALLOC_LIMITED, and makes the enospc error happen.
Fix these two wrong checks. Especially the second one, we fix it by changing
the value of CHUNK_ALLOC_LIMITED and CHUNK_ALLOC_FORCE, and make
CHUNK_ALLOC_FORCE greater than CHUNK_ALLOC_LIMITED since CHUNK_ALLOC_FORCE has
higher priority. And if the value which is passed in by the caller is greater
than ->force_alloc, use the passed value.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-01-27 04:01:12 +08:00
|
|
|
* CHUNK_ALLOC_FORCE means it must try to allocate one
|
|
|
|
*
|
2011-04-16 04:05:44 +08:00
|
|
|
*/
|
|
|
|
enum {
|
|
|
|
CHUNK_ALLOC_NO_FORCE = 0,
|
Btrfs: fix enospc error caused by wrong checks of the chunk
When we did sysbench test for inline files, enospc error happened easily though
there was lots of free disk space which could be allocated for new chunks.
Reproduce steps:
# mkfs.btrfs -b $((2 * 1024 * 1024 * 1024)) <test partition>
# mount <test partition> /mnt
# ulimit -n 102400
# cd /mnt
# sysbench --num-threads=1 --test=fileio --file-num=81920 \
> --file-total-size=80M --file-block-size=1K --file-io-mode=sync \
> --file-test-mode=seqwr prepare
# sysbench --num-threads=1 --test=fileio --file-num=81920 \
> --file-total-size=80M --file-block-size=1K --file-io-mode=sync \
> --file-test-mode=seqwr run
<soon later, BUG_ON() was triggered by enospc error>
The reason of this bug is:
Now, we can reserve space which is larger than the free space in the chunks if
we have enough free disk space which can be used for new chunks. By this way,
the space allocator should allocate a new chunk by force if there is no free
space in the free space cache. But there are two wrong checks which break this
operation.
One is
if (ret == -ENOSPC && num_bytes > min_alloc_size)
in btrfs_reserve_extent(), it is wrong, we should try to allocate a new chunk
even we fail to allocate free space by minimum allocable size.
The other is
if (space_info->force_alloc)
force = space_info->force_alloc;
in do_chunk_alloc(). It makes the allocator ignore CHUNK_ALLOC_FORCE If someone
sets ->force_alloc to CHUNK_ALLOC_LIMITED, and makes the enospc error happen.
Fix these two wrong checks. Especially the second one, we fix it by changing
the value of CHUNK_ALLOC_LIMITED and CHUNK_ALLOC_FORCE, and make
CHUNK_ALLOC_FORCE greater than CHUNK_ALLOC_LIMITED since CHUNK_ALLOC_FORCE has
higher priority. And if the value which is passed in by the caller is greater
than ->force_alloc, use the passed value.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-01-27 04:01:12 +08:00
|
|
|
CHUNK_ALLOC_LIMITED = 1,
|
|
|
|
CHUNK_ALLOC_FORCE = 2,
|
2011-04-16 04:05:44 +08:00
|
|
|
};
|
|
|
|
|
2011-07-27 05:00:46 +08:00
|
|
|
/*
|
|
|
|
* Control how reservations are dealt with.
|
|
|
|
*
|
|
|
|
* RESERVE_FREE - freeing a reservation.
|
|
|
|
* RESERVE_ALLOC - allocating space and we need to update bytes_may_use for
|
|
|
|
* ENOSPC accounting
|
|
|
|
* RESERVE_ALLOC_NO_ACCOUNT - allocating space and we should not update
|
|
|
|
* bytes_may_use as the ENOSPC accounting is done elsewhere
|
|
|
|
*/
|
|
|
|
enum {
|
|
|
|
RESERVE_FREE = 0,
|
|
|
|
RESERVE_ALLOC = 1,
|
|
|
|
RESERVE_ALLOC_NO_ACCOUNT = 2,
|
|
|
|
};
|
|
|
|
|
2012-12-27 17:01:19 +08:00
|
|
|
static int update_block_group(struct btrfs_root *root,
|
2010-05-16 22:46:25 +08:00
|
|
|
u64 bytenr, u64 num_bytes, int alloc);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
u64 bytenr, u64 num_bytes, u64 parent,
|
|
|
|
u64 root_objectid, u64 owner_objectid,
|
|
|
|
u64 owner_offset, int refs_to_drop,
|
|
|
|
struct btrfs_delayed_extent_op *extra_op);
|
|
|
|
static void __run_delayed_extent_op(struct btrfs_delayed_extent_op *extent_op,
|
|
|
|
struct extent_buffer *leaf,
|
|
|
|
struct btrfs_extent_item *ei);
|
|
|
|
static int alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
u64 parent, u64 root_objectid,
|
|
|
|
u64 flags, u64 owner, u64 offset,
|
|
|
|
struct btrfs_key *ins, int ref_mod);
|
|
|
|
static int alloc_reserved_tree_block(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
u64 parent, u64 root_objectid,
|
|
|
|
u64 flags, struct btrfs_disk_key *key,
|
|
|
|
int level, struct btrfs_key *ins);
|
2009-02-21 00:00:09 +08:00
|
|
|
static int do_chunk_alloc(struct btrfs_trans_handle *trans,
|
2012-09-13 02:08:47 +08:00
|
|
|
struct btrfs_root *extent_root, u64 flags,
|
|
|
|
int force);
|
2009-09-12 04:11:19 +08:00
|
|
|
static int find_next_key(struct btrfs_path *path, int level,
|
|
|
|
struct btrfs_key *key);
|
2009-09-12 04:12:44 +08:00
|
|
|
static void dump_space_info(struct btrfs_space_info *info, u64 bytes,
|
|
|
|
int dump_block_groups);
|
2011-07-27 05:00:46 +08:00
|
|
|
static int btrfs_update_reserved_bytes(struct btrfs_block_group_cache *cache,
|
|
|
|
u64 num_bytes, int reserve);
|
2013-02-08 05:06:02 +08:00
|
|
|
static int block_rsv_use_bytes(struct btrfs_block_rsv *block_rsv,
|
|
|
|
u64 num_bytes);
|
2009-02-21 00:00:09 +08:00
|
|
|
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
static noinline int
|
|
|
|
block_group_cache_done(struct btrfs_block_group_cache *cache)
|
|
|
|
{
|
|
|
|
smp_mb();
|
|
|
|
return cache->cached == BTRFS_CACHE_FINISHED;
|
|
|
|
}
|
|
|
|
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
static int block_group_bits(struct btrfs_block_group_cache *cache, u64 bits)
|
|
|
|
{
|
|
|
|
return (cache->flags & bits) == bits;
|
|
|
|
}
|
|
|
|
|
2011-04-20 21:52:26 +08:00
|
|
|
static void btrfs_get_block_group(struct btrfs_block_group_cache *cache)
|
2009-11-14 04:12:59 +08:00
|
|
|
{
|
|
|
|
atomic_inc(&cache->count);
|
|
|
|
}
|
|
|
|
|
|
|
|
void btrfs_put_block_group(struct btrfs_block_group_cache *cache)
|
|
|
|
{
|
2010-05-16 22:46:25 +08:00
|
|
|
if (atomic_dec_and_test(&cache->count)) {
|
|
|
|
WARN_ON(cache->pinned > 0);
|
|
|
|
WARN_ON(cache->reserved > 0);
|
2011-03-29 13:46:06 +08:00
|
|
|
kfree(cache->free_space_ctl);
|
2009-11-14 04:12:59 +08:00
|
|
|
kfree(cache);
|
2010-05-16 22:46:25 +08:00
|
|
|
}
|
2009-11-14 04:12:59 +08:00
|
|
|
}
|
|
|
|
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
/*
|
|
|
|
* this adds the block group to the fs_info rb tree for the block group
|
|
|
|
* cache
|
|
|
|
*/
|
2008-12-02 22:54:17 +08:00
|
|
|
static int btrfs_add_block_group_cache(struct btrfs_fs_info *info,
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
struct btrfs_block_group_cache *block_group)
|
|
|
|
{
|
|
|
|
struct rb_node **p;
|
|
|
|
struct rb_node *parent = NULL;
|
|
|
|
struct btrfs_block_group_cache *cache;
|
|
|
|
|
|
|
|
spin_lock(&info->block_group_cache_lock);
|
|
|
|
p = &info->block_group_cache_tree.rb_node;
|
|
|
|
|
|
|
|
while (*p) {
|
|
|
|
parent = *p;
|
|
|
|
cache = rb_entry(parent, struct btrfs_block_group_cache,
|
|
|
|
cache_node);
|
|
|
|
if (block_group->key.objectid < cache->key.objectid) {
|
|
|
|
p = &(*p)->rb_left;
|
|
|
|
} else if (block_group->key.objectid > cache->key.objectid) {
|
|
|
|
p = &(*p)->rb_right;
|
|
|
|
} else {
|
|
|
|
spin_unlock(&info->block_group_cache_lock);
|
|
|
|
return -EEXIST;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
rb_link_node(&block_group->cache_node, parent, p);
|
|
|
|
rb_insert_color(&block_group->cache_node,
|
|
|
|
&info->block_group_cache_tree);
|
2012-12-27 17:01:23 +08:00
|
|
|
|
|
|
|
if (info->first_logical_byte > block_group->key.objectid)
|
|
|
|
info->first_logical_byte = block_group->key.objectid;
|
|
|
|
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
spin_unlock(&info->block_group_cache_lock);
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This will return the block group at or after bytenr if contains is 0, else
|
|
|
|
* it will return the block group that contains the bytenr
|
|
|
|
*/
|
|
|
|
static struct btrfs_block_group_cache *
|
|
|
|
block_group_cache_tree_search(struct btrfs_fs_info *info, u64 bytenr,
|
|
|
|
int contains)
|
|
|
|
{
|
|
|
|
struct btrfs_block_group_cache *cache, *ret = NULL;
|
|
|
|
struct rb_node *n;
|
|
|
|
u64 end, start;
|
|
|
|
|
|
|
|
spin_lock(&info->block_group_cache_lock);
|
|
|
|
n = info->block_group_cache_tree.rb_node;
|
|
|
|
|
|
|
|
while (n) {
|
|
|
|
cache = rb_entry(n, struct btrfs_block_group_cache,
|
|
|
|
cache_node);
|
|
|
|
end = cache->key.objectid + cache->key.offset - 1;
|
|
|
|
start = cache->key.objectid;
|
|
|
|
|
|
|
|
if (bytenr < start) {
|
|
|
|
if (!contains && (!ret || start < ret->key.objectid))
|
|
|
|
ret = cache;
|
|
|
|
n = n->rb_left;
|
|
|
|
} else if (bytenr > start) {
|
|
|
|
if (contains && bytenr <= end) {
|
|
|
|
ret = cache;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
n = n->rb_right;
|
|
|
|
} else {
|
|
|
|
ret = cache;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
2012-12-27 17:01:23 +08:00
|
|
|
if (ret) {
|
2009-11-14 04:12:59 +08:00
|
|
|
btrfs_get_block_group(ret);
|
2012-12-27 17:01:23 +08:00
|
|
|
if (bytenr == 0 && info->first_logical_byte > ret->key.objectid)
|
|
|
|
info->first_logical_byte = ret->key.objectid;
|
|
|
|
}
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
spin_unlock(&info->block_group_cache_lock);
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
static int add_excluded_extent(struct btrfs_root *root,
|
|
|
|
u64 start, u64 num_bytes)
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
{
|
2009-09-12 04:11:19 +08:00
|
|
|
u64 end = start + num_bytes - 1;
|
|
|
|
set_extent_bits(&root->fs_info->freed_extents[0],
|
|
|
|
start, end, EXTENT_UPTODATE, GFP_NOFS);
|
|
|
|
set_extent_bits(&root->fs_info->freed_extents[1],
|
|
|
|
start, end, EXTENT_UPTODATE, GFP_NOFS);
|
|
|
|
return 0;
|
|
|
|
}
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
static void free_excluded_extents(struct btrfs_root *root,
|
|
|
|
struct btrfs_block_group_cache *cache)
|
|
|
|
{
|
|
|
|
u64 start, end;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
start = cache->key.objectid;
|
|
|
|
end = start + cache->key.offset - 1;
|
|
|
|
|
|
|
|
clear_extent_bits(&root->fs_info->freed_extents[0],
|
|
|
|
start, end, EXTENT_UPTODATE, GFP_NOFS);
|
|
|
|
clear_extent_bits(&root->fs_info->freed_extents[1],
|
|
|
|
start, end, EXTENT_UPTODATE, GFP_NOFS);
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
}
|
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
static int exclude_super_stripes(struct btrfs_root *root,
|
|
|
|
struct btrfs_block_group_cache *cache)
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
{
|
|
|
|
u64 bytenr;
|
|
|
|
u64 *logical;
|
|
|
|
int stripe_len;
|
|
|
|
int i, nr, ret;
|
|
|
|
|
2009-11-26 17:31:11 +08:00
|
|
|
if (cache->key.objectid < BTRFS_SUPER_INFO_OFFSET) {
|
|
|
|
stripe_len = BTRFS_SUPER_INFO_OFFSET - cache->key.objectid;
|
|
|
|
cache->bytes_super += stripe_len;
|
|
|
|
ret = add_excluded_extent(root, cache->key.objectid,
|
|
|
|
stripe_len);
|
2013-03-20 00:13:25 +08:00
|
|
|
if (ret)
|
|
|
|
return ret;
|
2009-11-26 17:31:11 +08:00
|
|
|
}
|
|
|
|
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) {
|
|
|
|
bytenr = btrfs_sb_offset(i);
|
|
|
|
ret = btrfs_rmap_block(&root->fs_info->mapping_tree,
|
|
|
|
cache->key.objectid, bytenr,
|
|
|
|
0, &logical, &nr, &stripe_len);
|
2013-03-20 00:13:25 +08:00
|
|
|
if (ret)
|
|
|
|
return ret;
|
2009-09-12 04:11:19 +08:00
|
|
|
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
while (nr--) {
|
2009-09-12 04:11:20 +08:00
|
|
|
cache->bytes_super += stripe_len;
|
2009-09-12 04:11:19 +08:00
|
|
|
ret = add_excluded_extent(root, logical[nr],
|
|
|
|
stripe_len);
|
2013-03-20 00:13:25 +08:00
|
|
|
if (ret) {
|
|
|
|
kfree(logical);
|
|
|
|
return ret;
|
|
|
|
}
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
}
|
2009-09-12 04:11:19 +08:00
|
|
|
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
kfree(logical);
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
static struct btrfs_caching_control *
|
|
|
|
get_caching_control(struct btrfs_block_group_cache *cache)
|
|
|
|
{
|
|
|
|
struct btrfs_caching_control *ctl;
|
|
|
|
|
|
|
|
spin_lock(&cache->lock);
|
|
|
|
if (cache->cached != BTRFS_CACHE_STARTED) {
|
|
|
|
spin_unlock(&cache->lock);
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2010-09-17 04:17:03 +08:00
|
|
|
/* We're loading it the fast way, so we don't have a caching_ctl. */
|
|
|
|
if (!cache->caching_ctl) {
|
|
|
|
spin_unlock(&cache->lock);
|
2009-09-12 04:11:19 +08:00
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
ctl = cache->caching_ctl;
|
|
|
|
atomic_inc(&ctl->count);
|
|
|
|
spin_unlock(&cache->lock);
|
|
|
|
return ctl;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void put_caching_control(struct btrfs_caching_control *ctl)
|
|
|
|
{
|
|
|
|
if (atomic_dec_and_test(&ctl->count))
|
|
|
|
kfree(ctl);
|
|
|
|
}
|
|
|
|
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
/*
|
|
|
|
* this is only called by cache_block_group, since we could have freed extents
|
|
|
|
* we need to check the pinned_extents for any extents that can't be used yet
|
|
|
|
* since their free space will be released as soon as the transaction commits.
|
|
|
|
*/
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
static u64 add_new_free_space(struct btrfs_block_group_cache *block_group,
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
struct btrfs_fs_info *info, u64 start, u64 end)
|
|
|
|
{
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
u64 extent_start, extent_end, size, total_added = 0;
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
int ret;
|
|
|
|
|
|
|
|
while (start < end) {
|
2009-09-12 04:11:19 +08:00
|
|
|
ret = find_first_extent_bit(info->pinned_extents, start,
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
&extent_start, &extent_end,
|
2012-09-28 05:07:30 +08:00
|
|
|
EXTENT_DIRTY | EXTENT_UPTODATE,
|
|
|
|
NULL);
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
if (ret)
|
|
|
|
break;
|
|
|
|
|
2009-11-26 17:31:11 +08:00
|
|
|
if (extent_start <= start) {
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
start = extent_end + 1;
|
|
|
|
} else if (extent_start > start && extent_start < end) {
|
|
|
|
size = extent_start - start;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
total_added += size;
|
2008-11-21 01:16:16 +08:00
|
|
|
ret = btrfs_add_free_space(block_group, start,
|
|
|
|
size);
|
2012-03-12 23:03:00 +08:00
|
|
|
BUG_ON(ret); /* -ENOMEM or logic error */
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
start = extent_end + 1;
|
|
|
|
} else {
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (start < end) {
|
|
|
|
size = end - start;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
total_added += size;
|
2008-11-21 01:16:16 +08:00
|
|
|
ret = btrfs_add_free_space(block_group, start, size);
|
2012-03-12 23:03:00 +08:00
|
|
|
BUG_ON(ret); /* -ENOMEM or logic error */
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
}
|
|
|
|
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
return total_added;
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
}
|
|
|
|
|
2011-07-01 02:42:28 +08:00
|
|
|
static noinline void caching_thread(struct btrfs_work *work)
|
2007-05-10 08:13:14 +08:00
|
|
|
{
|
2011-07-01 02:42:28 +08:00
|
|
|
struct btrfs_block_group_cache *block_group;
|
|
|
|
struct btrfs_fs_info *fs_info;
|
|
|
|
struct btrfs_caching_control *caching_ctl;
|
|
|
|
struct btrfs_root *extent_root;
|
2007-05-10 08:13:14 +08:00
|
|
|
struct btrfs_path *path;
|
2007-10-16 04:14:19 +08:00
|
|
|
struct extent_buffer *leaf;
|
2009-09-12 04:11:19 +08:00
|
|
|
struct btrfs_key key;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
u64 total_found = 0;
|
2009-09-12 04:11:19 +08:00
|
|
|
u64 last = 0;
|
|
|
|
u32 nritems;
|
|
|
|
int ret = 0;
|
2007-10-16 04:14:48 +08:00
|
|
|
|
2011-07-01 02:42:28 +08:00
|
|
|
caching_ctl = container_of(work, struct btrfs_caching_control, work);
|
|
|
|
block_group = caching_ctl->block_group;
|
|
|
|
fs_info = block_group->fs_info;
|
|
|
|
extent_root = fs_info->extent_root;
|
|
|
|
|
2007-05-10 08:13:14 +08:00
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
2011-07-01 02:42:28 +08:00
|
|
|
goto out;
|
2007-09-15 04:15:28 +08:00
|
|
|
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
last = max_t(u64, block_group->key.objectid, BTRFS_SUPER_INFO_OFFSET);
|
2009-09-12 04:11:19 +08:00
|
|
|
|
2008-06-26 04:01:30 +08:00
|
|
|
/*
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
* We don't want to deadlock with somebody trying to allocate a new
|
|
|
|
* extent for the extent root while also trying to search the extent
|
|
|
|
* root to add free space. So we skip locking and search the commit
|
|
|
|
* root, since its read-only
|
2008-06-26 04:01:30 +08:00
|
|
|
*/
|
|
|
|
path->skip_locking = 1;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
path->search_commit_root = 1;
|
2011-05-13 22:32:11 +08:00
|
|
|
path->reada = 1;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
|
2008-12-12 23:03:26 +08:00
|
|
|
key.objectid = last;
|
2007-05-10 08:13:14 +08:00
|
|
|
key.offset = 0;
|
2009-09-12 04:11:19 +08:00
|
|
|
key.type = BTRFS_EXTENT_ITEM_KEY;
|
2009-08-01 02:57:55 +08:00
|
|
|
again:
|
2009-09-12 04:11:19 +08:00
|
|
|
mutex_lock(&caching_ctl->mutex);
|
2009-08-01 02:57:55 +08:00
|
|
|
/* need to make sure the commit_root doesn't disappear */
|
|
|
|
down_read(&fs_info->extent_commit_sem);
|
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
ret = btrfs_search_slot(NULL, extent_root, &key, path, 0, 0);
|
2007-05-10 08:13:14 +08:00
|
|
|
if (ret < 0)
|
2008-09-24 01:14:11 +08:00
|
|
|
goto err;
|
2008-12-09 05:46:26 +08:00
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
leaf = path->nodes[0];
|
|
|
|
nritems = btrfs_header_nritems(leaf);
|
|
|
|
|
2009-01-06 10:25:51 +08:00
|
|
|
while (1) {
|
2011-06-01 00:07:27 +08:00
|
|
|
if (btrfs_fs_closing(fs_info) > 1) {
|
2009-07-28 20:41:57 +08:00
|
|
|
last = (u64)-1;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
break;
|
2009-07-28 20:41:57 +08:00
|
|
|
}
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
if (path->slots[0] < nritems) {
|
|
|
|
btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
|
|
|
|
} else {
|
|
|
|
ret = find_next_key(path, 0, &key);
|
|
|
|
if (ret)
|
2007-05-10 08:13:14 +08:00
|
|
|
break;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
|
2011-05-12 05:30:53 +08:00
|
|
|
if (need_resched() ||
|
|
|
|
btrfs_next_leaf(extent_root, path)) {
|
|
|
|
caching_ctl->progress = last;
|
2011-05-28 19:00:39 +08:00
|
|
|
btrfs_release_path(path);
|
2011-05-12 05:30:53 +08:00
|
|
|
up_read(&fs_info->extent_commit_sem);
|
|
|
|
mutex_unlock(&caching_ctl->mutex);
|
2009-09-12 04:11:19 +08:00
|
|
|
cond_resched();
|
2011-05-12 05:30:53 +08:00
|
|
|
goto again;
|
|
|
|
}
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
nritems = btrfs_header_nritems(leaf);
|
|
|
|
continue;
|
2009-09-12 04:11:19 +08:00
|
|
|
}
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
if (key.objectid < block_group->key.objectid) {
|
|
|
|
path->slots[0]++;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
continue;
|
2007-05-10 08:13:14 +08:00
|
|
|
}
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
|
2007-05-10 08:13:14 +08:00
|
|
|
if (key.objectid >= block_group->key.objectid +
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
block_group->key.offset)
|
2007-05-10 08:13:14 +08:00
|
|
|
break;
|
2007-09-15 04:15:28 +08:00
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
if (key.type == BTRFS_EXTENT_ITEM_KEY) {
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
total_found += add_new_free_space(block_group,
|
|
|
|
fs_info, last,
|
|
|
|
key.objectid);
|
2007-09-15 04:15:28 +08:00
|
|
|
last = key.objectid + key.offset;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
if (total_found > (1024 * 1024 * 2)) {
|
|
|
|
total_found = 0;
|
|
|
|
wake_up(&caching_ctl->wait);
|
|
|
|
}
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
}
|
2007-05-10 08:13:14 +08:00
|
|
|
path->slots[0]++;
|
|
|
|
}
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
ret = 0;
|
2007-05-10 08:13:14 +08:00
|
|
|
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
total_found += add_new_free_space(block_group, fs_info, last,
|
|
|
|
block_group->key.objectid +
|
|
|
|
block_group->key.offset);
|
2009-09-12 04:11:19 +08:00
|
|
|
caching_ctl->progress = (u64)-1;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
|
|
|
|
spin_lock(&block_group->lock);
|
2009-09-12 04:11:19 +08:00
|
|
|
block_group->caching_ctl = NULL;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
block_group->cached = BTRFS_CACHE_FINISHED;
|
|
|
|
spin_unlock(&block_group->lock);
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
|
2007-06-23 02:16:25 +08:00
|
|
|
err:
|
2007-05-10 08:13:14 +08:00
|
|
|
btrfs_free_path(path);
|
2009-07-30 21:40:40 +08:00
|
|
|
up_read(&fs_info->extent_commit_sem);
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
free_excluded_extents(extent_root, block_group);
|
|
|
|
|
|
|
|
mutex_unlock(&caching_ctl->mutex);
|
2011-07-01 02:42:28 +08:00
|
|
|
out:
|
2009-09-12 04:11:19 +08:00
|
|
|
wake_up(&caching_ctl->wait);
|
|
|
|
|
|
|
|
put_caching_control(caching_ctl);
|
2009-11-14 04:12:59 +08:00
|
|
|
btrfs_put_block_group(block_group);
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
}
|
|
|
|
|
2010-08-26 04:54:15 +08:00
|
|
|
static int cache_block_group(struct btrfs_block_group_cache *cache,
|
|
|
|
int load_cache_only)
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
{
|
2011-11-15 02:52:14 +08:00
|
|
|
DEFINE_WAIT(wait);
|
2009-09-12 04:11:19 +08:00
|
|
|
struct btrfs_fs_info *fs_info = cache->fs_info;
|
|
|
|
struct btrfs_caching_control *caching_ctl;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
int ret = 0;
|
|
|
|
|
2011-11-15 02:52:14 +08:00
|
|
|
caching_ctl = kzalloc(sizeof(*caching_ctl), GFP_NOFS);
|
2012-03-12 23:03:00 +08:00
|
|
|
if (!caching_ctl)
|
|
|
|
return -ENOMEM;
|
2011-11-15 02:52:14 +08:00
|
|
|
|
|
|
|
INIT_LIST_HEAD(&caching_ctl->list);
|
|
|
|
mutex_init(&caching_ctl->mutex);
|
|
|
|
init_waitqueue_head(&caching_ctl->wait);
|
|
|
|
caching_ctl->block_group = cache;
|
|
|
|
caching_ctl->progress = cache->key.objectid;
|
|
|
|
atomic_set(&caching_ctl->count, 1);
|
|
|
|
caching_ctl->work.func = caching_thread;
|
|
|
|
|
|
|
|
spin_lock(&cache->lock);
|
|
|
|
/*
|
|
|
|
* This should be a rare occasion, but this could happen I think in the
|
|
|
|
* case where one thread starts to load the space cache info, and then
|
|
|
|
* some other thread starts a transaction commit which tries to do an
|
|
|
|
* allocation while the other thread is still loading the space cache
|
|
|
|
* info. The previous loop should have kept us from choosing this block
|
|
|
|
* group, but if we've moved to the state where we will wait on caching
|
|
|
|
* block groups we need to first check if we're doing a fast load here,
|
|
|
|
* so we can wait for it to finish, otherwise we could end up allocating
|
|
|
|
* from a block group who's cache gets evicted for one reason or
|
|
|
|
* another.
|
|
|
|
*/
|
|
|
|
while (cache->cached == BTRFS_CACHE_FAST) {
|
|
|
|
struct btrfs_caching_control *ctl;
|
|
|
|
|
|
|
|
ctl = cache->caching_ctl;
|
|
|
|
atomic_inc(&ctl->count);
|
|
|
|
prepare_to_wait(&ctl->wait, &wait, TASK_UNINTERRUPTIBLE);
|
|
|
|
spin_unlock(&cache->lock);
|
|
|
|
|
|
|
|
schedule();
|
|
|
|
|
|
|
|
finish_wait(&ctl->wait, &wait);
|
|
|
|
put_caching_control(ctl);
|
|
|
|
spin_lock(&cache->lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (cache->cached != BTRFS_CACHE_NO) {
|
|
|
|
spin_unlock(&cache->lock);
|
|
|
|
kfree(caching_ctl);
|
2009-09-12 04:11:19 +08:00
|
|
|
return 0;
|
2011-11-15 02:52:14 +08:00
|
|
|
}
|
|
|
|
WARN_ON(cache->caching_ctl);
|
|
|
|
cache->caching_ctl = caching_ctl;
|
|
|
|
cache->cached = BTRFS_CACHE_FAST;
|
|
|
|
spin_unlock(&cache->lock);
|
2009-09-12 04:11:19 +08:00
|
|
|
|
2012-04-13 04:03:57 +08:00
|
|
|
if (fs_info->mount_opt & BTRFS_MOUNT_SPACE_CACHE) {
|
2010-08-26 04:54:15 +08:00
|
|
|
ret = load_free_space_cache(fs_info, cache);
|
|
|
|
|
|
|
|
spin_lock(&cache->lock);
|
|
|
|
if (ret == 1) {
|
2011-11-15 02:52:14 +08:00
|
|
|
cache->caching_ctl = NULL;
|
2010-08-26 04:54:15 +08:00
|
|
|
cache->cached = BTRFS_CACHE_FINISHED;
|
|
|
|
cache->last_byte_to_unpin = (u64)-1;
|
|
|
|
} else {
|
2011-11-15 02:52:14 +08:00
|
|
|
if (load_cache_only) {
|
|
|
|
cache->caching_ctl = NULL;
|
|
|
|
cache->cached = BTRFS_CACHE_NO;
|
|
|
|
} else {
|
|
|
|
cache->cached = BTRFS_CACHE_STARTED;
|
|
|
|
}
|
2010-08-26 04:54:15 +08:00
|
|
|
}
|
|
|
|
spin_unlock(&cache->lock);
|
2011-11-15 02:52:14 +08:00
|
|
|
wake_up(&caching_ctl->wait);
|
2011-02-02 23:53:47 +08:00
|
|
|
if (ret == 1) {
|
2011-11-15 02:52:14 +08:00
|
|
|
put_caching_control(caching_ctl);
|
2011-02-02 23:53:47 +08:00
|
|
|
free_excluded_extents(fs_info->extent_root, cache);
|
2010-08-26 04:54:15 +08:00
|
|
|
return 0;
|
2011-02-02 23:53:47 +08:00
|
|
|
}
|
2011-11-15 02:52:14 +08:00
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* We are not going to do the fast caching, set cached to the
|
|
|
|
* appropriate value and wakeup any waiters.
|
|
|
|
*/
|
|
|
|
spin_lock(&cache->lock);
|
|
|
|
if (load_cache_only) {
|
|
|
|
cache->caching_ctl = NULL;
|
|
|
|
cache->cached = BTRFS_CACHE_NO;
|
|
|
|
} else {
|
|
|
|
cache->cached = BTRFS_CACHE_STARTED;
|
|
|
|
}
|
|
|
|
spin_unlock(&cache->lock);
|
|
|
|
wake_up(&caching_ctl->wait);
|
2010-08-26 04:54:15 +08:00
|
|
|
}
|
|
|
|
|
2011-11-15 02:52:14 +08:00
|
|
|
if (load_cache_only) {
|
|
|
|
put_caching_control(caching_ctl);
|
2009-09-12 04:11:19 +08:00
|
|
|
return 0;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
}
|
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
down_write(&fs_info->extent_commit_sem);
|
2011-11-15 02:52:14 +08:00
|
|
|
atomic_inc(&caching_ctl->count);
|
2009-09-12 04:11:19 +08:00
|
|
|
list_add_tail(&caching_ctl->list, &fs_info->caching_block_groups);
|
|
|
|
up_write(&fs_info->extent_commit_sem);
|
|
|
|
|
2009-11-14 04:12:59 +08:00
|
|
|
btrfs_get_block_group(cache);
|
2009-09-12 04:11:19 +08:00
|
|
|
|
2011-07-01 02:42:28 +08:00
|
|
|
btrfs_queue_worker(&fs_info->caching_workers, &caching_ctl->work);
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
|
2008-09-24 01:14:11 +08:00
|
|
|
return ret;
|
2007-05-10 08:13:14 +08:00
|
|
|
}
|
|
|
|
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
/*
|
|
|
|
* return the block group that starts at or after bytenr
|
|
|
|
*/
|
2009-01-06 10:25:51 +08:00
|
|
|
static struct btrfs_block_group_cache *
|
|
|
|
btrfs_lookup_first_block_group(struct btrfs_fs_info *info, u64 bytenr)
|
2008-05-25 02:04:53 +08:00
|
|
|
{
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
struct btrfs_block_group_cache *cache;
|
2008-05-25 02:04:53 +08:00
|
|
|
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
cache = block_group_cache_tree_search(info, bytenr, 0);
|
2008-05-25 02:04:53 +08:00
|
|
|
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
return cache;
|
2008-05-25 02:04:53 +08:00
|
|
|
}
|
|
|
|
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
/*
|
2009-05-15 01:52:22 +08:00
|
|
|
* return the block group that contains the given bytenr
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
*/
|
2009-01-06 10:25:51 +08:00
|
|
|
struct btrfs_block_group_cache *btrfs_lookup_block_group(
|
|
|
|
struct btrfs_fs_info *info,
|
|
|
|
u64 bytenr)
|
2007-05-06 22:15:01 +08:00
|
|
|
{
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
struct btrfs_block_group_cache *cache;
|
2007-05-06 22:15:01 +08:00
|
|
|
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
cache = block_group_cache_tree_search(info, bytenr, 1);
|
2007-10-16 04:15:19 +08:00
|
|
|
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
return cache;
|
2007-05-06 22:15:01 +08:00
|
|
|
}
|
2008-03-25 03:01:56 +08:00
|
|
|
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
static struct btrfs_space_info *__find_space_info(struct btrfs_fs_info *info,
|
|
|
|
u64 flags)
|
2008-03-25 03:01:59 +08:00
|
|
|
{
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
struct list_head *head = &info->space_info;
|
|
|
|
struct btrfs_space_info *found;
|
2009-03-11 00:39:20 +08:00
|
|
|
|
2012-01-17 04:04:47 +08:00
|
|
|
flags &= BTRFS_BLOCK_GROUP_TYPE_MASK;
|
2010-05-16 22:46:24 +08:00
|
|
|
|
2009-03-11 00:39:20 +08:00
|
|
|
rcu_read_lock();
|
|
|
|
list_for_each_entry_rcu(found, head, list) {
|
2010-09-17 04:19:09 +08:00
|
|
|
if (found->flags & flags) {
|
2009-03-11 00:39:20 +08:00
|
|
|
rcu_read_unlock();
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
return found;
|
2009-03-11 00:39:20 +08:00
|
|
|
}
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
}
|
2009-03-11 00:39:20 +08:00
|
|
|
rcu_read_unlock();
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
return NULL;
|
2008-03-25 03:01:59 +08:00
|
|
|
}
|
|
|
|
|
2009-03-11 00:39:20 +08:00
|
|
|
/*
|
|
|
|
* after adding space to the filesystem, we need to clear the full flags
|
|
|
|
* on all the space infos.
|
|
|
|
*/
|
|
|
|
void btrfs_clear_space_info_full(struct btrfs_fs_info *info)
|
|
|
|
{
|
|
|
|
struct list_head *head = &info->space_info;
|
|
|
|
struct btrfs_space_info *found;
|
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
list_for_each_entry_rcu(found, head, list)
|
|
|
|
found->full = 0;
|
|
|
|
rcu_read_unlock();
|
|
|
|
}
|
|
|
|
|
2008-12-12 05:30:39 +08:00
|
|
|
u64 btrfs_find_block_group(struct btrfs_root *root,
|
|
|
|
u64 search_start, u64 search_hint, int owner)
|
2007-04-27 22:08:34 +08:00
|
|
|
{
|
2007-10-16 04:15:19 +08:00
|
|
|
struct btrfs_block_group_cache *cache;
|
2007-04-27 22:08:34 +08:00
|
|
|
u64 used;
|
2008-12-12 05:30:39 +08:00
|
|
|
u64 last = max(search_hint, search_start);
|
|
|
|
u64 group_start = 0;
|
2007-05-01 03:25:45 +08:00
|
|
|
int full_search = 0;
|
2008-12-12 05:30:39 +08:00
|
|
|
int factor = 9;
|
2008-05-25 02:04:53 +08:00
|
|
|
int wrapped = 0;
|
2007-05-01 03:25:45 +08:00
|
|
|
again:
|
2008-09-26 22:05:48 +08:00
|
|
|
while (1) {
|
|
|
|
cache = btrfs_lookup_first_block_group(root->fs_info, last);
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
if (!cache)
|
|
|
|
break;
|
2007-10-16 04:15:19 +08:00
|
|
|
|
2008-07-23 11:06:41 +08:00
|
|
|
spin_lock(&cache->lock);
|
2007-10-16 04:15:19 +08:00
|
|
|
last = cache->key.objectid + cache->key.offset;
|
|
|
|
used = btrfs_block_group_used(&cache->item);
|
|
|
|
|
2008-12-12 05:30:39 +08:00
|
|
|
if ((full_search || !cache->ro) &&
|
|
|
|
block_group_bits(cache, BTRFS_BLOCK_GROUP_METADATA)) {
|
2008-09-26 22:05:48 +08:00
|
|
|
if (used + cache->pinned + cache->reserved <
|
2008-12-12 05:30:39 +08:00
|
|
|
div_factor(cache->key.offset, factor)) {
|
|
|
|
group_start = cache->key.objectid;
|
2008-07-23 11:06:41 +08:00
|
|
|
spin_unlock(&cache->lock);
|
2009-04-03 21:47:43 +08:00
|
|
|
btrfs_put_block_group(cache);
|
2008-04-04 04:29:03 +08:00
|
|
|
goto found;
|
|
|
|
}
|
2008-03-25 03:01:59 +08:00
|
|
|
}
|
2008-07-23 11:06:41 +08:00
|
|
|
spin_unlock(&cache->lock);
|
2009-04-03 21:47:43 +08:00
|
|
|
btrfs_put_block_group(cache);
|
2007-05-19 01:28:27 +08:00
|
|
|
cond_resched();
|
2007-04-27 22:08:34 +08:00
|
|
|
}
|
2008-05-25 02:04:53 +08:00
|
|
|
if (!wrapped) {
|
|
|
|
last = search_start;
|
|
|
|
wrapped = 1;
|
|
|
|
goto again;
|
|
|
|
}
|
|
|
|
if (!full_search && factor < 10) {
|
2007-05-06 22:15:01 +08:00
|
|
|
last = search_start;
|
2007-05-01 03:25:45 +08:00
|
|
|
full_search = 1;
|
2008-05-25 02:04:53 +08:00
|
|
|
factor = 10;
|
2007-05-01 03:25:45 +08:00
|
|
|
goto again;
|
|
|
|
}
|
2007-05-06 22:15:01 +08:00
|
|
|
found:
|
2008-12-12 05:30:39 +08:00
|
|
|
return group_start;
|
2008-06-26 04:01:30 +08:00
|
|
|
}
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
|
2008-09-06 04:13:11 +08:00
|
|
|
/* simple helper to search for an existing extent at a given offset */
|
2008-09-24 01:14:14 +08:00
|
|
|
int btrfs_lookup_extent(struct btrfs_root *root, u64 start, u64 len)
|
2008-09-06 04:13:11 +08:00
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
struct btrfs_key key;
|
2008-09-24 01:14:14 +08:00
|
|
|
struct btrfs_path *path;
|
2008-09-06 04:13:11 +08:00
|
|
|
|
2008-09-24 01:14:14 +08:00
|
|
|
path = btrfs_alloc_path();
|
btrfs: don't BUG_ON btrfs_alloc_path() errors
This patch fixes many callers of btrfs_alloc_path() which BUG_ON allocation
failure. All the sites that are fixed in this patch were checked by me to
be fairly trivial to fix because of at least one of two criteria:
- Callers of the function catch errors from it already so bubbling the
error up will be handled.
- Callers of the function might BUG_ON any nonzero return code in which
case there is no behavior changed (but we still got to remove a BUG_ON)
The following functions were updated:
btrfs_lookup_extent, alloc_reserved_tree_block, btrfs_remove_block_group,
btrfs_lookup_csums_range, btrfs_csum_file_blocks, btrfs_mark_extent_written,
btrfs_inode_by_name, btrfs_new_inode, btrfs_symlink,
insert_reserved_file_extent, and run_delalloc_nocow
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2011-07-14 01:38:47 +08:00
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2008-09-06 04:13:11 +08:00
|
|
|
key.objectid = start;
|
|
|
|
key.offset = len;
|
|
|
|
btrfs_set_key_type(&key, BTRFS_EXTENT_ITEM_KEY);
|
|
|
|
ret = btrfs_search_slot(NULL, root->fs_info->extent_root, &key, path,
|
|
|
|
0, 0);
|
2008-09-24 01:14:14 +08:00
|
|
|
btrfs_free_path(path);
|
2007-12-11 22:25:06 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2010-05-16 22:48:46 +08:00
|
|
|
/*
|
|
|
|
* helper function to lookup reference count and flags of extent.
|
|
|
|
*
|
|
|
|
* the head node for delayed ref is used to store the sum of all the
|
|
|
|
* reference count modifications queued up in the rbtree. the head
|
|
|
|
* node may also store the extent flags to set. This way you can check
|
|
|
|
* to see what the reference count and extent flags would be if all of
|
|
|
|
* the delayed refs are not processed.
|
|
|
|
*/
|
|
|
|
int btrfs_lookup_extent_info(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root, u64 bytenr,
|
|
|
|
u64 num_bytes, u64 *refs, u64 *flags)
|
|
|
|
{
|
|
|
|
struct btrfs_delayed_ref_head *head;
|
|
|
|
struct btrfs_delayed_ref_root *delayed_refs;
|
|
|
|
struct btrfs_path *path;
|
|
|
|
struct btrfs_extent_item *ei;
|
|
|
|
struct extent_buffer *leaf;
|
|
|
|
struct btrfs_key key;
|
|
|
|
u32 item_size;
|
|
|
|
u64 num_refs;
|
|
|
|
u64 extent_flags;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
key.objectid = bytenr;
|
|
|
|
key.type = BTRFS_EXTENT_ITEM_KEY;
|
|
|
|
key.offset = num_bytes;
|
|
|
|
if (!trans) {
|
|
|
|
path->skip_locking = 1;
|
|
|
|
path->search_commit_root = 1;
|
|
|
|
}
|
|
|
|
again:
|
|
|
|
ret = btrfs_search_slot(trans, root->fs_info->extent_root,
|
|
|
|
&key, path, 0, 0);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out_free;
|
|
|
|
|
|
|
|
if (ret == 0) {
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
item_size = btrfs_item_size_nr(leaf, path->slots[0]);
|
|
|
|
if (item_size >= sizeof(*ei)) {
|
|
|
|
ei = btrfs_item_ptr(leaf, path->slots[0],
|
|
|
|
struct btrfs_extent_item);
|
|
|
|
num_refs = btrfs_extent_refs(leaf, ei);
|
|
|
|
extent_flags = btrfs_extent_flags(leaf, ei);
|
|
|
|
} else {
|
|
|
|
#ifdef BTRFS_COMPAT_EXTENT_TREE_V0
|
|
|
|
struct btrfs_extent_item_v0 *ei0;
|
|
|
|
BUG_ON(item_size != sizeof(*ei0));
|
|
|
|
ei0 = btrfs_item_ptr(leaf, path->slots[0],
|
|
|
|
struct btrfs_extent_item_v0);
|
|
|
|
num_refs = btrfs_extent_refs_v0(leaf, ei0);
|
|
|
|
/* FIXME: this isn't correct for data */
|
|
|
|
extent_flags = BTRFS_BLOCK_FLAG_FULL_BACKREF;
|
|
|
|
#else
|
|
|
|
BUG();
|
|
|
|
#endif
|
|
|
|
}
|
|
|
|
BUG_ON(num_refs == 0);
|
|
|
|
} else {
|
|
|
|
num_refs = 0;
|
|
|
|
extent_flags = 0;
|
|
|
|
ret = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!trans)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
delayed_refs = &trans->transaction->delayed_refs;
|
|
|
|
spin_lock(&delayed_refs->lock);
|
|
|
|
head = btrfs_find_delayed_ref_head(trans, bytenr);
|
|
|
|
if (head) {
|
|
|
|
if (!mutex_trylock(&head->mutex)) {
|
|
|
|
atomic_inc(&head->node.refs);
|
|
|
|
spin_unlock(&delayed_refs->lock);
|
|
|
|
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
2010-05-16 22:48:46 +08:00
|
|
|
|
2011-05-02 21:29:25 +08:00
|
|
|
/*
|
|
|
|
* Mutex was contended, block until it's released and try
|
|
|
|
* again
|
|
|
|
*/
|
2010-05-16 22:48:46 +08:00
|
|
|
mutex_lock(&head->mutex);
|
|
|
|
mutex_unlock(&head->mutex);
|
|
|
|
btrfs_put_delayed_ref(&head->node);
|
|
|
|
goto again;
|
|
|
|
}
|
|
|
|
if (head->extent_op && head->extent_op->update_flags)
|
|
|
|
extent_flags |= head->extent_op->flags_to_set;
|
|
|
|
else
|
|
|
|
BUG_ON(num_refs == 0);
|
|
|
|
|
|
|
|
num_refs += head->node.ref_mod;
|
|
|
|
mutex_unlock(&head->mutex);
|
|
|
|
}
|
|
|
|
spin_unlock(&delayed_refs->lock);
|
|
|
|
out:
|
|
|
|
WARN_ON(num_refs == 0);
|
|
|
|
if (refs)
|
|
|
|
*refs = num_refs;
|
|
|
|
if (flags)
|
|
|
|
*flags = extent_flags;
|
|
|
|
out_free:
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2007-12-12 01:42:00 +08:00
|
|
|
/*
|
|
|
|
* Back reference rules. Back refs have three main goals:
|
|
|
|
*
|
|
|
|
* 1) differentiate between all holders of references to an extent so that
|
|
|
|
* when a reference is dropped we can make sure it was a valid reference
|
|
|
|
* before freeing the extent.
|
|
|
|
*
|
|
|
|
* 2) Provide enough information to quickly find the holders of an extent
|
|
|
|
* if we notice a given block is corrupted or bad.
|
|
|
|
*
|
|
|
|
* 3) Make it easy to migrate blocks for FS shrinking or storage pool
|
|
|
|
* maintenance. This is actually the same as #2, but with a slightly
|
|
|
|
* different use case.
|
|
|
|
*
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
* There are two kinds of back refs. The implicit back refs is optimized
|
|
|
|
* for pointers in non-shared tree blocks. For a given pointer in a block,
|
|
|
|
* back refs of this kind provide information about the block's owner tree
|
|
|
|
* and the pointer's key. These information allow us to find the block by
|
|
|
|
* b-tree searching. The full back refs is for pointers in tree blocks not
|
|
|
|
* referenced by their owner trees. The location of tree block is recorded
|
|
|
|
* in the back refs. Actually the full back refs is generic, and can be
|
|
|
|
* used in all cases the implicit back refs is used. The major shortcoming
|
|
|
|
* of the full back refs is its overhead. Every time a tree block gets
|
|
|
|
* COWed, we have to update back refs entry for all pointers in it.
|
|
|
|
*
|
|
|
|
* For a newly allocated tree block, we use implicit back refs for
|
|
|
|
* pointers in it. This means most tree related operations only involve
|
|
|
|
* implicit back refs. For a tree block created in old transaction, the
|
|
|
|
* only way to drop a reference to it is COW it. So we can detect the
|
|
|
|
* event that tree block loses its owner tree's reference and do the
|
|
|
|
* back refs conversion.
|
|
|
|
*
|
|
|
|
* When a tree block is COW'd through a tree, there are four cases:
|
|
|
|
*
|
|
|
|
* The reference count of the block is one and the tree is the block's
|
|
|
|
* owner tree. Nothing to do in this case.
|
|
|
|
*
|
|
|
|
* The reference count of the block is one and the tree is not the
|
|
|
|
* block's owner tree. In this case, full back refs is used for pointers
|
|
|
|
* in the block. Remove these full back refs, add implicit back refs for
|
|
|
|
* every pointers in the new block.
|
|
|
|
*
|
|
|
|
* The reference count of the block is greater than one and the tree is
|
|
|
|
* the block's owner tree. In this case, implicit back refs is used for
|
|
|
|
* pointers in the block. Add full back refs for every pointers in the
|
|
|
|
* block, increase lower level extents' reference counts. The original
|
|
|
|
* implicit back refs are entailed to the new block.
|
|
|
|
*
|
|
|
|
* The reference count of the block is greater than one and the tree is
|
|
|
|
* not the block's owner tree. Add implicit back refs for every pointer in
|
|
|
|
* the new block, increase lower level extents' reference count.
|
|
|
|
*
|
|
|
|
* Back Reference Key composing:
|
|
|
|
*
|
|
|
|
* The key objectid corresponds to the first byte in the extent,
|
|
|
|
* The key type is used to differentiate between types of back refs.
|
|
|
|
* There are different meanings of the key offset for different types
|
|
|
|
* of back refs.
|
|
|
|
*
|
2007-12-12 01:42:00 +08:00
|
|
|
* File extents can be referenced by:
|
|
|
|
*
|
|
|
|
* - multiple snapshots, subvolumes, or different generations in one subvol
|
2008-09-24 01:14:14 +08:00
|
|
|
* - different files inside a single subvolume
|
2007-12-12 01:42:00 +08:00
|
|
|
* - different offsets inside a file (bookend extents in file.c)
|
|
|
|
*
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
* The extent ref structure for the implicit back refs has fields for:
|
2007-12-12 01:42:00 +08:00
|
|
|
*
|
|
|
|
* - Objectid of the subvolume root
|
|
|
|
* - objectid of the file holding the reference
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
* - original offset in the file
|
|
|
|
* - how many bookend extents
|
2007-12-12 01:42:00 +08:00
|
|
|
*
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
* The key offset for the implicit back refs is hash of the first
|
|
|
|
* three fields.
|
2007-12-12 01:42:00 +08:00
|
|
|
*
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
* The extent ref structure for the full back refs has field for:
|
2007-12-12 01:42:00 +08:00
|
|
|
*
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
* - number of pointers in the tree leaf
|
2007-12-12 01:42:00 +08:00
|
|
|
*
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
* The key offset for the implicit back refs is the first byte of
|
|
|
|
* the tree leaf
|
2007-12-12 01:42:00 +08:00
|
|
|
*
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
* When a file extent is allocated, The implicit back refs is used.
|
|
|
|
* the fields are filled in:
|
2007-12-12 01:42:00 +08:00
|
|
|
*
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
* (root_key.objectid, inode objectid, offset in file, 1)
|
2007-12-12 01:42:00 +08:00
|
|
|
*
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
* When a file extent is removed file truncation, we find the
|
|
|
|
* corresponding implicit back refs and check the following fields:
|
2007-12-12 01:42:00 +08:00
|
|
|
*
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
* (btrfs_header_owner(leaf), inode objectid, offset in file)
|
2007-12-12 01:42:00 +08:00
|
|
|
*
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
* Btree extents can be referenced by:
|
2007-12-12 01:42:00 +08:00
|
|
|
*
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
* - Different subvolumes
|
2007-12-12 01:42:00 +08:00
|
|
|
*
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
* Both the implicit back refs and the full back refs for tree blocks
|
|
|
|
* only consist of key. The key offset for the implicit back refs is
|
|
|
|
* objectid of block's owner tree. The key offset for the full back refs
|
|
|
|
* is the first byte of parent block.
|
2007-12-12 01:42:00 +08:00
|
|
|
*
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
* When implicit back refs is used, information about the lowest key and
|
|
|
|
* level of the tree block are required. These information are stored in
|
|
|
|
* tree block info structure.
|
2007-12-12 01:42:00 +08:00
|
|
|
*/
|
2008-09-24 01:14:14 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
#ifdef BTRFS_COMPAT_EXTENT_TREE_V0
|
|
|
|
static int convert_extent_item_v0(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
u64 owner, u32 extra_size)
|
2007-12-11 22:25:06 +08:00
|
|
|
{
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
struct btrfs_extent_item *item;
|
|
|
|
struct btrfs_extent_item_v0 *ei0;
|
|
|
|
struct btrfs_extent_ref_v0 *ref0;
|
|
|
|
struct btrfs_tree_block_info *bi;
|
|
|
|
struct extent_buffer *leaf;
|
2007-12-11 22:25:06 +08:00
|
|
|
struct btrfs_key key;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
struct btrfs_key found_key;
|
|
|
|
u32 new_size = sizeof(*item);
|
|
|
|
u64 refs;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
BUG_ON(btrfs_item_size_nr(leaf, path->slots[0]) != sizeof(*ei0));
|
|
|
|
|
|
|
|
btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
|
|
|
|
ei0 = btrfs_item_ptr(leaf, path->slots[0],
|
|
|
|
struct btrfs_extent_item_v0);
|
|
|
|
refs = btrfs_extent_refs_v0(leaf, ei0);
|
|
|
|
|
|
|
|
if (owner == (u64)-1) {
|
|
|
|
while (1) {
|
|
|
|
if (path->slots[0] >= btrfs_header_nritems(leaf)) {
|
|
|
|
ret = btrfs_next_leaf(root, path);
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
2012-03-12 23:03:00 +08:00
|
|
|
BUG_ON(ret > 0); /* Corruption */
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
leaf = path->nodes[0];
|
|
|
|
}
|
|
|
|
btrfs_item_key_to_cpu(leaf, &found_key,
|
|
|
|
path->slots[0]);
|
|
|
|
BUG_ON(key.objectid != found_key.objectid);
|
|
|
|
if (found_key.type != BTRFS_EXTENT_REF_V0_KEY) {
|
|
|
|
path->slots[0]++;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
ref0 = btrfs_item_ptr(leaf, path->slots[0],
|
|
|
|
struct btrfs_extent_ref_v0);
|
|
|
|
owner = btrfs_ref_objectid_v0(leaf, ref0);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
|
|
|
|
if (owner < BTRFS_FIRST_FREE_OBJECTID)
|
|
|
|
new_size += sizeof(*bi);
|
|
|
|
|
|
|
|
new_size -= sizeof(*ei0);
|
|
|
|
ret = btrfs_search_slot(trans, root, &key, path,
|
|
|
|
new_size + extra_size, 1);
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
2012-03-12 23:03:00 +08:00
|
|
|
BUG_ON(ret); /* Corruption */
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
|
2012-03-01 21:56:26 +08:00
|
|
|
btrfs_extend_item(trans, root, path, new_size);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
item = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_extent_item);
|
|
|
|
btrfs_set_extent_refs(leaf, item, refs);
|
|
|
|
/* FIXME: get real generation */
|
|
|
|
btrfs_set_extent_generation(leaf, item, 0);
|
|
|
|
if (owner < BTRFS_FIRST_FREE_OBJECTID) {
|
|
|
|
btrfs_set_extent_flags(leaf, item,
|
|
|
|
BTRFS_EXTENT_FLAG_TREE_BLOCK |
|
|
|
|
BTRFS_BLOCK_FLAG_FULL_BACKREF);
|
|
|
|
bi = (struct btrfs_tree_block_info *)(item + 1);
|
|
|
|
/* FIXME: get first key of the block */
|
|
|
|
memset_extent_buffer(leaf, 0, (unsigned long)bi, sizeof(*bi));
|
|
|
|
btrfs_set_tree_block_level(leaf, bi, (int)owner);
|
|
|
|
} else {
|
|
|
|
btrfs_set_extent_flags(leaf, item, BTRFS_EXTENT_FLAG_DATA);
|
|
|
|
}
|
|
|
|
btrfs_mark_buffer_dirty(leaf);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
static u64 hash_extent_data_ref(u64 root_objectid, u64 owner, u64 offset)
|
|
|
|
{
|
|
|
|
u32 high_crc = ~(u32)0;
|
|
|
|
u32 low_crc = ~(u32)0;
|
|
|
|
__le64 lenum;
|
|
|
|
|
|
|
|
lenum = cpu_to_le64(root_objectid);
|
2009-04-19 20:02:41 +08:00
|
|
|
high_crc = crc32c(high_crc, &lenum, sizeof(lenum));
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
lenum = cpu_to_le64(owner);
|
2009-04-19 20:02:41 +08:00
|
|
|
low_crc = crc32c(low_crc, &lenum, sizeof(lenum));
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
lenum = cpu_to_le64(offset);
|
2009-04-19 20:02:41 +08:00
|
|
|
low_crc = crc32c(low_crc, &lenum, sizeof(lenum));
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
|
|
|
|
return ((u64)high_crc << 31) ^ (u64)low_crc;
|
|
|
|
}
|
|
|
|
|
|
|
|
static u64 hash_extent_data_ref_item(struct extent_buffer *leaf,
|
|
|
|
struct btrfs_extent_data_ref *ref)
|
|
|
|
{
|
|
|
|
return hash_extent_data_ref(btrfs_extent_data_ref_root(leaf, ref),
|
|
|
|
btrfs_extent_data_ref_objectid(leaf, ref),
|
|
|
|
btrfs_extent_data_ref_offset(leaf, ref));
|
|
|
|
}
|
|
|
|
|
|
|
|
static int match_extent_data_ref(struct extent_buffer *leaf,
|
|
|
|
struct btrfs_extent_data_ref *ref,
|
|
|
|
u64 root_objectid, u64 owner, u64 offset)
|
|
|
|
{
|
|
|
|
if (btrfs_extent_data_ref_root(leaf, ref) != root_objectid ||
|
|
|
|
btrfs_extent_data_ref_objectid(leaf, ref) != owner ||
|
|
|
|
btrfs_extent_data_ref_offset(leaf, ref) != offset)
|
|
|
|
return 0;
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
static noinline int lookup_extent_data_ref(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
u64 bytenr, u64 parent,
|
|
|
|
u64 root_objectid,
|
|
|
|
u64 owner, u64 offset)
|
|
|
|
{
|
|
|
|
struct btrfs_key key;
|
|
|
|
struct btrfs_extent_data_ref *ref;
|
2008-09-24 01:14:14 +08:00
|
|
|
struct extent_buffer *leaf;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
u32 nritems;
|
2007-12-11 22:25:06 +08:00
|
|
|
int ret;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
int recow;
|
|
|
|
int err = -ENOENT;
|
2007-12-11 22:25:06 +08:00
|
|
|
|
2008-09-24 01:14:14 +08:00
|
|
|
key.objectid = bytenr;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (parent) {
|
|
|
|
key.type = BTRFS_SHARED_DATA_REF_KEY;
|
|
|
|
key.offset = parent;
|
|
|
|
} else {
|
|
|
|
key.type = BTRFS_EXTENT_DATA_REF_KEY;
|
|
|
|
key.offset = hash_extent_data_ref(root_objectid,
|
|
|
|
owner, offset);
|
|
|
|
}
|
|
|
|
again:
|
|
|
|
recow = 0;
|
|
|
|
ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
|
|
|
|
if (ret < 0) {
|
|
|
|
err = ret;
|
|
|
|
goto fail;
|
|
|
|
}
|
2008-09-24 01:14:14 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (parent) {
|
|
|
|
if (!ret)
|
|
|
|
return 0;
|
|
|
|
#ifdef BTRFS_COMPAT_EXTENT_TREE_V0
|
|
|
|
key.type = BTRFS_EXTENT_REF_V0_KEY;
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
|
|
|
|
if (ret < 0) {
|
|
|
|
err = ret;
|
|
|
|
goto fail;
|
|
|
|
}
|
|
|
|
if (!ret)
|
|
|
|
return 0;
|
|
|
|
#endif
|
|
|
|
goto fail;
|
2008-09-24 01:14:14 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
nritems = btrfs_header_nritems(leaf);
|
|
|
|
while (1) {
|
|
|
|
if (path->slots[0] >= nritems) {
|
|
|
|
ret = btrfs_next_leaf(root, path);
|
|
|
|
if (ret < 0)
|
|
|
|
err = ret;
|
|
|
|
if (ret)
|
|
|
|
goto fail;
|
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
nritems = btrfs_header_nritems(leaf);
|
|
|
|
recow = 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
|
|
|
|
if (key.objectid != bytenr ||
|
|
|
|
key.type != BTRFS_EXTENT_DATA_REF_KEY)
|
|
|
|
goto fail;
|
|
|
|
|
|
|
|
ref = btrfs_item_ptr(leaf, path->slots[0],
|
|
|
|
struct btrfs_extent_data_ref);
|
|
|
|
|
|
|
|
if (match_extent_data_ref(leaf, ref, root_objectid,
|
|
|
|
owner, offset)) {
|
|
|
|
if (recow) {
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
goto again;
|
|
|
|
}
|
|
|
|
err = 0;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
path->slots[0]++;
|
2008-09-24 01:14:14 +08:00
|
|
|
}
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
fail:
|
|
|
|
return err;
|
2008-09-24 01:14:14 +08:00
|
|
|
}
|
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
static noinline int insert_extent_data_ref(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
u64 bytenr, u64 parent,
|
|
|
|
u64 root_objectid, u64 owner,
|
|
|
|
u64 offset, int refs_to_add)
|
2008-09-24 01:14:14 +08:00
|
|
|
{
|
|
|
|
struct btrfs_key key;
|
|
|
|
struct extent_buffer *leaf;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
u32 size;
|
2008-09-24 01:14:14 +08:00
|
|
|
u32 num_refs;
|
|
|
|
int ret;
|
2007-12-11 22:25:06 +08:00
|
|
|
|
|
|
|
key.objectid = bytenr;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (parent) {
|
|
|
|
key.type = BTRFS_SHARED_DATA_REF_KEY;
|
|
|
|
key.offset = parent;
|
|
|
|
size = sizeof(struct btrfs_shared_data_ref);
|
|
|
|
} else {
|
|
|
|
key.type = BTRFS_EXTENT_DATA_REF_KEY;
|
|
|
|
key.offset = hash_extent_data_ref(root_objectid,
|
|
|
|
owner, offset);
|
|
|
|
size = sizeof(struct btrfs_extent_data_ref);
|
|
|
|
}
|
2007-12-11 22:25:06 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
ret = btrfs_insert_empty_item(trans, root, path, &key, size);
|
|
|
|
if (ret && ret != -EEXIST)
|
|
|
|
goto fail;
|
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
if (parent) {
|
|
|
|
struct btrfs_shared_data_ref *ref;
|
2008-09-24 01:14:14 +08:00
|
|
|
ref = btrfs_item_ptr(leaf, path->slots[0],
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
struct btrfs_shared_data_ref);
|
|
|
|
if (ret == 0) {
|
|
|
|
btrfs_set_shared_data_ref_count(leaf, ref, refs_to_add);
|
|
|
|
} else {
|
|
|
|
num_refs = btrfs_shared_data_ref_count(leaf, ref);
|
|
|
|
num_refs += refs_to_add;
|
|
|
|
btrfs_set_shared_data_ref_count(leaf, ref, num_refs);
|
2008-09-24 01:14:14 +08:00
|
|
|
}
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
} else {
|
|
|
|
struct btrfs_extent_data_ref *ref;
|
|
|
|
while (ret == -EEXIST) {
|
|
|
|
ref = btrfs_item_ptr(leaf, path->slots[0],
|
|
|
|
struct btrfs_extent_data_ref);
|
|
|
|
if (match_extent_data_ref(leaf, ref, root_objectid,
|
|
|
|
owner, offset))
|
|
|
|
break;
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
key.offset++;
|
|
|
|
ret = btrfs_insert_empty_item(trans, root, path, &key,
|
|
|
|
size);
|
|
|
|
if (ret && ret != -EEXIST)
|
|
|
|
goto fail;
|
2008-09-24 01:14:14 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
leaf = path->nodes[0];
|
|
|
|
}
|
|
|
|
ref = btrfs_item_ptr(leaf, path->slots[0],
|
|
|
|
struct btrfs_extent_data_ref);
|
|
|
|
if (ret == 0) {
|
|
|
|
btrfs_set_extent_data_ref_root(leaf, ref,
|
|
|
|
root_objectid);
|
|
|
|
btrfs_set_extent_data_ref_objectid(leaf, ref, owner);
|
|
|
|
btrfs_set_extent_data_ref_offset(leaf, ref, offset);
|
|
|
|
btrfs_set_extent_data_ref_count(leaf, ref, refs_to_add);
|
|
|
|
} else {
|
|
|
|
num_refs = btrfs_extent_data_ref_count(leaf, ref);
|
|
|
|
num_refs += refs_to_add;
|
|
|
|
btrfs_set_extent_data_ref_count(leaf, ref, num_refs);
|
2008-09-24 01:14:14 +08:00
|
|
|
}
|
|
|
|
}
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
btrfs_mark_buffer_dirty(leaf);
|
|
|
|
ret = 0;
|
|
|
|
fail:
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
2007-12-11 22:25:06 +08:00
|
|
|
return ret;
|
2007-12-11 22:25:06 +08:00
|
|
|
}
|
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
static noinline int remove_extent_data_ref(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
int refs_to_drop)
|
2008-09-24 01:14:14 +08:00
|
|
|
{
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
struct btrfs_key key;
|
|
|
|
struct btrfs_extent_data_ref *ref1 = NULL;
|
|
|
|
struct btrfs_shared_data_ref *ref2 = NULL;
|
2008-09-24 01:14:14 +08:00
|
|
|
struct extent_buffer *leaf;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
u32 num_refs = 0;
|
2008-09-24 01:14:14 +08:00
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
|
|
|
|
|
|
|
|
if (key.type == BTRFS_EXTENT_DATA_REF_KEY) {
|
|
|
|
ref1 = btrfs_item_ptr(leaf, path->slots[0],
|
|
|
|
struct btrfs_extent_data_ref);
|
|
|
|
num_refs = btrfs_extent_data_ref_count(leaf, ref1);
|
|
|
|
} else if (key.type == BTRFS_SHARED_DATA_REF_KEY) {
|
|
|
|
ref2 = btrfs_item_ptr(leaf, path->slots[0],
|
|
|
|
struct btrfs_shared_data_ref);
|
|
|
|
num_refs = btrfs_shared_data_ref_count(leaf, ref2);
|
|
|
|
#ifdef BTRFS_COMPAT_EXTENT_TREE_V0
|
|
|
|
} else if (key.type == BTRFS_EXTENT_REF_V0_KEY) {
|
|
|
|
struct btrfs_extent_ref_v0 *ref0;
|
|
|
|
ref0 = btrfs_item_ptr(leaf, path->slots[0],
|
|
|
|
struct btrfs_extent_ref_v0);
|
|
|
|
num_refs = btrfs_ref_count_v0(leaf, ref0);
|
|
|
|
#endif
|
|
|
|
} else {
|
|
|
|
BUG();
|
|
|
|
}
|
|
|
|
|
2009-03-13 22:10:06 +08:00
|
|
|
BUG_ON(num_refs < refs_to_drop);
|
|
|
|
num_refs -= refs_to_drop;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
|
2008-09-24 01:14:14 +08:00
|
|
|
if (num_refs == 0) {
|
|
|
|
ret = btrfs_del_item(trans, root, path);
|
|
|
|
} else {
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (key.type == BTRFS_EXTENT_DATA_REF_KEY)
|
|
|
|
btrfs_set_extent_data_ref_count(leaf, ref1, num_refs);
|
|
|
|
else if (key.type == BTRFS_SHARED_DATA_REF_KEY)
|
|
|
|
btrfs_set_shared_data_ref_count(leaf, ref2, num_refs);
|
|
|
|
#ifdef BTRFS_COMPAT_EXTENT_TREE_V0
|
|
|
|
else {
|
|
|
|
struct btrfs_extent_ref_v0 *ref0;
|
|
|
|
ref0 = btrfs_item_ptr(leaf, path->slots[0],
|
|
|
|
struct btrfs_extent_ref_v0);
|
|
|
|
btrfs_set_ref_count_v0(leaf, ref0, num_refs);
|
|
|
|
}
|
|
|
|
#endif
|
2008-09-24 01:14:14 +08:00
|
|
|
btrfs_mark_buffer_dirty(leaf);
|
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
static noinline u32 extent_data_ref_count(struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
struct btrfs_extent_inline_ref *iref)
|
2008-11-20 10:17:22 +08:00
|
|
|
{
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
struct btrfs_key key;
|
|
|
|
struct extent_buffer *leaf;
|
|
|
|
struct btrfs_extent_data_ref *ref1;
|
|
|
|
struct btrfs_shared_data_ref *ref2;
|
|
|
|
u32 num_refs = 0;
|
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
|
|
|
|
if (iref) {
|
|
|
|
if (btrfs_extent_inline_ref_type(leaf, iref) ==
|
|
|
|
BTRFS_EXTENT_DATA_REF_KEY) {
|
|
|
|
ref1 = (struct btrfs_extent_data_ref *)(&iref->offset);
|
|
|
|
num_refs = btrfs_extent_data_ref_count(leaf, ref1);
|
|
|
|
} else {
|
|
|
|
ref2 = (struct btrfs_shared_data_ref *)(iref + 1);
|
|
|
|
num_refs = btrfs_shared_data_ref_count(leaf, ref2);
|
|
|
|
}
|
|
|
|
} else if (key.type == BTRFS_EXTENT_DATA_REF_KEY) {
|
|
|
|
ref1 = btrfs_item_ptr(leaf, path->slots[0],
|
|
|
|
struct btrfs_extent_data_ref);
|
|
|
|
num_refs = btrfs_extent_data_ref_count(leaf, ref1);
|
|
|
|
} else if (key.type == BTRFS_SHARED_DATA_REF_KEY) {
|
|
|
|
ref2 = btrfs_item_ptr(leaf, path->slots[0],
|
|
|
|
struct btrfs_shared_data_ref);
|
|
|
|
num_refs = btrfs_shared_data_ref_count(leaf, ref2);
|
|
|
|
#ifdef BTRFS_COMPAT_EXTENT_TREE_V0
|
|
|
|
} else if (key.type == BTRFS_EXTENT_REF_V0_KEY) {
|
|
|
|
struct btrfs_extent_ref_v0 *ref0;
|
|
|
|
ref0 = btrfs_item_ptr(leaf, path->slots[0],
|
|
|
|
struct btrfs_extent_ref_v0);
|
|
|
|
num_refs = btrfs_ref_count_v0(leaf, ref0);
|
2008-11-20 23:22:27 +08:00
|
|
|
#endif
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
} else {
|
|
|
|
WARN_ON(1);
|
|
|
|
}
|
|
|
|
return num_refs;
|
|
|
|
}
|
2008-11-20 10:17:22 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
static noinline int lookup_tree_block_ref(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
u64 bytenr, u64 parent,
|
|
|
|
u64 root_objectid)
|
Btrfs: Fix free block discard calls down to the block layer
This is a patch to fix discard semantic to make Btrfs work with FTL and SSD.
We can improve FTL's performance by telling it which sectors are freed by file
system. But if we don't tell FTL the information of free sectors in proper
time, the transaction mechanism of Btrfs will be destroyed and Btrfs could not
roll back the previous transaction under the power loss condition.
There are some problems in the old implementation:
1, In __free_extent(), the pinned down extents should not be discarded.
2, In free_extents(), the free extents are all pinned, so they need to
be discarded in transaction committing time instead of free_extents().
3, The reserved extent used by log tree should be discard too.
This patch change discard behavior as follows:
1, For the extents which need to be free at once,
we discard them in update_block_group().
2, Delay discarding the pinned extent in btrfs_finish_extent_commit()
when committing transaction.
3, Remove discarding from free_extents() and __free_extent()
4, Add discard interface into btrfs_free_reserved_extent()
5, Discard sectors before updating the free space cache, otherwise,
FTL will destroy file system data.
2009-01-06 04:57:51 +08:00
|
|
|
{
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
struct btrfs_key key;
|
Btrfs: Fix free block discard calls down to the block layer
This is a patch to fix discard semantic to make Btrfs work with FTL and SSD.
We can improve FTL's performance by telling it which sectors are freed by file
system. But if we don't tell FTL the information of free sectors in proper
time, the transaction mechanism of Btrfs will be destroyed and Btrfs could not
roll back the previous transaction under the power loss condition.
There are some problems in the old implementation:
1, In __free_extent(), the pinned down extents should not be discarded.
2, In free_extents(), the free extents are all pinned, so they need to
be discarded in transaction committing time instead of free_extents().
3, The reserved extent used by log tree should be discard too.
This patch change discard behavior as follows:
1, For the extents which need to be free at once,
we discard them in update_block_group().
2, Delay discarding the pinned extent in btrfs_finish_extent_commit()
when committing transaction.
3, Remove discarding from free_extents() and __free_extent()
4, Add discard interface into btrfs_free_reserved_extent()
5, Discard sectors before updating the free space cache, otherwise,
FTL will destroy file system data.
2009-01-06 04:57:51 +08:00
|
|
|
int ret;
|
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
key.objectid = bytenr;
|
|
|
|
if (parent) {
|
|
|
|
key.type = BTRFS_SHARED_BLOCK_REF_KEY;
|
|
|
|
key.offset = parent;
|
|
|
|
} else {
|
|
|
|
key.type = BTRFS_TREE_BLOCK_REF_KEY;
|
|
|
|
key.offset = root_objectid;
|
Btrfs: Fix free block discard calls down to the block layer
This is a patch to fix discard semantic to make Btrfs work with FTL and SSD.
We can improve FTL's performance by telling it which sectors are freed by file
system. But if we don't tell FTL the information of free sectors in proper
time, the transaction mechanism of Btrfs will be destroyed and Btrfs could not
roll back the previous transaction under the power loss condition.
There are some problems in the old implementation:
1, In __free_extent(), the pinned down extents should not be discarded.
2, In free_extents(), the free extents are all pinned, so they need to
be discarded in transaction committing time instead of free_extents().
3, The reserved extent used by log tree should be discard too.
This patch change discard behavior as follows:
1, For the extents which need to be free at once,
we discard them in update_block_group().
2, Delay discarding the pinned extent in btrfs_finish_extent_commit()
when committing transaction.
3, Remove discarding from free_extents() and __free_extent()
4, Add discard interface into btrfs_free_reserved_extent()
5, Discard sectors before updating the free space cache, otherwise,
FTL will destroy file system data.
2009-01-06 04:57:51 +08:00
|
|
|
}
|
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
|
|
|
|
if (ret > 0)
|
|
|
|
ret = -ENOENT;
|
|
|
|
#ifdef BTRFS_COMPAT_EXTENT_TREE_V0
|
|
|
|
if (ret == -ENOENT && parent) {
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
key.type = BTRFS_EXTENT_REF_V0_KEY;
|
|
|
|
ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
|
|
|
|
if (ret > 0)
|
|
|
|
ret = -ENOENT;
|
|
|
|
}
|
Btrfs: Fix free block discard calls down to the block layer
This is a patch to fix discard semantic to make Btrfs work with FTL and SSD.
We can improve FTL's performance by telling it which sectors are freed by file
system. But if we don't tell FTL the information of free sectors in proper
time, the transaction mechanism of Btrfs will be destroyed and Btrfs could not
roll back the previous transaction under the power loss condition.
There are some problems in the old implementation:
1, In __free_extent(), the pinned down extents should not be discarded.
2, In free_extents(), the free extents are all pinned, so they need to
be discarded in transaction committing time instead of free_extents().
3, The reserved extent used by log tree should be discard too.
This patch change discard behavior as follows:
1, For the extents which need to be free at once,
we discard them in update_block_group().
2, Delay discarding the pinned extent in btrfs_finish_extent_commit()
when committing transaction.
3, Remove discarding from free_extents() and __free_extent()
4, Add discard interface into btrfs_free_reserved_extent()
5, Discard sectors before updating the free space cache, otherwise,
FTL will destroy file system data.
2009-01-06 04:57:51 +08:00
|
|
|
#endif
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
return ret;
|
Btrfs: Fix free block discard calls down to the block layer
This is a patch to fix discard semantic to make Btrfs work with FTL and SSD.
We can improve FTL's performance by telling it which sectors are freed by file
system. But if we don't tell FTL the information of free sectors in proper
time, the transaction mechanism of Btrfs will be destroyed and Btrfs could not
roll back the previous transaction under the power loss condition.
There are some problems in the old implementation:
1, In __free_extent(), the pinned down extents should not be discarded.
2, In free_extents(), the free extents are all pinned, so they need to
be discarded in transaction committing time instead of free_extents().
3, The reserved extent used by log tree should be discard too.
This patch change discard behavior as follows:
1, For the extents which need to be free at once,
we discard them in update_block_group().
2, Delay discarding the pinned extent in btrfs_finish_extent_commit()
when committing transaction.
3, Remove discarding from free_extents() and __free_extent()
4, Add discard interface into btrfs_free_reserved_extent()
5, Discard sectors before updating the free space cache, otherwise,
FTL will destroy file system data.
2009-01-06 04:57:51 +08:00
|
|
|
}
|
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
static noinline int insert_tree_block_ref(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
u64 bytenr, u64 parent,
|
|
|
|
u64 root_objectid)
|
2008-09-24 01:14:14 +08:00
|
|
|
{
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
struct btrfs_key key;
|
2008-09-24 01:14:14 +08:00
|
|
|
int ret;
|
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
key.objectid = bytenr;
|
|
|
|
if (parent) {
|
|
|
|
key.type = BTRFS_SHARED_BLOCK_REF_KEY;
|
|
|
|
key.offset = parent;
|
|
|
|
} else {
|
|
|
|
key.type = BTRFS_TREE_BLOCK_REF_KEY;
|
|
|
|
key.offset = root_objectid;
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = btrfs_insert_empty_item(trans, root, path, &key, 0);
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
2008-09-24 01:14:14 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
static inline int extent_ref_type(u64 parent, u64 owner)
|
2008-09-24 01:14:14 +08:00
|
|
|
{
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
int type;
|
|
|
|
if (owner < BTRFS_FIRST_FREE_OBJECTID) {
|
|
|
|
if (parent > 0)
|
|
|
|
type = BTRFS_SHARED_BLOCK_REF_KEY;
|
|
|
|
else
|
|
|
|
type = BTRFS_TREE_BLOCK_REF_KEY;
|
|
|
|
} else {
|
|
|
|
if (parent > 0)
|
|
|
|
type = BTRFS_SHARED_DATA_REF_KEY;
|
|
|
|
else
|
|
|
|
type = BTRFS_EXTENT_DATA_REF_KEY;
|
|
|
|
}
|
|
|
|
return type;
|
2008-09-24 01:14:14 +08:00
|
|
|
}
|
2009-03-13 22:10:06 +08:00
|
|
|
|
2009-06-28 09:07:35 +08:00
|
|
|
static int find_next_key(struct btrfs_path *path, int level,
|
|
|
|
struct btrfs_key *key)
|
2009-03-13 22:10:06 +08:00
|
|
|
|
2007-03-03 05:08:05 +08:00
|
|
|
{
|
2009-06-28 09:07:35 +08:00
|
|
|
for (; level < BTRFS_MAX_LEVEL; level++) {
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (!path->nodes[level])
|
|
|
|
break;
|
|
|
|
if (path->slots[level] + 1 >=
|
|
|
|
btrfs_header_nritems(path->nodes[level]))
|
|
|
|
continue;
|
|
|
|
if (level == 0)
|
|
|
|
btrfs_item_key_to_cpu(path->nodes[level], key,
|
|
|
|
path->slots[level] + 1);
|
|
|
|
else
|
|
|
|
btrfs_node_key_to_cpu(path->nodes[level], key,
|
|
|
|
path->slots[level] + 1);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
return 1;
|
|
|
|
}
|
2007-03-08 00:50:24 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
/*
|
|
|
|
* look for inline back ref. if back ref is found, *ref_ret is set
|
|
|
|
* to the address of inline back ref, and 0 is returned.
|
|
|
|
*
|
|
|
|
* if back ref isn't found, *ref_ret is set to the address where it
|
|
|
|
* should be inserted, and -ENOENT is returned.
|
|
|
|
*
|
|
|
|
* if insert is true and there are too many inline back refs, the path
|
|
|
|
* points to the extent item, and -EAGAIN is returned.
|
|
|
|
*
|
|
|
|
* NOTE: inline back refs are ordered in the same way that back ref
|
|
|
|
* items in the tree are ordered.
|
|
|
|
*/
|
|
|
|
static noinline_for_stack
|
|
|
|
int lookup_inline_extent_backref(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
struct btrfs_extent_inline_ref **ref_ret,
|
|
|
|
u64 bytenr, u64 num_bytes,
|
|
|
|
u64 parent, u64 root_objectid,
|
|
|
|
u64 owner, u64 offset, int insert)
|
|
|
|
{
|
|
|
|
struct btrfs_key key;
|
|
|
|
struct extent_buffer *leaf;
|
|
|
|
struct btrfs_extent_item *ei;
|
|
|
|
struct btrfs_extent_inline_ref *iref;
|
|
|
|
u64 flags;
|
|
|
|
u64 item_size;
|
|
|
|
unsigned long ptr;
|
|
|
|
unsigned long end;
|
|
|
|
int extra_size;
|
|
|
|
int type;
|
|
|
|
int want;
|
|
|
|
int ret;
|
|
|
|
int err = 0;
|
2007-08-09 08:17:12 +08:00
|
|
|
|
2007-10-16 04:15:53 +08:00
|
|
|
key.objectid = bytenr;
|
2008-09-24 01:14:14 +08:00
|
|
|
key.type = BTRFS_EXTENT_ITEM_KEY;
|
2009-03-13 22:10:06 +08:00
|
|
|
key.offset = num_bytes;
|
2008-09-24 01:14:14 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
want = extent_ref_type(parent, owner);
|
|
|
|
if (insert) {
|
|
|
|
extra_size = btrfs_extent_inline_ref_size(want);
|
2009-06-11 20:51:10 +08:00
|
|
|
path->keep_locks = 1;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
} else
|
|
|
|
extra_size = -1;
|
|
|
|
ret = btrfs_search_slot(trans, root, &key, path, extra_size, 1);
|
2009-03-13 23:00:37 +08:00
|
|
|
if (ret < 0) {
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
err = ret;
|
|
|
|
goto out;
|
|
|
|
}
|
2012-03-12 23:03:00 +08:00
|
|
|
if (ret && !insert) {
|
|
|
|
err = -ENOENT;
|
|
|
|
goto out;
|
2013-03-09 04:41:02 +08:00
|
|
|
} else if (ret) {
|
|
|
|
err = -EIO;
|
|
|
|
WARN_ON(1);
|
|
|
|
goto out;
|
2012-03-12 23:03:00 +08:00
|
|
|
}
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
item_size = btrfs_item_size_nr(leaf, path->slots[0]);
|
|
|
|
#ifdef BTRFS_COMPAT_EXTENT_TREE_V0
|
|
|
|
if (item_size < sizeof(*ei)) {
|
|
|
|
if (!insert) {
|
|
|
|
err = -ENOENT;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
ret = convert_extent_item_v0(trans, root, path, owner,
|
|
|
|
extra_size);
|
|
|
|
if (ret < 0) {
|
|
|
|
err = ret;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
item_size = btrfs_item_size_nr(leaf, path->slots[0]);
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
BUG_ON(item_size < sizeof(*ei));
|
|
|
|
|
|
|
|
ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_extent_item);
|
|
|
|
flags = btrfs_extent_flags(leaf, ei);
|
|
|
|
|
|
|
|
ptr = (unsigned long)(ei + 1);
|
|
|
|
end = (unsigned long)ei + item_size;
|
|
|
|
|
|
|
|
if (flags & BTRFS_EXTENT_FLAG_TREE_BLOCK) {
|
|
|
|
ptr += sizeof(struct btrfs_tree_block_info);
|
|
|
|
BUG_ON(ptr > end);
|
|
|
|
} else {
|
|
|
|
BUG_ON(!(flags & BTRFS_EXTENT_FLAG_DATA));
|
|
|
|
}
|
|
|
|
|
|
|
|
err = -ENOENT;
|
|
|
|
while (1) {
|
|
|
|
if (ptr >= end) {
|
|
|
|
WARN_ON(ptr > end);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
iref = (struct btrfs_extent_inline_ref *)ptr;
|
|
|
|
type = btrfs_extent_inline_ref_type(leaf, iref);
|
|
|
|
if (want < type)
|
|
|
|
break;
|
|
|
|
if (want > type) {
|
|
|
|
ptr += btrfs_extent_inline_ref_size(type);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (type == BTRFS_EXTENT_DATA_REF_KEY) {
|
|
|
|
struct btrfs_extent_data_ref *dref;
|
|
|
|
dref = (struct btrfs_extent_data_ref *)(&iref->offset);
|
|
|
|
if (match_extent_data_ref(leaf, dref, root_objectid,
|
|
|
|
owner, offset)) {
|
|
|
|
err = 0;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (hash_extent_data_ref_item(leaf, dref) <
|
|
|
|
hash_extent_data_ref(root_objectid, owner, offset))
|
|
|
|
break;
|
|
|
|
} else {
|
|
|
|
u64 ref_offset;
|
|
|
|
ref_offset = btrfs_extent_inline_ref_offset(leaf, iref);
|
|
|
|
if (parent > 0) {
|
|
|
|
if (parent == ref_offset) {
|
|
|
|
err = 0;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (ref_offset < parent)
|
|
|
|
break;
|
|
|
|
} else {
|
|
|
|
if (root_objectid == ref_offset) {
|
|
|
|
err = 0;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (ref_offset < root_objectid)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
ptr += btrfs_extent_inline_ref_size(type);
|
|
|
|
}
|
|
|
|
if (err == -ENOENT && insert) {
|
|
|
|
if (item_size + extra_size >=
|
|
|
|
BTRFS_MAX_EXTENT_ITEM_SIZE(root)) {
|
|
|
|
err = -EAGAIN;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
/*
|
|
|
|
* To add new inline back ref, we have to make sure
|
|
|
|
* there is no corresponding back ref item.
|
|
|
|
* For simplicity, we just do not add new inline back
|
|
|
|
* ref if there is any kind of item for this block
|
|
|
|
*/
|
2009-06-28 09:07:35 +08:00
|
|
|
if (find_next_key(path, 0, &key) == 0 &&
|
|
|
|
key.objectid == bytenr &&
|
2009-06-11 20:51:10 +08:00
|
|
|
key.type < BTRFS_BLOCK_GROUP_ITEM_KEY) {
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
err = -EAGAIN;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
*ref_ret = (struct btrfs_extent_inline_ref *)ptr;
|
|
|
|
out:
|
2009-06-11 20:51:10 +08:00
|
|
|
if (insert) {
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
path->keep_locks = 0;
|
|
|
|
btrfs_unlock_up_safe(path, 1);
|
|
|
|
}
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* helper to add new inline back ref
|
|
|
|
*/
|
|
|
|
static noinline_for_stack
|
2012-03-01 21:56:26 +08:00
|
|
|
void setup_inline_extent_backref(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
struct btrfs_extent_inline_ref *iref,
|
|
|
|
u64 parent, u64 root_objectid,
|
|
|
|
u64 owner, u64 offset, int refs_to_add,
|
|
|
|
struct btrfs_delayed_extent_op *extent_op)
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
{
|
|
|
|
struct extent_buffer *leaf;
|
|
|
|
struct btrfs_extent_item *ei;
|
|
|
|
unsigned long ptr;
|
|
|
|
unsigned long end;
|
|
|
|
unsigned long item_offset;
|
|
|
|
u64 refs;
|
|
|
|
int size;
|
|
|
|
int type;
|
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_extent_item);
|
|
|
|
item_offset = (unsigned long)iref - (unsigned long)ei;
|
|
|
|
|
|
|
|
type = extent_ref_type(parent, owner);
|
|
|
|
size = btrfs_extent_inline_ref_size(type);
|
|
|
|
|
2012-03-01 21:56:26 +08:00
|
|
|
btrfs_extend_item(trans, root, path, size);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
|
|
|
|
ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_extent_item);
|
|
|
|
refs = btrfs_extent_refs(leaf, ei);
|
|
|
|
refs += refs_to_add;
|
|
|
|
btrfs_set_extent_refs(leaf, ei, refs);
|
|
|
|
if (extent_op)
|
|
|
|
__run_delayed_extent_op(extent_op, leaf, ei);
|
|
|
|
|
|
|
|
ptr = (unsigned long)ei + item_offset;
|
|
|
|
end = (unsigned long)ei + btrfs_item_size_nr(leaf, path->slots[0]);
|
|
|
|
if (ptr < end - size)
|
|
|
|
memmove_extent_buffer(leaf, ptr + size, ptr,
|
|
|
|
end - size - ptr);
|
|
|
|
|
|
|
|
iref = (struct btrfs_extent_inline_ref *)ptr;
|
|
|
|
btrfs_set_extent_inline_ref_type(leaf, iref, type);
|
|
|
|
if (type == BTRFS_EXTENT_DATA_REF_KEY) {
|
|
|
|
struct btrfs_extent_data_ref *dref;
|
|
|
|
dref = (struct btrfs_extent_data_ref *)(&iref->offset);
|
|
|
|
btrfs_set_extent_data_ref_root(leaf, dref, root_objectid);
|
|
|
|
btrfs_set_extent_data_ref_objectid(leaf, dref, owner);
|
|
|
|
btrfs_set_extent_data_ref_offset(leaf, dref, offset);
|
|
|
|
btrfs_set_extent_data_ref_count(leaf, dref, refs_to_add);
|
|
|
|
} else if (type == BTRFS_SHARED_DATA_REF_KEY) {
|
|
|
|
struct btrfs_shared_data_ref *sref;
|
|
|
|
sref = (struct btrfs_shared_data_ref *)(iref + 1);
|
|
|
|
btrfs_set_shared_data_ref_count(leaf, sref, refs_to_add);
|
|
|
|
btrfs_set_extent_inline_ref_offset(leaf, iref, parent);
|
|
|
|
} else if (type == BTRFS_SHARED_BLOCK_REF_KEY) {
|
|
|
|
btrfs_set_extent_inline_ref_offset(leaf, iref, parent);
|
|
|
|
} else {
|
|
|
|
btrfs_set_extent_inline_ref_offset(leaf, iref, root_objectid);
|
|
|
|
}
|
|
|
|
btrfs_mark_buffer_dirty(leaf);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int lookup_extent_backref(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
struct btrfs_extent_inline_ref **ref_ret,
|
|
|
|
u64 bytenr, u64 num_bytes, u64 parent,
|
|
|
|
u64 root_objectid, u64 owner, u64 offset)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
ret = lookup_inline_extent_backref(trans, root, path, ref_ret,
|
|
|
|
bytenr, num_bytes, parent,
|
|
|
|
root_objectid, owner, offset, 0);
|
|
|
|
if (ret != -ENOENT)
|
2007-06-23 02:16:25 +08:00
|
|
|
return ret;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
*ref_ret = NULL;
|
|
|
|
|
|
|
|
if (owner < BTRFS_FIRST_FREE_OBJECTID) {
|
|
|
|
ret = lookup_tree_block_ref(trans, root, path, bytenr, parent,
|
|
|
|
root_objectid);
|
|
|
|
} else {
|
|
|
|
ret = lookup_extent_data_ref(trans, root, path, bytenr, parent,
|
|
|
|
root_objectid, owner, offset);
|
2009-03-13 23:00:37 +08:00
|
|
|
}
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
return ret;
|
|
|
|
}
|
2008-09-24 01:14:14 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
/*
|
|
|
|
* helper to update/remove inline back ref
|
|
|
|
*/
|
|
|
|
static noinline_for_stack
|
2012-03-01 21:56:26 +08:00
|
|
|
void update_inline_extent_backref(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
struct btrfs_extent_inline_ref *iref,
|
|
|
|
int refs_to_mod,
|
|
|
|
struct btrfs_delayed_extent_op *extent_op)
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
{
|
|
|
|
struct extent_buffer *leaf;
|
|
|
|
struct btrfs_extent_item *ei;
|
|
|
|
struct btrfs_extent_data_ref *dref = NULL;
|
|
|
|
struct btrfs_shared_data_ref *sref = NULL;
|
|
|
|
unsigned long ptr;
|
|
|
|
unsigned long end;
|
|
|
|
u32 item_size;
|
|
|
|
int size;
|
|
|
|
int type;
|
|
|
|
u64 refs;
|
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_extent_item);
|
|
|
|
refs = btrfs_extent_refs(leaf, ei);
|
|
|
|
WARN_ON(refs_to_mod < 0 && refs + refs_to_mod <= 0);
|
|
|
|
refs += refs_to_mod;
|
|
|
|
btrfs_set_extent_refs(leaf, ei, refs);
|
|
|
|
if (extent_op)
|
|
|
|
__run_delayed_extent_op(extent_op, leaf, ei);
|
|
|
|
|
|
|
|
type = btrfs_extent_inline_ref_type(leaf, iref);
|
|
|
|
|
|
|
|
if (type == BTRFS_EXTENT_DATA_REF_KEY) {
|
|
|
|
dref = (struct btrfs_extent_data_ref *)(&iref->offset);
|
|
|
|
refs = btrfs_extent_data_ref_count(leaf, dref);
|
|
|
|
} else if (type == BTRFS_SHARED_DATA_REF_KEY) {
|
|
|
|
sref = (struct btrfs_shared_data_ref *)(iref + 1);
|
|
|
|
refs = btrfs_shared_data_ref_count(leaf, sref);
|
|
|
|
} else {
|
|
|
|
refs = 1;
|
|
|
|
BUG_ON(refs_to_mod != -1);
|
2009-03-13 22:10:06 +08:00
|
|
|
}
|
2008-09-24 01:14:14 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
BUG_ON(refs_to_mod < 0 && refs < -refs_to_mod);
|
|
|
|
refs += refs_to_mod;
|
|
|
|
|
|
|
|
if (refs > 0) {
|
|
|
|
if (type == BTRFS_EXTENT_DATA_REF_KEY)
|
|
|
|
btrfs_set_extent_data_ref_count(leaf, dref, refs);
|
|
|
|
else
|
|
|
|
btrfs_set_shared_data_ref_count(leaf, sref, refs);
|
|
|
|
} else {
|
|
|
|
size = btrfs_extent_inline_ref_size(type);
|
|
|
|
item_size = btrfs_item_size_nr(leaf, path->slots[0]);
|
|
|
|
ptr = (unsigned long)iref;
|
|
|
|
end = (unsigned long)ei + item_size;
|
|
|
|
if (ptr + size < end)
|
|
|
|
memmove_extent_buffer(leaf, ptr, ptr + size,
|
|
|
|
end - ptr - size);
|
|
|
|
item_size -= size;
|
2012-03-01 21:56:26 +08:00
|
|
|
btrfs_truncate_item(trans, root, path, item_size, 1);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
}
|
|
|
|
btrfs_mark_buffer_dirty(leaf);
|
|
|
|
}
|
|
|
|
|
|
|
|
static noinline_for_stack
|
|
|
|
int insert_inline_extent_backref(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
u64 bytenr, u64 num_bytes, u64 parent,
|
|
|
|
u64 root_objectid, u64 owner,
|
|
|
|
u64 offset, int refs_to_add,
|
|
|
|
struct btrfs_delayed_extent_op *extent_op)
|
|
|
|
{
|
|
|
|
struct btrfs_extent_inline_ref *iref;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
ret = lookup_inline_extent_backref(trans, root, path, &iref,
|
|
|
|
bytenr, num_bytes, parent,
|
|
|
|
root_objectid, owner, offset, 1);
|
|
|
|
if (ret == 0) {
|
|
|
|
BUG_ON(owner < BTRFS_FIRST_FREE_OBJECTID);
|
2012-03-01 21:56:26 +08:00
|
|
|
update_inline_extent_backref(trans, root, path, iref,
|
|
|
|
refs_to_add, extent_op);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
} else if (ret == -ENOENT) {
|
2012-03-01 21:56:26 +08:00
|
|
|
setup_inline_extent_backref(trans, root, path, iref, parent,
|
|
|
|
root_objectid, owner, offset,
|
|
|
|
refs_to_add, extent_op);
|
|
|
|
ret = 0;
|
2008-11-07 11:02:51 +08:00
|
|
|
}
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
return ret;
|
|
|
|
}
|
2008-09-24 01:14:14 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
static int insert_extent_backref(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
u64 bytenr, u64 parent, u64 root_objectid,
|
|
|
|
u64 owner, u64 offset, int refs_to_add)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
if (owner < BTRFS_FIRST_FREE_OBJECTID) {
|
|
|
|
BUG_ON(refs_to_add != 1);
|
|
|
|
ret = insert_tree_block_ref(trans, root, path, bytenr,
|
|
|
|
parent, root_objectid);
|
|
|
|
} else {
|
|
|
|
ret = insert_extent_data_ref(trans, root, path, bytenr,
|
|
|
|
parent, root_objectid,
|
|
|
|
owner, offset, refs_to_add);
|
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
|
2009-03-13 22:10:06 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
static int remove_extent_backref(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
struct btrfs_extent_inline_ref *iref,
|
|
|
|
int refs_to_drop, int is_data)
|
|
|
|
{
|
2012-03-01 21:56:26 +08:00
|
|
|
int ret = 0;
|
2009-03-13 23:00:37 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
BUG_ON(!is_data && refs_to_drop != 1);
|
|
|
|
if (iref) {
|
2012-03-01 21:56:26 +08:00
|
|
|
update_inline_extent_backref(trans, root, path, iref,
|
|
|
|
-refs_to_drop, NULL);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
} else if (is_data) {
|
|
|
|
ret = remove_extent_data_ref(trans, root, path, refs_to_drop);
|
|
|
|
} else {
|
|
|
|
ret = btrfs_del_item(trans, root, path);
|
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2011-03-24 18:24:27 +08:00
|
|
|
static int btrfs_issue_discard(struct block_device *bdev,
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
u64 start, u64 len)
|
|
|
|
{
|
2011-03-24 18:24:27 +08:00
|
|
|
return blkdev_issue_discard(bdev, start >> 9, len >> 9, GFP_NOFS, 0);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static int btrfs_discard_extent(struct btrfs_root *root, u64 bytenr,
|
2011-03-24 18:24:27 +08:00
|
|
|
u64 num_bytes, u64 *actual_bytes)
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
{
|
|
|
|
int ret;
|
2011-03-24 18:24:27 +08:00
|
|
|
u64 discarded_bytes = 0;
|
2011-08-04 23:15:33 +08:00
|
|
|
struct btrfs_bio *bbio = NULL;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
|
2009-10-14 21:24:59 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
/* Tell the block device(s) that the sectors can be discarded */
|
2012-11-05 22:46:42 +08:00
|
|
|
ret = btrfs_map_block(root->fs_info, REQ_DISCARD,
|
2011-08-04 23:15:33 +08:00
|
|
|
bytenr, &num_bytes, &bbio, 0);
|
2012-03-12 23:03:00 +08:00
|
|
|
/* Error condition is -ENOMEM */
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (!ret) {
|
2011-08-04 23:15:33 +08:00
|
|
|
struct btrfs_bio_stripe *stripe = bbio->stripes;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
int i;
|
|
|
|
|
|
|
|
|
2011-08-04 23:15:33 +08:00
|
|
|
for (i = 0; i < bbio->num_stripes; i++, stripe++) {
|
2011-08-04 22:52:27 +08:00
|
|
|
if (!stripe->dev->can_discard)
|
|
|
|
continue;
|
|
|
|
|
2011-03-24 18:24:27 +08:00
|
|
|
ret = btrfs_issue_discard(stripe->dev->bdev,
|
|
|
|
stripe->physical,
|
|
|
|
stripe->length);
|
|
|
|
if (!ret)
|
|
|
|
discarded_bytes += stripe->length;
|
|
|
|
else if (ret != -EOPNOTSUPP)
|
2012-03-12 23:03:00 +08:00
|
|
|
break; /* Logic errors or -ENOMEM, or -EIO but I don't know how that could happen JDM */
|
2011-08-04 22:52:27 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Just in case we get back EOPNOTSUPP for some reason,
|
|
|
|
* just ignore the return value so we don't screw up
|
|
|
|
* people calling discard_extent.
|
|
|
|
*/
|
|
|
|
ret = 0;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
}
|
2011-08-04 23:15:33 +08:00
|
|
|
kfree(bbio);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
}
|
2011-03-24 18:24:27 +08:00
|
|
|
|
|
|
|
if (actual_bytes)
|
|
|
|
*actual_bytes = discarded_bytes;
|
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
|
2013-01-30 07:40:14 +08:00
|
|
|
if (ret == -EOPNOTSUPP)
|
|
|
|
ret = 0;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2012-03-12 23:03:00 +08:00
|
|
|
/* Can return -ENOMEM */
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
u64 bytenr, u64 num_bytes, u64 parent,
|
2011-09-12 21:26:38 +08:00
|
|
|
u64 root_objectid, u64 owner, u64 offset, int for_cow)
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
{
|
|
|
|
int ret;
|
2011-09-12 21:26:38 +08:00
|
|
|
struct btrfs_fs_info *fs_info = root->fs_info;
|
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
BUG_ON(owner < BTRFS_FIRST_FREE_OBJECTID &&
|
|
|
|
root_objectid == BTRFS_TREE_LOG_OBJECTID);
|
|
|
|
|
|
|
|
if (owner < BTRFS_FIRST_FREE_OBJECTID) {
|
2011-09-12 21:26:38 +08:00
|
|
|
ret = btrfs_add_delayed_tree_ref(fs_info, trans, bytenr,
|
|
|
|
num_bytes,
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
parent, root_objectid, (int)owner,
|
2011-09-12 21:26:38 +08:00
|
|
|
BTRFS_ADD_DELAYED_REF, NULL, for_cow);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
} else {
|
2011-09-12 21:26:38 +08:00
|
|
|
ret = btrfs_add_delayed_data_ref(fs_info, trans, bytenr,
|
|
|
|
num_bytes,
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
parent, root_objectid, owner, offset,
|
2011-09-12 21:26:38 +08:00
|
|
|
BTRFS_ADD_DELAYED_REF, NULL, for_cow);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int __btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
u64 bytenr, u64 num_bytes,
|
|
|
|
u64 parent, u64 root_objectid,
|
|
|
|
u64 owner, u64 offset, int refs_to_add,
|
|
|
|
struct btrfs_delayed_extent_op *extent_op)
|
|
|
|
{
|
|
|
|
struct btrfs_path *path;
|
|
|
|
struct extent_buffer *leaf;
|
|
|
|
struct btrfs_extent_item *item;
|
|
|
|
u64 refs;
|
|
|
|
int ret;
|
|
|
|
int err = 0;
|
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
path->reada = 1;
|
|
|
|
path->leave_spinning = 1;
|
|
|
|
/* this will setup the path even if it fails to insert the back ref */
|
|
|
|
ret = insert_inline_extent_backref(trans, root->fs_info->extent_root,
|
|
|
|
path, bytenr, num_bytes, parent,
|
|
|
|
root_objectid, owner, offset,
|
|
|
|
refs_to_add, extent_op);
|
|
|
|
if (ret == 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
if (ret != -EAGAIN) {
|
|
|
|
err = ret;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
item = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_extent_item);
|
|
|
|
refs = btrfs_extent_refs(leaf, item);
|
|
|
|
btrfs_set_extent_refs(leaf, item, refs + refs_to_add);
|
|
|
|
if (extent_op)
|
|
|
|
__run_delayed_extent_op(extent_op, leaf, item);
|
2009-03-13 22:10:06 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
btrfs_mark_buffer_dirty(leaf);
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
2009-03-13 22:10:06 +08:00
|
|
|
|
|
|
|
path->reada = 1;
|
2009-03-13 23:00:37 +08:00
|
|
|
path->leave_spinning = 1;
|
|
|
|
|
2009-03-13 22:10:06 +08:00
|
|
|
/* now insert the actual backref */
|
|
|
|
ret = insert_extent_backref(trans, root->fs_info->extent_root,
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
path, bytenr, parent, root_objectid,
|
|
|
|
owner, offset, refs_to_add);
|
2012-03-12 23:03:00 +08:00
|
|
|
if (ret)
|
|
|
|
btrfs_abort_transaction(trans, root, ret);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
out:
|
2009-03-13 22:10:06 +08:00
|
|
|
btrfs_free_path(path);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
return err;
|
2009-03-13 22:10:06 +08:00
|
|
|
}
|
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
static int run_delayed_data_ref(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_delayed_ref_node *node,
|
|
|
|
struct btrfs_delayed_extent_op *extent_op,
|
|
|
|
int insert_reserved)
|
2009-03-13 22:10:06 +08:00
|
|
|
{
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
int ret = 0;
|
|
|
|
struct btrfs_delayed_data_ref *ref;
|
|
|
|
struct btrfs_key ins;
|
|
|
|
u64 parent = 0;
|
|
|
|
u64 ref_root = 0;
|
|
|
|
u64 flags = 0;
|
|
|
|
|
|
|
|
ins.objectid = node->bytenr;
|
|
|
|
ins.offset = node->num_bytes;
|
|
|
|
ins.type = BTRFS_EXTENT_ITEM_KEY;
|
|
|
|
|
|
|
|
ref = btrfs_delayed_node_to_data_ref(node);
|
|
|
|
if (node->type == BTRFS_SHARED_DATA_REF_KEY)
|
|
|
|
parent = ref->parent;
|
|
|
|
else
|
|
|
|
ref_root = ref->root;
|
|
|
|
|
|
|
|
if (node->action == BTRFS_ADD_DELAYED_REF && insert_reserved) {
|
|
|
|
if (extent_op) {
|
|
|
|
BUG_ON(extent_op->update_key);
|
|
|
|
flags |= extent_op->flags_to_set;
|
|
|
|
}
|
|
|
|
ret = alloc_reserved_file_extent(trans, root,
|
|
|
|
parent, ref_root, flags,
|
|
|
|
ref->objectid, ref->offset,
|
|
|
|
&ins, node->ref_mod);
|
|
|
|
} else if (node->action == BTRFS_ADD_DELAYED_REF) {
|
|
|
|
ret = __btrfs_inc_extent_ref(trans, root, node->bytenr,
|
|
|
|
node->num_bytes, parent,
|
|
|
|
ref_root, ref->objectid,
|
|
|
|
ref->offset, node->ref_mod,
|
|
|
|
extent_op);
|
|
|
|
} else if (node->action == BTRFS_DROP_DELAYED_REF) {
|
|
|
|
ret = __btrfs_free_extent(trans, root, node->bytenr,
|
|
|
|
node->num_bytes, parent,
|
|
|
|
ref_root, ref->objectid,
|
|
|
|
ref->offset, node->ref_mod,
|
|
|
|
extent_op);
|
|
|
|
} else {
|
|
|
|
BUG();
|
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void __run_delayed_extent_op(struct btrfs_delayed_extent_op *extent_op,
|
|
|
|
struct extent_buffer *leaf,
|
|
|
|
struct btrfs_extent_item *ei)
|
|
|
|
{
|
|
|
|
u64 flags = btrfs_extent_flags(leaf, ei);
|
|
|
|
if (extent_op->update_flags) {
|
|
|
|
flags |= extent_op->flags_to_set;
|
|
|
|
btrfs_set_extent_flags(leaf, ei, flags);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (extent_op->update_key) {
|
|
|
|
struct btrfs_tree_block_info *bi;
|
|
|
|
BUG_ON(!(flags & BTRFS_EXTENT_FLAG_TREE_BLOCK));
|
|
|
|
bi = (struct btrfs_tree_block_info *)(ei + 1);
|
|
|
|
btrfs_set_tree_block_key(leaf, bi, &extent_op->key);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static int run_delayed_extent_op(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_delayed_ref_node *node,
|
|
|
|
struct btrfs_delayed_extent_op *extent_op)
|
|
|
|
{
|
|
|
|
struct btrfs_key key;
|
|
|
|
struct btrfs_path *path;
|
|
|
|
struct btrfs_extent_item *ei;
|
|
|
|
struct extent_buffer *leaf;
|
|
|
|
u32 item_size;
|
2009-03-13 22:10:06 +08:00
|
|
|
int ret;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
int err = 0;
|
|
|
|
|
2012-03-12 23:03:00 +08:00
|
|
|
if (trans->aborted)
|
|
|
|
return 0;
|
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
key.objectid = node->bytenr;
|
|
|
|
key.type = BTRFS_EXTENT_ITEM_KEY;
|
|
|
|
key.offset = node->num_bytes;
|
|
|
|
|
|
|
|
path->reada = 1;
|
|
|
|
path->leave_spinning = 1;
|
|
|
|
ret = btrfs_search_slot(trans, root->fs_info->extent_root, &key,
|
|
|
|
path, 0, 1);
|
|
|
|
if (ret < 0) {
|
|
|
|
err = ret;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
if (ret > 0) {
|
|
|
|
err = -EIO;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
item_size = btrfs_item_size_nr(leaf, path->slots[0]);
|
|
|
|
#ifdef BTRFS_COMPAT_EXTENT_TREE_V0
|
|
|
|
if (item_size < sizeof(*ei)) {
|
|
|
|
ret = convert_extent_item_v0(trans, root->fs_info->extent_root,
|
|
|
|
path, (u64)-1, 0);
|
|
|
|
if (ret < 0) {
|
|
|
|
err = ret;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
item_size = btrfs_item_size_nr(leaf, path->slots[0]);
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
BUG_ON(item_size < sizeof(*ei));
|
|
|
|
ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_extent_item);
|
|
|
|
__run_delayed_extent_op(extent_op, leaf, ei);
|
2009-03-13 22:10:06 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
btrfs_mark_buffer_dirty(leaf);
|
|
|
|
out:
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return err;
|
2009-03-13 22:10:06 +08:00
|
|
|
}
|
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
static int run_delayed_tree_ref(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_delayed_ref_node *node,
|
|
|
|
struct btrfs_delayed_extent_op *extent_op,
|
|
|
|
int insert_reserved)
|
2009-03-13 22:10:06 +08:00
|
|
|
{
|
|
|
|
int ret = 0;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
struct btrfs_delayed_tree_ref *ref;
|
|
|
|
struct btrfs_key ins;
|
|
|
|
u64 parent = 0;
|
|
|
|
u64 ref_root = 0;
|
2009-03-13 22:10:06 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
ins.objectid = node->bytenr;
|
|
|
|
ins.offset = node->num_bytes;
|
|
|
|
ins.type = BTRFS_EXTENT_ITEM_KEY;
|
2009-03-13 22:10:06 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
ref = btrfs_delayed_node_to_tree_ref(node);
|
|
|
|
if (node->type == BTRFS_SHARED_BLOCK_REF_KEY)
|
|
|
|
parent = ref->parent;
|
|
|
|
else
|
|
|
|
ref_root = ref->root;
|
|
|
|
|
|
|
|
BUG_ON(node->ref_mod != 1);
|
|
|
|
if (node->action == BTRFS_ADD_DELAYED_REF && insert_reserved) {
|
|
|
|
BUG_ON(!extent_op || !extent_op->update_flags ||
|
|
|
|
!extent_op->update_key);
|
|
|
|
ret = alloc_reserved_tree_block(trans, root,
|
|
|
|
parent, ref_root,
|
|
|
|
extent_op->flags_to_set,
|
|
|
|
&extent_op->key,
|
|
|
|
ref->level, &ins);
|
|
|
|
} else if (node->action == BTRFS_ADD_DELAYED_REF) {
|
|
|
|
ret = __btrfs_inc_extent_ref(trans, root, node->bytenr,
|
|
|
|
node->num_bytes, parent, ref_root,
|
|
|
|
ref->level, 0, 1, extent_op);
|
|
|
|
} else if (node->action == BTRFS_DROP_DELAYED_REF) {
|
|
|
|
ret = __btrfs_free_extent(trans, root, node->bytenr,
|
|
|
|
node->num_bytes, parent, ref_root,
|
|
|
|
ref->level, 0, 1, extent_op);
|
|
|
|
} else {
|
|
|
|
BUG();
|
|
|
|
}
|
2009-03-13 22:10:06 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* helper function to actually process a single delayed ref entry */
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
static int run_one_delayed_ref(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_delayed_ref_node *node,
|
|
|
|
struct btrfs_delayed_extent_op *extent_op,
|
|
|
|
int insert_reserved)
|
2009-03-13 22:10:06 +08:00
|
|
|
{
|
2012-03-12 23:03:00 +08:00
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
if (trans->aborted)
|
|
|
|
return 0;
|
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (btrfs_delayed_ref_is_head(node)) {
|
2009-03-13 22:10:06 +08:00
|
|
|
struct btrfs_delayed_ref_head *head;
|
|
|
|
/*
|
|
|
|
* we've hit the end of the chain and we were supposed
|
|
|
|
* to insert this extent into the tree. But, it got
|
|
|
|
* deleted before we ever needed to insert it, so all
|
|
|
|
* we have to do is clean up the accounting
|
|
|
|
*/
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
BUG_ON(extent_op);
|
|
|
|
head = btrfs_delayed_node_to_head(node);
|
2009-03-13 22:10:06 +08:00
|
|
|
if (insert_reserved) {
|
2010-05-16 22:46:25 +08:00
|
|
|
btrfs_pin_extent(root, node->bytenr,
|
|
|
|
node->num_bytes, 1);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (head->is_data) {
|
|
|
|
ret = btrfs_del_csums(trans, root,
|
|
|
|
node->bytenr,
|
|
|
|
node->num_bytes);
|
|
|
|
}
|
2009-03-13 22:10:06 +08:00
|
|
|
}
|
2012-03-12 23:03:00 +08:00
|
|
|
return ret;
|
2009-03-13 22:10:06 +08:00
|
|
|
}
|
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (node->type == BTRFS_TREE_BLOCK_REF_KEY ||
|
|
|
|
node->type == BTRFS_SHARED_BLOCK_REF_KEY)
|
|
|
|
ret = run_delayed_tree_ref(trans, root, node, extent_op,
|
|
|
|
insert_reserved);
|
|
|
|
else if (node->type == BTRFS_EXTENT_DATA_REF_KEY ||
|
|
|
|
node->type == BTRFS_SHARED_DATA_REF_KEY)
|
|
|
|
ret = run_delayed_data_ref(trans, root, node, extent_op,
|
|
|
|
insert_reserved);
|
|
|
|
else
|
|
|
|
BUG();
|
|
|
|
return ret;
|
2009-03-13 22:10:06 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static noinline struct btrfs_delayed_ref_node *
|
|
|
|
select_delayed_ref(struct btrfs_delayed_ref_head *head)
|
|
|
|
{
|
|
|
|
struct rb_node *node;
|
|
|
|
struct btrfs_delayed_ref_node *ref;
|
|
|
|
int action = BTRFS_ADD_DELAYED_REF;
|
|
|
|
again:
|
|
|
|
/*
|
|
|
|
* select delayed ref of type BTRFS_ADD_DELAYED_REF first.
|
|
|
|
* this prevents ref count from going down to zero when
|
|
|
|
* there still are pending delayed ref.
|
|
|
|
*/
|
|
|
|
node = rb_prev(&head->node.rb_node);
|
|
|
|
while (1) {
|
|
|
|
if (!node)
|
|
|
|
break;
|
|
|
|
ref = rb_entry(node, struct btrfs_delayed_ref_node,
|
|
|
|
rb_node);
|
|
|
|
if (ref->bytenr != head->node.bytenr)
|
|
|
|
break;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (ref->action == action)
|
2009-03-13 22:10:06 +08:00
|
|
|
return ref;
|
|
|
|
node = rb_prev(node);
|
|
|
|
}
|
|
|
|
if (action == BTRFS_ADD_DELAYED_REF) {
|
|
|
|
action = BTRFS_DROP_DELAYED_REF;
|
|
|
|
goto again;
|
|
|
|
}
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2012-03-12 23:03:00 +08:00
|
|
|
/*
|
|
|
|
* Returns 0 on success or if called with an already aborted transaction.
|
|
|
|
* Returns -ENOMEM or -EIO on failure and will abort the transaction.
|
|
|
|
*/
|
2009-03-13 22:17:05 +08:00
|
|
|
static noinline int run_clustered_refs(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct list_head *cluster)
|
2009-03-13 22:10:06 +08:00
|
|
|
{
|
|
|
|
struct btrfs_delayed_ref_root *delayed_refs;
|
|
|
|
struct btrfs_delayed_ref_node *ref;
|
|
|
|
struct btrfs_delayed_ref_head *locked_ref = NULL;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
struct btrfs_delayed_extent_op *extent_op;
|
2012-06-21 17:08:04 +08:00
|
|
|
struct btrfs_fs_info *fs_info = root->fs_info;
|
2009-03-13 22:10:06 +08:00
|
|
|
int ret;
|
2009-03-13 22:17:05 +08:00
|
|
|
int count = 0;
|
2009-03-13 22:10:06 +08:00
|
|
|
int must_insert_reserved = 0;
|
|
|
|
|
|
|
|
delayed_refs = &trans->transaction->delayed_refs;
|
|
|
|
while (1) {
|
|
|
|
if (!locked_ref) {
|
2009-03-13 22:17:05 +08:00
|
|
|
/* pick a new head ref from the cluster list */
|
|
|
|
if (list_empty(cluster))
|
2009-03-13 22:10:06 +08:00
|
|
|
break;
|
|
|
|
|
2009-03-13 22:17:05 +08:00
|
|
|
locked_ref = list_entry(cluster->next,
|
|
|
|
struct btrfs_delayed_ref_head, cluster);
|
|
|
|
|
|
|
|
/* grab the lock that says we are going to process
|
|
|
|
* all the refs for this head */
|
|
|
|
ret = btrfs_delayed_ref_lock(trans, locked_ref);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* we may have dropped the spin lock to get the head
|
|
|
|
* mutex lock, and that might have given someone else
|
|
|
|
* time to free the head. If that's true, it has been
|
|
|
|
* removed from our list and we can move on.
|
|
|
|
*/
|
|
|
|
if (ret == -EAGAIN) {
|
|
|
|
locked_ref = NULL;
|
|
|
|
count++;
|
|
|
|
continue;
|
2009-03-13 22:10:06 +08:00
|
|
|
}
|
|
|
|
}
|
2007-03-07 09:08:01 +08:00
|
|
|
|
2012-08-08 04:00:32 +08:00
|
|
|
/*
|
|
|
|
* We need to try and merge add/drops of the same ref since we
|
|
|
|
* can run into issues with relocate dropping the implicit ref
|
|
|
|
* and then it being added back again before the drop can
|
|
|
|
* finish. If we merged anything we need to re-loop so we can
|
|
|
|
* get a good ref.
|
|
|
|
*/
|
|
|
|
btrfs_merge_delayed_refs(trans, fs_info, delayed_refs,
|
|
|
|
locked_ref);
|
|
|
|
|
2011-09-13 21:16:43 +08:00
|
|
|
/*
|
|
|
|
* locked_ref is the head node, so we have to go one
|
|
|
|
* node back for any delayed ref updates
|
|
|
|
*/
|
|
|
|
ref = select_delayed_ref(locked_ref);
|
|
|
|
|
|
|
|
if (ref && ref->seq &&
|
2012-06-21 17:08:04 +08:00
|
|
|
btrfs_check_delayed_seq(fs_info, delayed_refs, ref->seq)) {
|
2011-09-13 21:16:43 +08:00
|
|
|
/*
|
|
|
|
* there are still refs with lower seq numbers in the
|
|
|
|
* process of being added. Don't run this ref yet.
|
|
|
|
*/
|
|
|
|
list_del_init(&locked_ref->cluster);
|
2012-12-19 16:10:10 +08:00
|
|
|
btrfs_delayed_ref_unlock(locked_ref);
|
2011-09-13 21:16:43 +08:00
|
|
|
locked_ref = NULL;
|
|
|
|
delayed_refs->num_heads_ready++;
|
|
|
|
spin_unlock(&delayed_refs->lock);
|
|
|
|
cond_resched();
|
|
|
|
spin_lock(&delayed_refs->lock);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2009-03-13 22:10:06 +08:00
|
|
|
/*
|
|
|
|
* record the must insert reserved flag before we
|
|
|
|
* drop the spin lock.
|
|
|
|
*/
|
|
|
|
must_insert_reserved = locked_ref->must_insert_reserved;
|
|
|
|
locked_ref->must_insert_reserved = 0;
|
2007-12-11 22:25:06 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
extent_op = locked_ref->extent_op;
|
|
|
|
locked_ref->extent_op = NULL;
|
|
|
|
|
2009-03-13 22:10:06 +08:00
|
|
|
if (!ref) {
|
|
|
|
/* All delayed refs have been processed, Go ahead
|
|
|
|
* and send the head node to run_one_delayed_ref,
|
|
|
|
* so that any accounting fixes can happen
|
|
|
|
*/
|
|
|
|
ref = &locked_ref->node;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
|
|
|
|
if (extent_op && must_insert_reserved) {
|
2012-11-21 10:21:28 +08:00
|
|
|
btrfs_free_delayed_extent_op(extent_op);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
extent_op = NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (extent_op) {
|
|
|
|
spin_unlock(&delayed_refs->lock);
|
|
|
|
|
|
|
|
ret = run_delayed_extent_op(trans, root,
|
|
|
|
ref, extent_op);
|
2012-11-21 10:21:28 +08:00
|
|
|
btrfs_free_delayed_extent_op(extent_op);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
|
2012-03-12 23:03:00 +08:00
|
|
|
if (ret) {
|
2012-12-19 16:10:10 +08:00
|
|
|
printk(KERN_DEBUG
|
|
|
|
"btrfs: run_delayed_extent_op "
|
|
|
|
"returned %d\n", ret);
|
2012-04-18 14:59:03 +08:00
|
|
|
spin_lock(&delayed_refs->lock);
|
2012-12-19 16:10:10 +08:00
|
|
|
btrfs_delayed_ref_unlock(locked_ref);
|
2012-03-12 23:03:00 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2012-01-07 04:23:57 +08:00
|
|
|
goto next;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
}
|
2009-03-13 22:10:06 +08:00
|
|
|
}
|
2007-03-03 05:08:05 +08:00
|
|
|
|
2009-03-13 22:10:06 +08:00
|
|
|
ref->in_tree = 0;
|
|
|
|
rb_erase(&ref->rb_node, &delayed_refs->root);
|
|
|
|
delayed_refs->num_entries--;
|
2012-12-19 16:10:10 +08:00
|
|
|
if (!btrfs_delayed_ref_is_head(ref)) {
|
2012-08-09 14:16:53 +08:00
|
|
|
/*
|
|
|
|
* when we play the delayed ref, also correct the
|
|
|
|
* ref_mod on head
|
|
|
|
*/
|
|
|
|
switch (ref->action) {
|
|
|
|
case BTRFS_ADD_DELAYED_REF:
|
|
|
|
case BTRFS_ADD_DELAYED_EXTENT:
|
|
|
|
locked_ref->node.ref_mod -= ref->ref_mod;
|
|
|
|
break;
|
|
|
|
case BTRFS_DROP_DELAYED_REF:
|
|
|
|
locked_ref->node.ref_mod += ref->ref_mod;
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
WARN_ON(1);
|
|
|
|
}
|
|
|
|
}
|
2009-03-13 22:10:06 +08:00
|
|
|
spin_unlock(&delayed_refs->lock);
|
2008-06-26 04:01:30 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
ret = run_one_delayed_ref(trans, root, ref, extent_op,
|
2009-03-13 22:10:06 +08:00
|
|
|
must_insert_reserved);
|
2009-02-12 22:27:38 +08:00
|
|
|
|
2012-11-21 10:21:28 +08:00
|
|
|
btrfs_free_delayed_extent_op(extent_op);
|
2012-03-12 23:03:00 +08:00
|
|
|
if (ret) {
|
2012-12-19 16:10:10 +08:00
|
|
|
btrfs_delayed_ref_unlock(locked_ref);
|
|
|
|
btrfs_put_delayed_ref(ref);
|
|
|
|
printk(KERN_DEBUG
|
|
|
|
"btrfs: run_one_delayed_ref returned %d\n", ret);
|
2012-04-18 14:59:03 +08:00
|
|
|
spin_lock(&delayed_refs->lock);
|
2012-03-12 23:03:00 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2012-12-19 16:10:10 +08:00
|
|
|
/*
|
|
|
|
* If this node is a head, that means all the refs in this head
|
|
|
|
* have been dealt with, and we will pick the next head to deal
|
|
|
|
* with, so we must unlock the head and drop it from the cluster
|
|
|
|
* list before we release it.
|
|
|
|
*/
|
|
|
|
if (btrfs_delayed_ref_is_head(ref)) {
|
|
|
|
list_del_init(&locked_ref->cluster);
|
|
|
|
btrfs_delayed_ref_unlock(locked_ref);
|
|
|
|
locked_ref = NULL;
|
|
|
|
}
|
|
|
|
btrfs_put_delayed_ref(ref);
|
|
|
|
count++;
|
2012-01-07 04:23:57 +08:00
|
|
|
next:
|
2009-03-13 22:17:05 +08:00
|
|
|
cond_resched();
|
|
|
|
spin_lock(&delayed_refs->lock);
|
|
|
|
}
|
|
|
|
return count;
|
|
|
|
}
|
|
|
|
|
2011-09-12 18:22:57 +08:00
|
|
|
#ifdef SCRAMBLE_DELAYED_REFS
|
|
|
|
/*
|
|
|
|
* Normally delayed refs get processed in ascending bytenr order. This
|
|
|
|
* correlates in most cases to the order added. To expose dependencies on this
|
|
|
|
* order, we start to process the tree in the middle instead of the beginning
|
|
|
|
*/
|
|
|
|
static u64 find_middle(struct rb_root *root)
|
|
|
|
{
|
|
|
|
struct rb_node *n = root->rb_node;
|
|
|
|
struct btrfs_delayed_ref_node *entry;
|
|
|
|
int alt = 1;
|
|
|
|
u64 middle;
|
|
|
|
u64 first = 0, last = 0;
|
|
|
|
|
|
|
|
n = rb_first(root);
|
|
|
|
if (n) {
|
|
|
|
entry = rb_entry(n, struct btrfs_delayed_ref_node, rb_node);
|
|
|
|
first = entry->bytenr;
|
|
|
|
}
|
|
|
|
n = rb_last(root);
|
|
|
|
if (n) {
|
|
|
|
entry = rb_entry(n, struct btrfs_delayed_ref_node, rb_node);
|
|
|
|
last = entry->bytenr;
|
|
|
|
}
|
|
|
|
n = root->rb_node;
|
|
|
|
|
|
|
|
while (n) {
|
|
|
|
entry = rb_entry(n, struct btrfs_delayed_ref_node, rb_node);
|
|
|
|
WARN_ON(!entry->in_tree);
|
|
|
|
|
|
|
|
middle = entry->bytenr;
|
|
|
|
|
|
|
|
if (alt)
|
|
|
|
n = n->rb_left;
|
|
|
|
else
|
|
|
|
n = n->rb_right;
|
|
|
|
|
|
|
|
alt = 1 - alt;
|
|
|
|
}
|
|
|
|
return middle;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2012-06-29 00:03:02 +08:00
|
|
|
int btrfs_delayed_refs_qgroup_accounting(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_fs_info *fs_info)
|
|
|
|
{
|
|
|
|
struct qgroup_update *qgroup_update;
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
if (list_empty(&trans->qgroup_ref_list) !=
|
|
|
|
!trans->delayed_ref_elem.seq) {
|
|
|
|
/* list without seq or seq without list */
|
|
|
|
printk(KERN_ERR "btrfs: qgroup accounting update error, list is%s empty, seq is %llu\n",
|
|
|
|
list_empty(&trans->qgroup_ref_list) ? "" : " not",
|
|
|
|
trans->delayed_ref_elem.seq);
|
|
|
|
BUG();
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!trans->delayed_ref_elem.seq)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
while (!list_empty(&trans->qgroup_ref_list)) {
|
|
|
|
qgroup_update = list_first_entry(&trans->qgroup_ref_list,
|
|
|
|
struct qgroup_update, list);
|
|
|
|
list_del(&qgroup_update->list);
|
|
|
|
if (!ret)
|
|
|
|
ret = btrfs_qgroup_account_ref(
|
|
|
|
trans, fs_info, qgroup_update->node,
|
|
|
|
qgroup_update->extent_op);
|
|
|
|
kfree(qgroup_update);
|
|
|
|
}
|
|
|
|
|
|
|
|
btrfs_put_tree_mod_seq(fs_info, &trans->delayed_ref_elem);
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2013-01-30 07:44:12 +08:00
|
|
|
static int refs_newer(struct btrfs_delayed_ref_root *delayed_refs, int seq,
|
|
|
|
int count)
|
|
|
|
{
|
|
|
|
int val = atomic_read(&delayed_refs->ref_seq);
|
|
|
|
|
|
|
|
if (val < seq || val >= seq + count)
|
|
|
|
return 1;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2009-03-13 22:17:05 +08:00
|
|
|
/*
|
|
|
|
* this starts processing the delayed reference count updates and
|
|
|
|
* extent insertions we have queued up so far. count can be
|
|
|
|
* 0, which means to process everything in the tree at the start
|
|
|
|
* of the run (but not newly added entries), or it can be some target
|
|
|
|
* number you'd like to process.
|
2012-03-12 23:03:00 +08:00
|
|
|
*
|
|
|
|
* Returns 0 on success or if called with an aborted transaction
|
|
|
|
* Returns <0 on error and aborts the transaction
|
2009-03-13 22:17:05 +08:00
|
|
|
*/
|
|
|
|
int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root, unsigned long count)
|
|
|
|
{
|
|
|
|
struct rb_node *node;
|
|
|
|
struct btrfs_delayed_ref_root *delayed_refs;
|
|
|
|
struct btrfs_delayed_ref_node *ref;
|
|
|
|
struct list_head cluster;
|
|
|
|
int ret;
|
2011-12-12 23:10:07 +08:00
|
|
|
u64 delayed_start;
|
2009-03-13 22:17:05 +08:00
|
|
|
int run_all = count == (unsigned long)-1;
|
|
|
|
int run_most = 0;
|
2012-08-07 04:18:51 +08:00
|
|
|
int loops;
|
2009-03-13 22:17:05 +08:00
|
|
|
|
2012-03-12 23:03:00 +08:00
|
|
|
/* We'll clean this up in btrfs_cleanup_transaction */
|
|
|
|
if (trans->aborted)
|
|
|
|
return 0;
|
|
|
|
|
2009-03-13 22:17:05 +08:00
|
|
|
if (root == root->fs_info->extent_root)
|
|
|
|
root = root->fs_info->tree_root;
|
|
|
|
|
2012-06-29 00:04:55 +08:00
|
|
|
btrfs_delayed_refs_qgroup_accounting(trans, root->fs_info);
|
|
|
|
|
2009-03-13 22:17:05 +08:00
|
|
|
delayed_refs = &trans->transaction->delayed_refs;
|
|
|
|
INIT_LIST_HEAD(&cluster);
|
2013-01-30 07:44:12 +08:00
|
|
|
if (count == 0) {
|
|
|
|
count = delayed_refs->num_entries * 2;
|
|
|
|
run_most = 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!run_all && !run_most) {
|
|
|
|
int old;
|
|
|
|
int seq = atomic_read(&delayed_refs->ref_seq);
|
|
|
|
|
|
|
|
progress:
|
|
|
|
old = atomic_cmpxchg(&delayed_refs->procs_running_refs, 0, 1);
|
|
|
|
if (old) {
|
|
|
|
DEFINE_WAIT(__wait);
|
|
|
|
if (delayed_refs->num_entries < 16348)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
prepare_to_wait(&delayed_refs->wait, &__wait,
|
|
|
|
TASK_UNINTERRUPTIBLE);
|
|
|
|
|
|
|
|
old = atomic_cmpxchg(&delayed_refs->procs_running_refs, 0, 1);
|
|
|
|
if (old) {
|
|
|
|
schedule();
|
|
|
|
finish_wait(&delayed_refs->wait, &__wait);
|
|
|
|
|
|
|
|
if (!refs_newer(delayed_refs, seq, 256))
|
|
|
|
goto progress;
|
|
|
|
else
|
|
|
|
return 0;
|
|
|
|
} else {
|
|
|
|
finish_wait(&delayed_refs->wait, &__wait);
|
|
|
|
goto again;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
} else {
|
|
|
|
atomic_inc(&delayed_refs->procs_running_refs);
|
|
|
|
}
|
|
|
|
|
2009-03-13 22:17:05 +08:00
|
|
|
again:
|
2012-08-07 04:18:51 +08:00
|
|
|
loops = 0;
|
2009-03-13 22:17:05 +08:00
|
|
|
spin_lock(&delayed_refs->lock);
|
2012-06-21 17:08:04 +08:00
|
|
|
|
2011-09-12 18:22:57 +08:00
|
|
|
#ifdef SCRAMBLE_DELAYED_REFS
|
|
|
|
delayed_refs->run_delayed_start = find_middle(&delayed_refs->root);
|
|
|
|
#endif
|
|
|
|
|
2009-03-13 22:17:05 +08:00
|
|
|
while (1) {
|
|
|
|
if (!(run_all || run_most) &&
|
|
|
|
delayed_refs->num_heads_ready < 64)
|
|
|
|
break;
|
2009-02-12 22:27:38 +08:00
|
|
|
|
2009-03-13 22:10:06 +08:00
|
|
|
/*
|
2009-03-13 22:17:05 +08:00
|
|
|
* go find something we can process in the rbtree. We start at
|
|
|
|
* the beginning of the tree, and then build a cluster
|
|
|
|
* of refs to process starting at the first one we are able to
|
|
|
|
* lock
|
2009-03-13 22:10:06 +08:00
|
|
|
*/
|
2011-12-12 23:10:07 +08:00
|
|
|
delayed_start = delayed_refs->run_delayed_start;
|
2009-03-13 22:17:05 +08:00
|
|
|
ret = btrfs_find_ref_cluster(trans, &cluster,
|
|
|
|
delayed_refs->run_delayed_start);
|
|
|
|
if (ret)
|
2009-03-13 22:10:06 +08:00
|
|
|
break;
|
|
|
|
|
2009-03-13 22:17:05 +08:00
|
|
|
ret = run_clustered_refs(trans, root, &cluster);
|
2012-03-12 23:03:00 +08:00
|
|
|
if (ret < 0) {
|
2012-12-19 16:10:10 +08:00
|
|
|
btrfs_release_ref_cluster(&cluster);
|
2012-03-12 23:03:00 +08:00
|
|
|
spin_unlock(&delayed_refs->lock);
|
|
|
|
btrfs_abort_transaction(trans, root, ret);
|
2013-01-30 07:44:12 +08:00
|
|
|
atomic_dec(&delayed_refs->procs_running_refs);
|
2012-03-12 23:03:00 +08:00
|
|
|
return ret;
|
|
|
|
}
|
2009-03-13 22:17:05 +08:00
|
|
|
|
2013-01-30 07:44:12 +08:00
|
|
|
atomic_add(ret, &delayed_refs->ref_seq);
|
|
|
|
|
2009-03-13 22:17:05 +08:00
|
|
|
count -= min_t(unsigned long, ret, count);
|
|
|
|
|
|
|
|
if (count == 0)
|
|
|
|
break;
|
2011-12-12 23:10:07 +08:00
|
|
|
|
2012-08-07 04:18:51 +08:00
|
|
|
if (delayed_start >= delayed_refs->run_delayed_start) {
|
|
|
|
if (loops == 0) {
|
|
|
|
/*
|
|
|
|
* btrfs_find_ref_cluster looped. let's do one
|
|
|
|
* more cycle. if we don't run any delayed ref
|
|
|
|
* during that cycle (because we can't because
|
|
|
|
* all of them are blocked), bail out.
|
|
|
|
*/
|
|
|
|
loops = 1;
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* no runnable refs left, stop trying
|
|
|
|
*/
|
|
|
|
BUG_ON(run_all);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (ret) {
|
2011-12-12 23:10:07 +08:00
|
|
|
/* refs were run, let's reset staleness detection */
|
2012-08-07 04:18:51 +08:00
|
|
|
loops = 0;
|
2011-12-12 23:10:07 +08:00
|
|
|
}
|
2009-02-12 22:27:38 +08:00
|
|
|
}
|
2009-03-13 22:17:05 +08:00
|
|
|
|
2009-03-13 22:10:06 +08:00
|
|
|
if (run_all) {
|
2012-09-12 04:57:25 +08:00
|
|
|
if (!list_empty(&trans->new_bgs)) {
|
|
|
|
spin_unlock(&delayed_refs->lock);
|
|
|
|
btrfs_create_pending_block_groups(trans, root);
|
|
|
|
spin_lock(&delayed_refs->lock);
|
|
|
|
}
|
|
|
|
|
2009-03-13 22:10:06 +08:00
|
|
|
node = rb_first(&delayed_refs->root);
|
2009-03-13 22:17:05 +08:00
|
|
|
if (!node)
|
2009-03-13 22:10:06 +08:00
|
|
|
goto out;
|
2009-03-13 22:17:05 +08:00
|
|
|
count = (unsigned long)-1;
|
2007-08-11 02:06:19 +08:00
|
|
|
|
2009-03-13 22:10:06 +08:00
|
|
|
while (node) {
|
|
|
|
ref = rb_entry(node, struct btrfs_delayed_ref_node,
|
|
|
|
rb_node);
|
|
|
|
if (btrfs_delayed_ref_is_head(ref)) {
|
|
|
|
struct btrfs_delayed_ref_head *head;
|
2007-04-02 23:20:42 +08:00
|
|
|
|
2009-03-13 22:10:06 +08:00
|
|
|
head = btrfs_delayed_node_to_head(ref);
|
|
|
|
atomic_inc(&ref->refs);
|
|
|
|
|
|
|
|
spin_unlock(&delayed_refs->lock);
|
2011-05-02 21:29:25 +08:00
|
|
|
/*
|
|
|
|
* Mutex was contended, block until it's
|
|
|
|
* released and try again
|
|
|
|
*/
|
2009-03-13 22:10:06 +08:00
|
|
|
mutex_lock(&head->mutex);
|
|
|
|
mutex_unlock(&head->mutex);
|
|
|
|
|
|
|
|
btrfs_put_delayed_ref(ref);
|
2009-03-13 22:11:24 +08:00
|
|
|
cond_resched();
|
2009-03-13 22:10:06 +08:00
|
|
|
goto again;
|
|
|
|
}
|
|
|
|
node = rb_next(node);
|
|
|
|
}
|
|
|
|
spin_unlock(&delayed_refs->lock);
|
|
|
|
schedule_timeout(1);
|
|
|
|
goto again;
|
2007-10-16 04:14:19 +08:00
|
|
|
}
|
2007-06-23 02:16:25 +08:00
|
|
|
out:
|
2013-01-30 07:44:12 +08:00
|
|
|
atomic_dec(&delayed_refs->procs_running_refs);
|
|
|
|
smp_mb();
|
|
|
|
if (waitqueue_active(&delayed_refs->wait))
|
|
|
|
wake_up(&delayed_refs->wait);
|
|
|
|
|
2009-03-13 22:17:05 +08:00
|
|
|
spin_unlock(&delayed_refs->lock);
|
2012-06-29 00:04:55 +08:00
|
|
|
assert_qgroups_uptodate(trans);
|
2007-03-07 09:08:01 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
int btrfs_set_disk_extent_flags(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
u64 bytenr, u64 num_bytes, u64 flags,
|
|
|
|
int is_data)
|
|
|
|
{
|
|
|
|
struct btrfs_delayed_extent_op *extent_op;
|
|
|
|
int ret;
|
|
|
|
|
2012-11-21 10:21:28 +08:00
|
|
|
extent_op = btrfs_alloc_delayed_extent_op();
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (!extent_op)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
extent_op->flags_to_set = flags;
|
|
|
|
extent_op->update_flags = 1;
|
|
|
|
extent_op->update_key = 0;
|
|
|
|
extent_op->is_data = is_data ? 1 : 0;
|
|
|
|
|
2011-09-12 21:26:38 +08:00
|
|
|
ret = btrfs_add_delayed_extent_op(root->fs_info, trans, bytenr,
|
|
|
|
num_bytes, extent_op);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (ret)
|
2012-11-21 10:21:28 +08:00
|
|
|
btrfs_free_delayed_extent_op(extent_op);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static noinline int check_delayed_ref(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
u64 objectid, u64 offset, u64 bytenr)
|
|
|
|
{
|
|
|
|
struct btrfs_delayed_ref_head *head;
|
|
|
|
struct btrfs_delayed_ref_node *ref;
|
|
|
|
struct btrfs_delayed_data_ref *data_ref;
|
|
|
|
struct btrfs_delayed_ref_root *delayed_refs;
|
|
|
|
struct rb_node *node;
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
ret = -ENOENT;
|
|
|
|
delayed_refs = &trans->transaction->delayed_refs;
|
|
|
|
spin_lock(&delayed_refs->lock);
|
|
|
|
head = btrfs_find_delayed_ref_head(trans, bytenr);
|
|
|
|
if (!head)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
if (!mutex_trylock(&head->mutex)) {
|
|
|
|
atomic_inc(&head->node.refs);
|
|
|
|
spin_unlock(&delayed_refs->lock);
|
|
|
|
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
|
2011-05-02 21:29:25 +08:00
|
|
|
/*
|
|
|
|
* Mutex was contended, block until it's released and let
|
|
|
|
* caller try again
|
|
|
|
*/
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
mutex_lock(&head->mutex);
|
|
|
|
mutex_unlock(&head->mutex);
|
|
|
|
btrfs_put_delayed_ref(&head->node);
|
|
|
|
return -EAGAIN;
|
|
|
|
}
|
|
|
|
|
|
|
|
node = rb_prev(&head->node.rb_node);
|
|
|
|
if (!node)
|
|
|
|
goto out_unlock;
|
|
|
|
|
|
|
|
ref = rb_entry(node, struct btrfs_delayed_ref_node, rb_node);
|
|
|
|
|
|
|
|
if (ref->bytenr != bytenr)
|
|
|
|
goto out_unlock;
|
|
|
|
|
|
|
|
ret = 1;
|
|
|
|
if (ref->type != BTRFS_EXTENT_DATA_REF_KEY)
|
|
|
|
goto out_unlock;
|
|
|
|
|
|
|
|
data_ref = btrfs_delayed_node_to_data_ref(ref);
|
|
|
|
|
|
|
|
node = rb_prev(node);
|
|
|
|
if (node) {
|
2012-07-23 19:50:03 +08:00
|
|
|
int seq = ref->seq;
|
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
ref = rb_entry(node, struct btrfs_delayed_ref_node, rb_node);
|
2012-07-23 19:50:03 +08:00
|
|
|
if (ref->bytenr == bytenr && ref->seq == seq)
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
goto out_unlock;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (data_ref->root != root->root_key.objectid ||
|
|
|
|
data_ref->objectid != objectid || data_ref->offset != offset)
|
|
|
|
goto out_unlock;
|
|
|
|
|
|
|
|
ret = 0;
|
|
|
|
out_unlock:
|
|
|
|
mutex_unlock(&head->mutex);
|
|
|
|
out:
|
|
|
|
spin_unlock(&delayed_refs->lock);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static noinline int check_committed_ref(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
u64 objectid, u64 offset, u64 bytenr)
|
2007-12-18 09:14:01 +08:00
|
|
|
{
|
|
|
|
struct btrfs_root *extent_root = root->fs_info->extent_root;
|
2008-07-30 21:26:11 +08:00
|
|
|
struct extent_buffer *leaf;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
struct btrfs_extent_data_ref *ref;
|
|
|
|
struct btrfs_extent_inline_ref *iref;
|
|
|
|
struct btrfs_extent_item *ei;
|
2008-07-30 21:26:11 +08:00
|
|
|
struct btrfs_key key;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
u32 item_size;
|
2007-12-18 09:14:01 +08:00
|
|
|
int ret;
|
2008-06-26 04:01:30 +08:00
|
|
|
|
2007-12-18 09:14:01 +08:00
|
|
|
key.objectid = bytenr;
|
2008-09-24 01:14:14 +08:00
|
|
|
key.offset = (u64)-1;
|
2008-07-30 21:26:11 +08:00
|
|
|
key.type = BTRFS_EXTENT_ITEM_KEY;
|
2007-12-18 09:14:01 +08:00
|
|
|
|
|
|
|
ret = btrfs_search_slot(NULL, extent_root, &key, path, 0, 0);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
2012-03-12 23:03:00 +08:00
|
|
|
BUG_ON(ret == 0); /* Corruption */
|
2008-10-31 02:20:02 +08:00
|
|
|
|
|
|
|
ret = -ENOENT;
|
|
|
|
if (path->slots[0] == 0)
|
2008-09-24 01:14:14 +08:00
|
|
|
goto out;
|
2007-12-18 09:14:01 +08:00
|
|
|
|
2008-09-24 01:14:14 +08:00
|
|
|
path->slots[0]--;
|
2008-07-30 21:26:11 +08:00
|
|
|
leaf = path->nodes[0];
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
|
2007-12-18 09:14:01 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (key.objectid != bytenr || key.type != BTRFS_EXTENT_ITEM_KEY)
|
2007-12-18 09:14:01 +08:00
|
|
|
goto out;
|
2008-07-30 21:26:11 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
ret = 1;
|
|
|
|
item_size = btrfs_item_size_nr(leaf, path->slots[0]);
|
|
|
|
#ifdef BTRFS_COMPAT_EXTENT_TREE_V0
|
|
|
|
if (item_size < sizeof(*ei)) {
|
|
|
|
WARN_ON(item_size != sizeof(struct btrfs_extent_item_v0));
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_extent_item);
|
2008-01-04 02:23:19 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (item_size != sizeof(*ei) +
|
|
|
|
btrfs_extent_inline_ref_size(BTRFS_EXTENT_DATA_REF_KEY))
|
|
|
|
goto out;
|
2007-12-18 09:14:01 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (btrfs_extent_generation(leaf, ei) <=
|
|
|
|
btrfs_root_last_snapshot(&root->root_item))
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
iref = (struct btrfs_extent_inline_ref *)(ei + 1);
|
|
|
|
if (btrfs_extent_inline_ref_type(leaf, iref) !=
|
|
|
|
BTRFS_EXTENT_DATA_REF_KEY)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
ref = (struct btrfs_extent_data_ref *)(&iref->offset);
|
|
|
|
if (btrfs_extent_refs(leaf, ei) !=
|
|
|
|
btrfs_extent_data_ref_count(leaf, ref) ||
|
|
|
|
btrfs_extent_data_ref_root(leaf, ref) !=
|
|
|
|
root->root_key.objectid ||
|
|
|
|
btrfs_extent_data_ref_objectid(leaf, ref) != objectid ||
|
|
|
|
btrfs_extent_data_ref_offset(leaf, ref) != offset)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
ret = 0;
|
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
int btrfs_cross_ref_exist(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
u64 objectid, u64 offset, u64 bytenr)
|
|
|
|
{
|
|
|
|
struct btrfs_path *path;
|
|
|
|
int ret;
|
|
|
|
int ret2;
|
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOENT;
|
|
|
|
|
|
|
|
do {
|
|
|
|
ret = check_committed_ref(trans, root, path, objectid,
|
|
|
|
offset, bytenr);
|
|
|
|
if (ret && ret != -ENOENT)
|
2008-07-30 21:26:11 +08:00
|
|
|
goto out;
|
2008-10-31 02:20:02 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
ret2 = check_delayed_ref(trans, root, path, objectid,
|
|
|
|
offset, bytenr);
|
|
|
|
} while (ret2 == -EAGAIN);
|
|
|
|
|
|
|
|
if (ret2 && ret2 != -ENOENT) {
|
|
|
|
ret = ret2;
|
|
|
|
goto out;
|
2008-07-30 21:26:11 +08:00
|
|
|
}
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
|
|
|
|
if (ret != -ENOENT || ret2 != -ENOENT)
|
|
|
|
ret = 0;
|
2007-12-18 09:14:01 +08:00
|
|
|
out:
|
2008-10-31 02:20:02 +08:00
|
|
|
btrfs_free_path(path);
|
2010-05-16 22:46:25 +08:00
|
|
|
if (root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID)
|
|
|
|
WARN_ON(ret > 0);
|
2008-07-30 21:26:11 +08:00
|
|
|
return ret;
|
2007-12-18 09:14:01 +08:00
|
|
|
}
|
2007-04-10 21:27:04 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
static int __btrfs_mod_ref(struct btrfs_trans_handle *trans,
|
2009-02-04 22:23:45 +08:00
|
|
|
struct btrfs_root *root,
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
struct extent_buffer *buf,
|
2011-09-12 21:26:38 +08:00
|
|
|
int full_backref, int inc, int for_cow)
|
2008-09-24 01:14:14 +08:00
|
|
|
{
|
|
|
|
u64 bytenr;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
u64 num_bytes;
|
|
|
|
u64 parent;
|
2008-09-24 01:14:14 +08:00
|
|
|
u64 ref_root;
|
|
|
|
u32 nritems;
|
|
|
|
struct btrfs_key key;
|
|
|
|
struct btrfs_file_extent_item *fi;
|
|
|
|
int i;
|
|
|
|
int level;
|
|
|
|
int ret = 0;
|
|
|
|
int (*process_func)(struct btrfs_trans_handle *, struct btrfs_root *,
|
2011-09-12 21:26:38 +08:00
|
|
|
u64, u64, u64, u64, u64, u64, int);
|
2008-09-24 01:14:14 +08:00
|
|
|
|
|
|
|
ref_root = btrfs_header_owner(buf);
|
|
|
|
nritems = btrfs_header_nritems(buf);
|
|
|
|
level = btrfs_header_level(buf);
|
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (!root->ref_cows && level == 0)
|
|
|
|
return 0;
|
2008-09-24 01:14:14 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (inc)
|
|
|
|
process_func = btrfs_inc_extent_ref;
|
|
|
|
else
|
|
|
|
process_func = btrfs_free_extent;
|
2008-09-24 01:14:14 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (full_backref)
|
|
|
|
parent = buf->start;
|
|
|
|
else
|
|
|
|
parent = 0;
|
|
|
|
|
|
|
|
for (i = 0; i < nritems; i++) {
|
2008-09-24 01:14:14 +08:00
|
|
|
if (level == 0) {
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
btrfs_item_key_to_cpu(buf, &key, i);
|
2008-09-24 01:14:14 +08:00
|
|
|
if (btrfs_key_type(&key) != BTRFS_EXTENT_DATA_KEY)
|
|
|
|
continue;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
fi = btrfs_item_ptr(buf, i,
|
2008-09-24 01:14:14 +08:00
|
|
|
struct btrfs_file_extent_item);
|
|
|
|
if (btrfs_file_extent_type(buf, fi) ==
|
|
|
|
BTRFS_FILE_EXTENT_INLINE)
|
|
|
|
continue;
|
|
|
|
bytenr = btrfs_file_extent_disk_bytenr(buf, fi);
|
|
|
|
if (bytenr == 0)
|
|
|
|
continue;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
|
|
|
|
num_bytes = btrfs_file_extent_disk_num_bytes(buf, fi);
|
|
|
|
key.offset -= btrfs_file_extent_offset(buf, fi);
|
|
|
|
ret = process_func(trans, root, bytenr, num_bytes,
|
|
|
|
parent, ref_root, key.objectid,
|
2011-09-12 21:26:38 +08:00
|
|
|
key.offset, for_cow);
|
2008-09-24 01:14:14 +08:00
|
|
|
if (ret)
|
|
|
|
goto fail;
|
|
|
|
} else {
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
bytenr = btrfs_node_blockptr(buf, i);
|
|
|
|
num_bytes = btrfs_level_size(root, level - 1);
|
|
|
|
ret = process_func(trans, root, bytenr, num_bytes,
|
2011-09-12 21:26:38 +08:00
|
|
|
parent, ref_root, level - 1, 0,
|
|
|
|
for_cow);
|
2008-09-24 01:14:14 +08:00
|
|
|
if (ret)
|
|
|
|
goto fail;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
fail:
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
int btrfs_inc_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
|
2011-09-12 21:26:38 +08:00
|
|
|
struct extent_buffer *buf, int full_backref, int for_cow)
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
{
|
2011-09-12 21:26:38 +08:00
|
|
|
return __btrfs_mod_ref(trans, root, buf, full_backref, 1, for_cow);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
int btrfs_dec_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
|
2011-09-12 21:26:38 +08:00
|
|
|
struct extent_buffer *buf, int full_backref, int for_cow)
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
{
|
2011-09-12 21:26:38 +08:00
|
|
|
return __btrfs_mod_ref(trans, root, buf, full_backref, 0, for_cow);
|
2008-09-24 01:14:14 +08:00
|
|
|
}
|
|
|
|
|
2007-04-27 04:46:15 +08:00
|
|
|
static int write_one_cache_group(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
struct btrfs_block_group_cache *cache)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
struct btrfs_root *extent_root = root->fs_info->extent_root;
|
2007-10-16 04:14:19 +08:00
|
|
|
unsigned long bi;
|
|
|
|
struct extent_buffer *leaf;
|
2007-04-27 04:46:15 +08:00
|
|
|
|
|
|
|
ret = btrfs_search_slot(trans, extent_root, &cache->key, path, 0, 1);
|
2007-06-23 02:16:25 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto fail;
|
2012-03-12 23:03:00 +08:00
|
|
|
BUG_ON(ret); /* Corruption */
|
2007-10-16 04:14:19 +08:00
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
bi = btrfs_item_ptr_offset(leaf, path->slots[0]);
|
|
|
|
write_extent_buffer(leaf, &cache->item, bi, sizeof(cache->item));
|
|
|
|
btrfs_mark_buffer_dirty(leaf);
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
2007-06-23 02:16:25 +08:00
|
|
|
fail:
|
2012-03-12 23:03:00 +08:00
|
|
|
if (ret) {
|
|
|
|
btrfs_abort_transaction(trans, root, ret);
|
2007-04-27 04:46:15 +08:00
|
|
|
return ret;
|
2012-03-12 23:03:00 +08:00
|
|
|
}
|
2007-04-27 04:46:15 +08:00
|
|
|
return 0;
|
|
|
|
|
|
|
|
}
|
|
|
|
|
2009-07-22 22:07:05 +08:00
|
|
|
static struct btrfs_block_group_cache *
|
|
|
|
next_block_group(struct btrfs_root *root,
|
|
|
|
struct btrfs_block_group_cache *cache)
|
|
|
|
{
|
|
|
|
struct rb_node *node;
|
|
|
|
spin_lock(&root->fs_info->block_group_cache_lock);
|
|
|
|
node = rb_next(&cache->cache_node);
|
|
|
|
btrfs_put_block_group(cache);
|
|
|
|
if (node) {
|
|
|
|
cache = rb_entry(node, struct btrfs_block_group_cache,
|
|
|
|
cache_node);
|
2009-11-14 04:12:59 +08:00
|
|
|
btrfs_get_block_group(cache);
|
2009-07-22 22:07:05 +08:00
|
|
|
} else
|
|
|
|
cache = NULL;
|
|
|
|
spin_unlock(&root->fs_info->block_group_cache_lock);
|
|
|
|
return cache;
|
|
|
|
}
|
|
|
|
|
2010-06-22 02:48:16 +08:00
|
|
|
static int cache_save_setup(struct btrfs_block_group_cache *block_group,
|
|
|
|
struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_path *path)
|
|
|
|
{
|
|
|
|
struct btrfs_root *root = block_group->fs_info->tree_root;
|
|
|
|
struct inode *inode = NULL;
|
|
|
|
u64 alloc_hint = 0;
|
2010-12-04 02:17:53 +08:00
|
|
|
int dcs = BTRFS_DC_ERROR;
|
2010-06-22 02:48:16 +08:00
|
|
|
int num_pages = 0;
|
|
|
|
int retries = 0;
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If this block group is smaller than 100 megs don't bother caching the
|
|
|
|
* block group.
|
|
|
|
*/
|
|
|
|
if (block_group->key.offset < (100 * 1024 * 1024)) {
|
|
|
|
spin_lock(&block_group->lock);
|
|
|
|
block_group->disk_cache_state = BTRFS_DC_WRITTEN;
|
|
|
|
spin_unlock(&block_group->lock);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
again:
|
|
|
|
inode = lookup_free_space_inode(root, block_group, path);
|
|
|
|
if (IS_ERR(inode) && PTR_ERR(inode) != -ENOENT) {
|
|
|
|
ret = PTR_ERR(inode);
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
2010-06-22 02:48:16 +08:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (IS_ERR(inode)) {
|
|
|
|
BUG_ON(retries);
|
|
|
|
retries++;
|
|
|
|
|
|
|
|
if (block_group->ro)
|
|
|
|
goto out_free;
|
|
|
|
|
|
|
|
ret = create_free_space_inode(root, trans, block_group, path);
|
|
|
|
if (ret)
|
|
|
|
goto out_free;
|
|
|
|
goto again;
|
|
|
|
}
|
|
|
|
|
2011-10-06 20:58:24 +08:00
|
|
|
/* We've already setup this transaction, go ahead and exit */
|
|
|
|
if (block_group->cache_generation == trans->transid &&
|
|
|
|
i_size_read(inode)) {
|
|
|
|
dcs = BTRFS_DC_SETUP;
|
|
|
|
goto out_put;
|
|
|
|
}
|
|
|
|
|
2010-06-22 02:48:16 +08:00
|
|
|
/*
|
|
|
|
* We want to set the generation to 0, that way if anything goes wrong
|
|
|
|
* from here on out we know not to trust this cache when we load up next
|
|
|
|
* time.
|
|
|
|
*/
|
|
|
|
BTRFS_I(inode)->generation = 0;
|
|
|
|
ret = btrfs_update_inode(trans, root, inode);
|
|
|
|
WARN_ON(ret);
|
|
|
|
|
|
|
|
if (i_size_read(inode) > 0) {
|
|
|
|
ret = btrfs_truncate_free_space_cache(root, trans, path,
|
|
|
|
inode);
|
|
|
|
if (ret)
|
|
|
|
goto out_put;
|
|
|
|
}
|
|
|
|
|
|
|
|
spin_lock(&block_group->lock);
|
Btrfs: fix a bug of writting free space cache during balance
Here is the whole story:
1)
A free space cache consists of two parts:
o free space cache inode, which is special becase it's stored in root tree.
o free space info, which is stored as the above inode's file data.
But we only build up another new inode and does not flush its free space info
onto disk when we _clear and setup_ free space cache, and this ends up with
that the block group cache's cache_state remains DC_SETUP instead of DC_WRITTEN.
And holding DC_SETUP means that we will not truncate this free space cache inode,
which means the disk offset of its file extent will remain _unchanged_ at least
until next transaction finishes committing itself.
2)
We can set a block group readonly when we relocate the block group.
However,
if the readonly block group covers the disk offset where our free space cache
inode is going to write, it will force the free space cache inode into
cow_file_range() and it'll end up hitting a BUG_ON.
3)
Due to the above analysis, we fix this bug by adding the missing dirty flag.
4)
However, it's not over, there is still another case, nospace_cache.
With nospace_cache, we do not want to set dirty flag, instead we just truncate
free space cache inode and bail out with setting cache state DC_WRITTEN.
We can benifit from it since it saves us another 'pre-allocation' part which
usually costs a lot.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-06 17:31:34 +08:00
|
|
|
if (block_group->cached != BTRFS_CACHE_FINISHED ||
|
|
|
|
!btrfs_test_opt(root, SPACE_CACHE)) {
|
|
|
|
/*
|
|
|
|
* don't bother trying to write stuff out _if_
|
|
|
|
* a) we're not cached,
|
|
|
|
* b) we're with nospace_cache mount option.
|
|
|
|
*/
|
2010-12-04 02:17:53 +08:00
|
|
|
dcs = BTRFS_DC_WRITTEN;
|
2010-06-22 02:48:16 +08:00
|
|
|
spin_unlock(&block_group->lock);
|
|
|
|
goto out_put;
|
|
|
|
}
|
|
|
|
spin_unlock(&block_group->lock);
|
|
|
|
|
2012-08-07 03:46:38 +08:00
|
|
|
/*
|
|
|
|
* Try to preallocate enough space based on how big the block group is.
|
|
|
|
* Keep in mind this has to include any pinned space which could end up
|
|
|
|
* taking up quite a bit since it's not folded into the other space
|
|
|
|
* cache.
|
|
|
|
*/
|
|
|
|
num_pages = (int)div64_u64(block_group->key.offset, 256 * 1024 * 1024);
|
2010-06-22 02:48:16 +08:00
|
|
|
if (!num_pages)
|
|
|
|
num_pages = 1;
|
|
|
|
|
|
|
|
num_pages *= 16;
|
|
|
|
num_pages *= PAGE_CACHE_SIZE;
|
|
|
|
|
|
|
|
ret = btrfs_check_data_free_space(inode, num_pages);
|
|
|
|
if (ret)
|
|
|
|
goto out_put;
|
|
|
|
|
|
|
|
ret = btrfs_prealloc_file_range_trans(inode, trans, 0, 0, num_pages,
|
|
|
|
num_pages, num_pages,
|
|
|
|
&alloc_hint);
|
2010-12-04 02:17:53 +08:00
|
|
|
if (!ret)
|
|
|
|
dcs = BTRFS_DC_SETUP;
|
2010-06-22 02:48:16 +08:00
|
|
|
btrfs_free_reserved_data_space(inode, num_pages);
|
2011-08-30 22:19:10 +08:00
|
|
|
|
2010-06-22 02:48:16 +08:00
|
|
|
out_put:
|
|
|
|
iput(inode);
|
|
|
|
out_free:
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
2010-06-22 02:48:16 +08:00
|
|
|
out:
|
|
|
|
spin_lock(&block_group->lock);
|
2011-12-14 05:04:54 +08:00
|
|
|
if (!ret && dcs == BTRFS_DC_SETUP)
|
2011-10-06 20:58:24 +08:00
|
|
|
block_group->cache_generation = trans->transid;
|
2010-12-04 02:17:53 +08:00
|
|
|
block_group->disk_cache_state = dcs;
|
2010-06-22 02:48:16 +08:00
|
|
|
spin_unlock(&block_group->lock);
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2007-10-16 04:15:19 +08:00
|
|
|
int btrfs_write_dirty_block_groups(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root)
|
2007-04-27 04:46:15 +08:00
|
|
|
{
|
2009-07-22 22:07:05 +08:00
|
|
|
struct btrfs_block_group_cache *cache;
|
2007-04-27 04:46:15 +08:00
|
|
|
int err = 0;
|
|
|
|
struct btrfs_path *path;
|
2007-10-16 04:15:19 +08:00
|
|
|
u64 last = 0;
|
2007-04-27 04:46:15 +08:00
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2010-06-22 02:48:16 +08:00
|
|
|
again:
|
|
|
|
while (1) {
|
|
|
|
cache = btrfs_lookup_first_block_group(root->fs_info, last);
|
|
|
|
while (cache) {
|
|
|
|
if (cache->disk_cache_state == BTRFS_DC_CLEAR)
|
|
|
|
break;
|
|
|
|
cache = next_block_group(root, cache);
|
|
|
|
}
|
|
|
|
if (!cache) {
|
|
|
|
if (last == 0)
|
|
|
|
break;
|
|
|
|
last = 0;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
err = cache_save_setup(cache, trans, path);
|
|
|
|
last = cache->key.objectid + cache->key.offset;
|
|
|
|
btrfs_put_block_group(cache);
|
|
|
|
}
|
|
|
|
|
2009-01-06 10:25:51 +08:00
|
|
|
while (1) {
|
2009-07-22 22:07:05 +08:00
|
|
|
if (last == 0) {
|
|
|
|
err = btrfs_run_delayed_refs(trans, root,
|
|
|
|
(unsigned long)-1);
|
2012-03-12 23:03:00 +08:00
|
|
|
if (err) /* File system offline */
|
|
|
|
goto out;
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
}
|
2007-06-23 02:16:25 +08:00
|
|
|
|
2009-07-22 22:07:05 +08:00
|
|
|
cache = btrfs_lookup_first_block_group(root->fs_info, last);
|
|
|
|
while (cache) {
|
2010-06-22 02:48:16 +08:00
|
|
|
if (cache->disk_cache_state == BTRFS_DC_CLEAR) {
|
|
|
|
btrfs_put_block_group(cache);
|
|
|
|
goto again;
|
|
|
|
}
|
|
|
|
|
2009-07-22 22:07:05 +08:00
|
|
|
if (cache->dirty)
|
|
|
|
break;
|
|
|
|
cache = next_block_group(root, cache);
|
|
|
|
}
|
|
|
|
if (!cache) {
|
|
|
|
if (last == 0)
|
|
|
|
break;
|
|
|
|
last = 0;
|
|
|
|
continue;
|
|
|
|
}
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
|
2010-07-03 00:14:14 +08:00
|
|
|
if (cache->disk_cache_state == BTRFS_DC_SETUP)
|
|
|
|
cache->disk_cache_state = BTRFS_DC_NEED_WRITE;
|
2008-09-26 22:05:48 +08:00
|
|
|
cache->dirty = 0;
|
2009-07-22 22:07:05 +08:00
|
|
|
last = cache->key.objectid + cache->key.offset;
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
|
2009-07-22 22:07:05 +08:00
|
|
|
err = write_one_cache_group(trans, root, path, cache);
|
2012-03-12 23:03:00 +08:00
|
|
|
if (err) /* File system offline */
|
|
|
|
goto out;
|
|
|
|
|
2009-07-22 22:07:05 +08:00
|
|
|
btrfs_put_block_group(cache);
|
2007-04-27 04:46:15 +08:00
|
|
|
}
|
2009-07-22 22:07:05 +08:00
|
|
|
|
2010-07-03 00:14:14 +08:00
|
|
|
while (1) {
|
|
|
|
/*
|
|
|
|
* I don't think this is needed since we're just marking our
|
|
|
|
* preallocated extent as written, but just in case it can't
|
|
|
|
* hurt.
|
|
|
|
*/
|
|
|
|
if (last == 0) {
|
|
|
|
err = btrfs_run_delayed_refs(trans, root,
|
|
|
|
(unsigned long)-1);
|
2012-03-12 23:03:00 +08:00
|
|
|
if (err) /* File system offline */
|
|
|
|
goto out;
|
2010-07-03 00:14:14 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
cache = btrfs_lookup_first_block_group(root->fs_info, last);
|
|
|
|
while (cache) {
|
|
|
|
/*
|
|
|
|
* Really this shouldn't happen, but it could if we
|
|
|
|
* couldn't write the entire preallocated extent and
|
|
|
|
* splitting the extent resulted in a new block.
|
|
|
|
*/
|
|
|
|
if (cache->dirty) {
|
|
|
|
btrfs_put_block_group(cache);
|
|
|
|
goto again;
|
|
|
|
}
|
|
|
|
if (cache->disk_cache_state == BTRFS_DC_NEED_WRITE)
|
|
|
|
break;
|
|
|
|
cache = next_block_group(root, cache);
|
|
|
|
}
|
|
|
|
if (!cache) {
|
|
|
|
if (last == 0)
|
|
|
|
break;
|
|
|
|
last = 0;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2012-03-12 23:03:00 +08:00
|
|
|
err = btrfs_write_out_cache(root, trans, cache, path);
|
2010-07-03 00:14:14 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If we didn't have an error then the cache state is still
|
|
|
|
* NEED_WRITE, so we can set it to WRITTEN.
|
|
|
|
*/
|
2012-03-12 23:03:00 +08:00
|
|
|
if (!err && cache->disk_cache_state == BTRFS_DC_NEED_WRITE)
|
2010-07-03 00:14:14 +08:00
|
|
|
cache->disk_cache_state = BTRFS_DC_WRITTEN;
|
|
|
|
last = cache->key.objectid + cache->key.offset;
|
|
|
|
btrfs_put_block_group(cache);
|
|
|
|
}
|
2012-03-12 23:03:00 +08:00
|
|
|
out:
|
2010-07-03 00:14:14 +08:00
|
|
|
|
2007-04-27 04:46:15 +08:00
|
|
|
btrfs_free_path(path);
|
2012-03-12 23:03:00 +08:00
|
|
|
return err;
|
2007-04-27 04:46:15 +08:00
|
|
|
}
|
|
|
|
|
2008-12-12 05:30:39 +08:00
|
|
|
int btrfs_extent_readonly(struct btrfs_root *root, u64 bytenr)
|
|
|
|
{
|
|
|
|
struct btrfs_block_group_cache *block_group;
|
|
|
|
int readonly = 0;
|
|
|
|
|
|
|
|
block_group = btrfs_lookup_block_group(root->fs_info, bytenr);
|
|
|
|
if (!block_group || block_group->ro)
|
|
|
|
readonly = 1;
|
|
|
|
if (block_group)
|
2009-04-03 21:47:43 +08:00
|
|
|
btrfs_put_block_group(block_group);
|
2008-12-12 05:30:39 +08:00
|
|
|
return readonly;
|
|
|
|
}
|
|
|
|
|
2008-03-26 04:50:33 +08:00
|
|
|
static int update_space_info(struct btrfs_fs_info *info, u64 flags,
|
|
|
|
u64 total_bytes, u64 bytes_used,
|
|
|
|
struct btrfs_space_info **space_info)
|
|
|
|
{
|
|
|
|
struct btrfs_space_info *found;
|
2010-05-16 22:46:24 +08:00
|
|
|
int i;
|
|
|
|
int factor;
|
|
|
|
|
|
|
|
if (flags & (BTRFS_BLOCK_GROUP_DUP | BTRFS_BLOCK_GROUP_RAID1 |
|
|
|
|
BTRFS_BLOCK_GROUP_RAID10))
|
|
|
|
factor = 2;
|
|
|
|
else
|
|
|
|
factor = 1;
|
2008-03-26 04:50:33 +08:00
|
|
|
|
|
|
|
found = __find_space_info(info, flags);
|
|
|
|
if (found) {
|
Btrfs: nuke fs wide allocation mutex V2
This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch
of little locks.
There is now a pinned_mutex, which is used when messing with the pinned_extents
extent io tree, and the extent_ins_mutex which is used with the pending_del and
extent_ins extent io trees.
The locking for the extent tree stuff was inspired by a patch that Yan Zheng
wrote to fix a race condition, I cleaned it up some and changed the locking
around a little bit, but the idea remains the same. Basically instead of
holding the extent_ins_mutex throughout the processing of an extent on the
extent_ins or pending_del trees, we just hold it while we're searching and when
we clear the bits on those trees, and lock the extent for the duration of the
operations on the extent.
Also to keep from getting hung up waiting to lock an extent, I've added a
try_lock_extent so if we cannot lock the extent, move on to the next one in the
tree and we'll come back to that one. I have tested this heavily and it does
not appear to break anything. This has to be applied on top of my
find_free_extent redo patch.
I tested this patch on top of Yan's space reblancing code and it worked fine.
The only thing that has changed since the last version is I pulled out all my
debugging stuff, apparently I forgot to run guilt refresh before I sent the
last patch out. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
spin_lock(&found->lock);
|
2008-03-26 04:50:33 +08:00
|
|
|
found->total_bytes += total_bytes;
|
2010-10-15 02:52:27 +08:00
|
|
|
found->disk_total += total_bytes * factor;
|
2008-03-26 04:50:33 +08:00
|
|
|
found->bytes_used += bytes_used;
|
2010-05-16 22:46:24 +08:00
|
|
|
found->disk_used += bytes_used * factor;
|
2008-04-26 04:53:30 +08:00
|
|
|
found->full = 0;
|
Btrfs: nuke fs wide allocation mutex V2
This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch
of little locks.
There is now a pinned_mutex, which is used when messing with the pinned_extents
extent io tree, and the extent_ins_mutex which is used with the pending_del and
extent_ins extent io trees.
The locking for the extent tree stuff was inspired by a patch that Yan Zheng
wrote to fix a race condition, I cleaned it up some and changed the locking
around a little bit, but the idea remains the same. Basically instead of
holding the extent_ins_mutex throughout the processing of an extent on the
extent_ins or pending_del trees, we just hold it while we're searching and when
we clear the bits on those trees, and lock the extent for the duration of the
operations on the extent.
Also to keep from getting hung up waiting to lock an extent, I've added a
try_lock_extent so if we cannot lock the extent, move on to the next one in the
tree and we'll come back to that one. I have tested this heavily and it does
not appear to break anything. This has to be applied on top of my
find_free_extent redo patch.
I tested this patch on top of Yan's space reblancing code and it worked fine.
The only thing that has changed since the last version is I pulled out all my
debugging stuff, apparently I forgot to run guilt refresh before I sent the
last patch out. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
spin_unlock(&found->lock);
|
2008-03-26 04:50:33 +08:00
|
|
|
*space_info = found;
|
|
|
|
return 0;
|
|
|
|
}
|
2008-11-13 03:34:12 +08:00
|
|
|
found = kzalloc(sizeof(*found), GFP_NOFS);
|
2008-03-26 04:50:33 +08:00
|
|
|
if (!found)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2010-05-16 22:46:24 +08:00
|
|
|
for (i = 0; i < BTRFS_NR_RAID_TYPES; i++)
|
|
|
|
INIT_LIST_HEAD(&found->block_groups[i]);
|
Btrfs: fix enospc when there is plenty of space
So there is an odd case where we can possibly return -ENOSPC when there is in
fact space to be had. It only happens with Metadata writes, and happens _very_
infrequently. What has to happen is we have to allocate have allocated out of
the first logical byte on the disk, which would set last_alloc to
first_logical_byte(root, 0), so search_start == orig_search_start. We then
need to allocate for normal metadata, so BTRFS_BLOCK_GROUP_METADATA |
BTRFS_BLOCK_GROUP_DUP. We will do a block lookup for the given search_start,
block_group_bits() won't match and we'll go to choose another block group.
However because search_start matches orig_search_start we go to see if we can
allocate a chunk.
If we are in the situation that we cannot allocate a chunk, we fail and ENOSPC.
This is kind of a big flaw of the way find_free_extent works, as it along with
find_free_space loop through _all_ of the block groups, not just the ones that
we want to allocate out of. This patch completely kills find_free_space and
rolls it into find_free_extent. I've introduced a sort of state machine into
this, which will make it easier to get cache miss information out of the
allocator, and will work well with my locking changes.
The basic flow is this: We have the variable loop which is 0, meaning we are
in the hint phase. We lookup the block group for the hint, and lookup the
space_info for what we want to allocate out of. If the block group we were
pointed at by the hint either isn't of the correct type, or just doesn't have
the space we need, we set head to space_info->block_groups, so we start at the
beginning of the block groups for this particular space info, and loop through.
This is also where we add the empty_cluster to total_needed. At this point
loop is set to 1 and we just loop through all of the block groups for this
particular space_info looking for the space we need, just as find_free_space
would have done, except we only hit the block groups we want and not _all_ of
the block groups. If we come full circle we see if we can allocate a chunk.
If we cannot of course we exit with -ENOSPC and we are good. If not we start
over at space_info->block_groups and loop through again, with loop == 2. If we
come full circle and haven't found what we need then we exit with -ENOSPC.
I've been running this for a couple of days now and it seems stable, and I
haven't yet hit a -ENOSPC when there was plenty of space left.
Also I've made a groups_sem to handle the group list for the space_info. This
is part of my locking changes, but is relatively safe and seems better than
holding the space_info spinlock over that entire search time. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
init_rwsem(&found->groups_sem);
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
spin_lock_init(&found->lock);
|
2012-01-17 04:04:47 +08:00
|
|
|
found->flags = flags & BTRFS_BLOCK_GROUP_TYPE_MASK;
|
2008-03-26 04:50:33 +08:00
|
|
|
found->total_bytes = total_bytes;
|
2010-10-15 02:52:27 +08:00
|
|
|
found->disk_total = total_bytes * factor;
|
2008-03-26 04:50:33 +08:00
|
|
|
found->bytes_used = bytes_used;
|
2010-05-16 22:46:24 +08:00
|
|
|
found->disk_used = bytes_used * factor;
|
2008-03-26 04:50:33 +08:00
|
|
|
found->bytes_pinned = 0;
|
2008-09-26 22:05:48 +08:00
|
|
|
found->bytes_reserved = 0;
|
2008-11-13 03:34:12 +08:00
|
|
|
found->bytes_readonly = 0;
|
2010-05-16 22:46:25 +08:00
|
|
|
found->bytes_may_use = 0;
|
2008-03-26 04:50:33 +08:00
|
|
|
found->full = 0;
|
2011-04-16 04:05:44 +08:00
|
|
|
found->force_alloc = CHUNK_ALLOC_NO_FORCE;
|
2011-04-12 08:20:11 +08:00
|
|
|
found->chunk_alloc = 0;
|
2011-06-08 04:07:44 +08:00
|
|
|
found->flush = 0;
|
|
|
|
init_waitqueue_head(&found->wait);
|
2008-03-26 04:50:33 +08:00
|
|
|
*space_info = found;
|
2009-03-11 00:39:20 +08:00
|
|
|
list_add_rcu(&found->list, &info->space_info);
|
2012-07-10 10:21:07 +08:00
|
|
|
if (flags & BTRFS_BLOCK_GROUP_DATA)
|
|
|
|
info->data_sinfo = found;
|
2008-03-26 04:50:33 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2008-04-04 04:29:03 +08:00
|
|
|
static void set_avail_alloc_bits(struct btrfs_fs_info *fs_info, u64 flags)
|
|
|
|
{
|
2012-03-27 22:09:16 +08:00
|
|
|
u64 extra_flags = chunk_to_extended(flags) &
|
|
|
|
BTRFS_EXTENDED_PROFILE_MASK;
|
2012-01-17 04:04:47 +08:00
|
|
|
|
2013-01-29 18:13:12 +08:00
|
|
|
write_seqlock(&fs_info->profiles_lock);
|
2012-01-17 04:04:47 +08:00
|
|
|
if (flags & BTRFS_BLOCK_GROUP_DATA)
|
|
|
|
fs_info->avail_data_alloc_bits |= extra_flags;
|
|
|
|
if (flags & BTRFS_BLOCK_GROUP_METADATA)
|
|
|
|
fs_info->avail_metadata_alloc_bits |= extra_flags;
|
|
|
|
if (flags & BTRFS_BLOCK_GROUP_SYSTEM)
|
|
|
|
fs_info->avail_system_alloc_bits |= extra_flags;
|
2013-01-29 18:13:12 +08:00
|
|
|
write_sequnlock(&fs_info->profiles_lock);
|
2008-04-04 04:29:03 +08:00
|
|
|
}
|
2008-03-26 04:50:33 +08:00
|
|
|
|
2012-03-27 22:09:17 +08:00
|
|
|
/*
|
|
|
|
* returns target flags in extended format or 0 if restripe for this
|
|
|
|
* chunk_type is not in progress
|
2012-04-13 04:03:56 +08:00
|
|
|
*
|
|
|
|
* should be called with either volume_mutex or balance_lock held
|
2012-03-27 22:09:17 +08:00
|
|
|
*/
|
|
|
|
static u64 get_restripe_target(struct btrfs_fs_info *fs_info, u64 flags)
|
|
|
|
{
|
|
|
|
struct btrfs_balance_control *bctl = fs_info->balance_ctl;
|
|
|
|
u64 target = 0;
|
|
|
|
|
|
|
|
if (!bctl)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
if (flags & BTRFS_BLOCK_GROUP_DATA &&
|
|
|
|
bctl->data.flags & BTRFS_BALANCE_ARGS_CONVERT) {
|
|
|
|
target = BTRFS_BLOCK_GROUP_DATA | bctl->data.target;
|
|
|
|
} else if (flags & BTRFS_BLOCK_GROUP_SYSTEM &&
|
|
|
|
bctl->sys.flags & BTRFS_BALANCE_ARGS_CONVERT) {
|
|
|
|
target = BTRFS_BLOCK_GROUP_SYSTEM | bctl->sys.target;
|
|
|
|
} else if (flags & BTRFS_BLOCK_GROUP_METADATA &&
|
|
|
|
bctl->meta.flags & BTRFS_BALANCE_ARGS_CONVERT) {
|
|
|
|
target = BTRFS_BLOCK_GROUP_METADATA | bctl->meta.target;
|
|
|
|
}
|
|
|
|
|
|
|
|
return target;
|
|
|
|
}
|
|
|
|
|
2012-01-17 04:04:47 +08:00
|
|
|
/*
|
|
|
|
* @flags: available profiles in extended format (see ctree.h)
|
|
|
|
*
|
2012-01-17 04:04:48 +08:00
|
|
|
* Returns reduced profile in chunk format. If profile changing is in
|
|
|
|
* progress (either running or paused) picks the target profile (if it's
|
|
|
|
* already available), otherwise falls back to plain reducing.
|
2012-01-17 04:04:47 +08:00
|
|
|
*/
|
2008-11-18 10:11:30 +08:00
|
|
|
u64 btrfs_reduce_alloc_profile(struct btrfs_root *root, u64 flags)
|
2008-04-29 03:29:52 +08:00
|
|
|
{
|
2010-12-14 03:56:23 +08:00
|
|
|
/*
|
|
|
|
* we add in the count of missing devices because we want
|
|
|
|
* to make sure that any RAID levels on a degraded FS
|
|
|
|
* continue to be honored.
|
|
|
|
*/
|
|
|
|
u64 num_devices = root->fs_info->fs_devices->rw_devices +
|
|
|
|
root->fs_info->fs_devices->missing_devices;
|
2012-03-27 22:09:17 +08:00
|
|
|
u64 target;
|
2013-01-30 07:40:14 +08:00
|
|
|
u64 tmp;
|
2008-05-07 23:43:44 +08:00
|
|
|
|
2012-03-27 22:09:17 +08:00
|
|
|
/*
|
|
|
|
* see if restripe for this chunk_type is in progress, if so
|
|
|
|
* try to reduce to the target profile
|
|
|
|
*/
|
2012-01-17 04:04:48 +08:00
|
|
|
spin_lock(&root->fs_info->balance_lock);
|
2012-03-27 22:09:17 +08:00
|
|
|
target = get_restripe_target(root->fs_info, flags);
|
|
|
|
if (target) {
|
|
|
|
/* pick target profile only if it's already available */
|
|
|
|
if ((flags & target) & BTRFS_EXTENDED_PROFILE_MASK) {
|
2012-01-17 04:04:48 +08:00
|
|
|
spin_unlock(&root->fs_info->balance_lock);
|
2012-03-27 22:09:17 +08:00
|
|
|
return extended_to_chunk(target);
|
2012-01-17 04:04:48 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
spin_unlock(&root->fs_info->balance_lock);
|
|
|
|
|
2013-01-30 07:40:14 +08:00
|
|
|
/* First, mask out the RAID levels which aren't possible */
|
2008-05-07 23:43:44 +08:00
|
|
|
if (num_devices == 1)
|
2013-01-30 07:40:14 +08:00
|
|
|
flags &= ~(BTRFS_BLOCK_GROUP_RAID1 | BTRFS_BLOCK_GROUP_RAID0 |
|
|
|
|
BTRFS_BLOCK_GROUP_RAID5);
|
|
|
|
if (num_devices < 3)
|
|
|
|
flags &= ~BTRFS_BLOCK_GROUP_RAID6;
|
2008-05-07 23:43:44 +08:00
|
|
|
if (num_devices < 4)
|
|
|
|
flags &= ~BTRFS_BLOCK_GROUP_RAID10;
|
|
|
|
|
2013-01-30 07:40:14 +08:00
|
|
|
tmp = flags & (BTRFS_BLOCK_GROUP_DUP | BTRFS_BLOCK_GROUP_RAID0 |
|
|
|
|
BTRFS_BLOCK_GROUP_RAID1 | BTRFS_BLOCK_GROUP_RAID5 |
|
|
|
|
BTRFS_BLOCK_GROUP_RAID6 | BTRFS_BLOCK_GROUP_RAID10);
|
|
|
|
flags &= ~tmp;
|
2008-04-29 03:29:52 +08:00
|
|
|
|
2013-01-30 07:40:14 +08:00
|
|
|
if (tmp & BTRFS_BLOCK_GROUP_RAID6)
|
|
|
|
tmp = BTRFS_BLOCK_GROUP_RAID6;
|
|
|
|
else if (tmp & BTRFS_BLOCK_GROUP_RAID5)
|
|
|
|
tmp = BTRFS_BLOCK_GROUP_RAID5;
|
|
|
|
else if (tmp & BTRFS_BLOCK_GROUP_RAID10)
|
|
|
|
tmp = BTRFS_BLOCK_GROUP_RAID10;
|
|
|
|
else if (tmp & BTRFS_BLOCK_GROUP_RAID1)
|
|
|
|
tmp = BTRFS_BLOCK_GROUP_RAID1;
|
|
|
|
else if (tmp & BTRFS_BLOCK_GROUP_RAID0)
|
|
|
|
tmp = BTRFS_BLOCK_GROUP_RAID0;
|
2012-01-17 04:04:47 +08:00
|
|
|
|
2013-01-30 07:40:14 +08:00
|
|
|
return extended_to_chunk(flags | tmp);
|
2008-04-29 03:29:52 +08:00
|
|
|
}
|
|
|
|
|
2010-05-16 22:46:24 +08:00
|
|
|
static u64 get_alloc_profile(struct btrfs_root *root, u64 flags)
|
2009-02-21 00:00:09 +08:00
|
|
|
{
|
2013-01-29 18:13:12 +08:00
|
|
|
unsigned seq;
|
|
|
|
|
|
|
|
do {
|
|
|
|
seq = read_seqbegin(&root->fs_info->profiles_lock);
|
|
|
|
|
|
|
|
if (flags & BTRFS_BLOCK_GROUP_DATA)
|
|
|
|
flags |= root->fs_info->avail_data_alloc_bits;
|
|
|
|
else if (flags & BTRFS_BLOCK_GROUP_SYSTEM)
|
|
|
|
flags |= root->fs_info->avail_system_alloc_bits;
|
|
|
|
else if (flags & BTRFS_BLOCK_GROUP_METADATA)
|
|
|
|
flags |= root->fs_info->avail_metadata_alloc_bits;
|
|
|
|
} while (read_seqretry(&root->fs_info->profiles_lock, seq));
|
2012-01-17 04:04:47 +08:00
|
|
|
|
2010-05-16 22:46:24 +08:00
|
|
|
return btrfs_reduce_alloc_profile(root, flags);
|
2009-02-21 00:00:09 +08:00
|
|
|
}
|
|
|
|
|
btrfs: fix wrong free space information of btrfs
When we store data by raid profile in btrfs with two or more different size
disks, df command shows there is some free space in the filesystem, but the
user can not write any data in fact, df command shows the wrong free space
information of btrfs.
# mkfs.btrfs -d raid1 /dev/sda9 /dev/sda10
# btrfs-show
Label: none uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64
Total devices 2 FS bytes used 28.00KB
devid 1 size 5.01GB used 2.03GB path /dev/sda9
devid 2 size 10.00GB used 2.01GB path /dev/sda10
# btrfs device scan /dev/sda9 /dev/sda10
# mount /dev/sda9 /mnt
# dd if=/dev/zero of=tmpfile0 bs=4K count=9999999999
(fill the filesystem)
# sync
# df -TH
Filesystem Type Size Used Avail Use% Mounted on
/dev/sda9 btrfs 17G 8.6G 5.4G 62% /mnt
# btrfs-show
Label: none uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64
Total devices 2 FS bytes used 3.99GB
devid 1 size 5.01GB used 5.01GB path /dev/sda9
devid 2 size 10.00GB used 4.99GB path /dev/sda10
It is because btrfs cannot allocate chunks when one of the pairing disks has
no space, the free space on the other disks can not be used for ever, and should
be subtracted from the total space, but btrfs doesn't subtract this space from
the total. It is strange to the user.
This patch fixes it by calcing the free space that can be used to allocate
chunks.
Implementation:
1. get all the devices free space, and align them by stripe length.
2. sort the devices by the free space.
3. check the free space of the devices,
3.1. if it is not zero, and then check the number of the devices that has
more free space than this device,
if the number of the devices is beyond the min stripe number, the free
space can be used, and add into total free space.
if the number of the devices is below the min stripe number, we can not
use the free space, the check ends.
3.2. if the free space is zero, check the next devices, goto 3.1
This implementation is just likely fake chunk allocation.
After appling this patch, df can show correct space information:
# df -TH
Filesystem Type Size Used Avail Use% Mounted on
/dev/sda9 btrfs 17G 8.6G 0 100% /mnt
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-05 18:07:31 +08:00
|
|
|
u64 btrfs_get_alloc_profile(struct btrfs_root *root, int data)
|
2009-09-12 04:12:44 +08:00
|
|
|
{
|
2010-05-16 22:46:24 +08:00
|
|
|
u64 flags;
|
2013-01-30 07:40:14 +08:00
|
|
|
u64 ret;
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2010-05-16 22:46:24 +08:00
|
|
|
if (data)
|
|
|
|
flags = BTRFS_BLOCK_GROUP_DATA;
|
|
|
|
else if (root == root->fs_info->chunk_root)
|
|
|
|
flags = BTRFS_BLOCK_GROUP_SYSTEM;
|
2009-09-12 04:12:44 +08:00
|
|
|
else
|
2010-05-16 22:46:24 +08:00
|
|
|
flags = BTRFS_BLOCK_GROUP_METADATA;
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2013-01-30 07:40:14 +08:00
|
|
|
ret = get_alloc_profile(root, flags);
|
|
|
|
return ret;
|
2009-02-21 00:00:09 +08:00
|
|
|
}
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2009-02-21 00:00:09 +08:00
|
|
|
/*
|
|
|
|
* This will check the space that the inode allocates from to make sure we have
|
|
|
|
* enough space for bytes.
|
|
|
|
*/
|
2010-05-16 22:48:47 +08:00
|
|
|
int btrfs_check_data_free_space(struct inode *inode, u64 bytes)
|
2009-02-21 00:00:09 +08:00
|
|
|
{
|
|
|
|
struct btrfs_space_info *data_sinfo;
|
2010-05-16 22:48:47 +08:00
|
|
|
struct btrfs_root *root = BTRFS_I(inode)->root;
|
2012-07-10 10:21:07 +08:00
|
|
|
struct btrfs_fs_info *fs_info = root->fs_info;
|
2010-03-19 22:38:13 +08:00
|
|
|
u64 used;
|
2010-06-22 02:48:16 +08:00
|
|
|
int ret = 0, committed = 0, alloc_chunk = 1;
|
2009-02-21 00:00:09 +08:00
|
|
|
|
|
|
|
/* make sure bytes are sectorsize aligned */
|
2013-02-26 16:10:22 +08:00
|
|
|
bytes = ALIGN(bytes, root->sectorsize);
|
2009-02-21 00:00:09 +08:00
|
|
|
|
2011-04-20 10:33:24 +08:00
|
|
|
if (root == root->fs_info->tree_root ||
|
|
|
|
BTRFS_I(inode)->location.objectid == BTRFS_FREE_INO_OBJECTID) {
|
2010-06-22 02:48:16 +08:00
|
|
|
alloc_chunk = 0;
|
|
|
|
committed = 1;
|
|
|
|
}
|
|
|
|
|
2012-07-10 10:21:07 +08:00
|
|
|
data_sinfo = fs_info->data_sinfo;
|
2009-09-23 02:45:50 +08:00
|
|
|
if (!data_sinfo)
|
|
|
|
goto alloc;
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2009-02-21 00:00:09 +08:00
|
|
|
again:
|
|
|
|
/* make sure we have enough space to handle the data first */
|
|
|
|
spin_lock(&data_sinfo->lock);
|
2010-05-16 22:49:58 +08:00
|
|
|
used = data_sinfo->bytes_used + data_sinfo->bytes_reserved +
|
|
|
|
data_sinfo->bytes_pinned + data_sinfo->bytes_readonly +
|
|
|
|
data_sinfo->bytes_may_use;
|
2010-03-19 22:38:13 +08:00
|
|
|
|
|
|
|
if (used + bytes > data_sinfo->total_bytes) {
|
2009-02-20 23:59:53 +08:00
|
|
|
struct btrfs_trans_handle *trans;
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2009-02-21 00:00:09 +08:00
|
|
|
/*
|
|
|
|
* if we don't have enough free bytes in this space then we need
|
|
|
|
* to alloc a new chunk.
|
|
|
|
*/
|
2010-06-22 02:48:16 +08:00
|
|
|
if (!data_sinfo->full && alloc_chunk) {
|
2009-02-21 00:00:09 +08:00
|
|
|
u64 alloc_target;
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2011-04-16 04:05:44 +08:00
|
|
|
data_sinfo->force_alloc = CHUNK_ALLOC_FORCE;
|
2009-02-21 00:00:09 +08:00
|
|
|
spin_unlock(&data_sinfo->lock);
|
2009-09-23 02:45:50 +08:00
|
|
|
alloc:
|
2009-02-21 00:00:09 +08:00
|
|
|
alloc_target = btrfs_get_alloc_profile(root, 1);
|
2011-04-14 00:54:33 +08:00
|
|
|
trans = btrfs_join_transaction(root);
|
2010-05-16 22:48:46 +08:00
|
|
|
if (IS_ERR(trans))
|
|
|
|
return PTR_ERR(trans);
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2009-02-21 00:00:09 +08:00
|
|
|
ret = do_chunk_alloc(trans, root->fs_info->extent_root,
|
2011-04-16 04:05:44 +08:00
|
|
|
alloc_target,
|
|
|
|
CHUNK_ALLOC_NO_FORCE);
|
2009-02-21 00:00:09 +08:00
|
|
|
btrfs_end_transaction(trans, root);
|
2011-01-05 18:07:18 +08:00
|
|
|
if (ret < 0) {
|
|
|
|
if (ret != -ENOSPC)
|
|
|
|
return ret;
|
|
|
|
else
|
|
|
|
goto commit_trans;
|
|
|
|
}
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2012-07-10 10:21:07 +08:00
|
|
|
if (!data_sinfo)
|
|
|
|
data_sinfo = fs_info->data_sinfo;
|
|
|
|
|
2009-02-21 00:00:09 +08:00
|
|
|
goto again;
|
|
|
|
}
|
2011-05-26 01:10:16 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If we have less pinned bytes than we want to allocate then
|
|
|
|
* don't bother committing the transaction, it won't help us.
|
|
|
|
*/
|
|
|
|
if (data_sinfo->bytes_pinned < bytes)
|
|
|
|
committed = 1;
|
2009-02-21 00:00:09 +08:00
|
|
|
spin_unlock(&data_sinfo->lock);
|
|
|
|
|
2009-02-20 23:59:53 +08:00
|
|
|
/* commit the current transaction and try again */
|
2011-01-05 18:07:18 +08:00
|
|
|
commit_trans:
|
2011-04-12 05:25:13 +08:00
|
|
|
if (!committed &&
|
|
|
|
!atomic_read(&root->fs_info->open_ioctl_trans)) {
|
2009-02-20 23:59:53 +08:00
|
|
|
committed = 1;
|
2011-04-14 00:54:33 +08:00
|
|
|
trans = btrfs_join_transaction(root);
|
2010-05-16 22:48:46 +08:00
|
|
|
if (IS_ERR(trans))
|
|
|
|
return PTR_ERR(trans);
|
2009-02-20 23:59:53 +08:00
|
|
|
ret = btrfs_commit_transaction(trans, root);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
goto again;
|
|
|
|
}
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2009-02-21 00:00:09 +08:00
|
|
|
return -ENOSPC;
|
|
|
|
}
|
|
|
|
data_sinfo->bytes_may_use += bytes;
|
2012-01-10 23:31:31 +08:00
|
|
|
trace_btrfs_space_reservation(root->fs_info, "space_info",
|
2012-03-29 21:57:44 +08:00
|
|
|
data_sinfo->flags, bytes, 1);
|
2009-02-21 00:00:09 +08:00
|
|
|
spin_unlock(&data_sinfo->lock);
|
|
|
|
|
2009-09-12 04:12:44 +08:00
|
|
|
return 0;
|
|
|
|
}
|
2009-02-21 00:00:09 +08:00
|
|
|
|
|
|
|
/*
|
2011-07-27 05:00:46 +08:00
|
|
|
* Called if we need to clear a data reservation for this inode.
|
2009-02-21 00:00:09 +08:00
|
|
|
*/
|
2010-05-16 22:48:47 +08:00
|
|
|
void btrfs_free_reserved_data_space(struct inode *inode, u64 bytes)
|
2009-10-08 08:44:34 +08:00
|
|
|
{
|
2010-05-16 22:48:47 +08:00
|
|
|
struct btrfs_root *root = BTRFS_I(inode)->root;
|
2009-02-21 00:00:09 +08:00
|
|
|
struct btrfs_space_info *data_sinfo;
|
2009-10-08 08:44:34 +08:00
|
|
|
|
2009-02-21 00:00:09 +08:00
|
|
|
/* make sure bytes are sectorsize aligned */
|
2013-02-26 16:10:22 +08:00
|
|
|
bytes = ALIGN(bytes, root->sectorsize);
|
2009-10-08 08:44:34 +08:00
|
|
|
|
2012-07-10 10:21:07 +08:00
|
|
|
data_sinfo = root->fs_info->data_sinfo;
|
2009-02-21 00:00:09 +08:00
|
|
|
spin_lock(&data_sinfo->lock);
|
|
|
|
data_sinfo->bytes_may_use -= bytes;
|
2012-01-10 23:31:31 +08:00
|
|
|
trace_btrfs_space_reservation(root->fs_info, "space_info",
|
2012-03-29 21:57:44 +08:00
|
|
|
data_sinfo->flags, bytes, 0);
|
2009-02-21 00:00:09 +08:00
|
|
|
spin_unlock(&data_sinfo->lock);
|
2009-10-08 08:44:34 +08:00
|
|
|
}
|
|
|
|
|
2009-04-22 05:40:57 +08:00
|
|
|
static void force_metadata_allocation(struct btrfs_fs_info *info)
|
2009-10-08 08:44:34 +08:00
|
|
|
{
|
2009-04-22 05:40:57 +08:00
|
|
|
struct list_head *head = &info->space_info;
|
|
|
|
struct btrfs_space_info *found;
|
2009-10-08 08:44:34 +08:00
|
|
|
|
2009-04-22 05:40:57 +08:00
|
|
|
rcu_read_lock();
|
|
|
|
list_for_each_entry_rcu(found, head, list) {
|
|
|
|
if (found->flags & BTRFS_BLOCK_GROUP_METADATA)
|
2011-04-16 04:05:44 +08:00
|
|
|
found->force_alloc = CHUNK_ALLOC_FORCE;
|
2009-10-08 08:44:34 +08:00
|
|
|
}
|
2009-04-22 05:40:57 +08:00
|
|
|
rcu_read_unlock();
|
2009-10-08 08:44:34 +08:00
|
|
|
}
|
|
|
|
|
2010-10-27 01:37:56 +08:00
|
|
|
static int should_alloc_chunk(struct btrfs_root *root,
|
2012-09-13 02:08:47 +08:00
|
|
|
struct btrfs_space_info *sinfo, int force)
|
2009-10-09 01:34:05 +08:00
|
|
|
{
|
2011-07-27 05:00:46 +08:00
|
|
|
struct btrfs_block_rsv *global_rsv = &root->fs_info->global_block_rsv;
|
2010-05-16 22:46:25 +08:00
|
|
|
u64 num_bytes = sinfo->total_bytes - sinfo->bytes_readonly;
|
2011-04-16 04:05:44 +08:00
|
|
|
u64 num_allocated = sinfo->bytes_used + sinfo->bytes_reserved;
|
2010-10-27 01:37:56 +08:00
|
|
|
u64 thresh;
|
2009-10-08 08:44:34 +08:00
|
|
|
|
2011-04-16 04:05:44 +08:00
|
|
|
if (force == CHUNK_ALLOC_FORCE)
|
|
|
|
return 1;
|
|
|
|
|
2011-07-27 05:00:46 +08:00
|
|
|
/*
|
|
|
|
* We need to take into account the global rsv because for all intents
|
|
|
|
* and purposes it's used space. Don't worry about locking the
|
|
|
|
* global_rsv, it doesn't change except when the transaction commits.
|
|
|
|
*/
|
2012-08-15 04:20:52 +08:00
|
|
|
if (sinfo->flags & BTRFS_BLOCK_GROUP_METADATA)
|
|
|
|
num_allocated += global_rsv->size;
|
2011-07-27 05:00:46 +08:00
|
|
|
|
2011-04-16 04:05:44 +08:00
|
|
|
/*
|
|
|
|
* in limited mode, we want to have some free space up to
|
|
|
|
* about 1% of the FS size.
|
|
|
|
*/
|
|
|
|
if (force == CHUNK_ALLOC_LIMITED) {
|
2011-04-13 21:41:04 +08:00
|
|
|
thresh = btrfs_super_total_bytes(root->fs_info->super_copy);
|
2011-04-16 04:05:44 +08:00
|
|
|
thresh = max_t(u64, 64 * 1024 * 1024,
|
|
|
|
div_factor_fine(thresh, 1));
|
|
|
|
|
|
|
|
if (num_bytes - num_allocated < thresh)
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2012-09-13 02:08:47 +08:00
|
|
|
if (num_allocated + 2 * 1024 * 1024 < div_factor(num_bytes, 8))
|
2010-10-16 03:23:48 +08:00
|
|
|
return 0;
|
2010-05-16 22:46:25 +08:00
|
|
|
return 1;
|
2009-10-09 01:34:05 +08:00
|
|
|
}
|
|
|
|
|
2012-03-29 21:57:44 +08:00
|
|
|
static u64 get_system_chunk_thresh(struct btrfs_root *root, u64 type)
|
|
|
|
{
|
|
|
|
u64 num_dev;
|
|
|
|
|
2013-01-30 07:40:14 +08:00
|
|
|
if (type & (BTRFS_BLOCK_GROUP_RAID10 |
|
|
|
|
BTRFS_BLOCK_GROUP_RAID0 |
|
|
|
|
BTRFS_BLOCK_GROUP_RAID5 |
|
|
|
|
BTRFS_BLOCK_GROUP_RAID6))
|
2012-03-29 21:57:44 +08:00
|
|
|
num_dev = root->fs_info->fs_devices->rw_devices;
|
|
|
|
else if (type & BTRFS_BLOCK_GROUP_RAID1)
|
|
|
|
num_dev = 2;
|
|
|
|
else
|
|
|
|
num_dev = 1; /* DUP or single */
|
|
|
|
|
|
|
|
/* metadata for updaing devices and chunk tree */
|
|
|
|
return btrfs_calc_trans_metadata_size(root, num_dev + 1);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void check_system_chunk(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root, u64 type)
|
|
|
|
{
|
|
|
|
struct btrfs_space_info *info;
|
|
|
|
u64 left;
|
|
|
|
u64 thresh;
|
|
|
|
|
|
|
|
info = __find_space_info(root->fs_info, BTRFS_BLOCK_GROUP_SYSTEM);
|
|
|
|
spin_lock(&info->lock);
|
|
|
|
left = info->total_bytes - info->bytes_used - info->bytes_pinned -
|
|
|
|
info->bytes_reserved - info->bytes_readonly;
|
|
|
|
spin_unlock(&info->lock);
|
|
|
|
|
|
|
|
thresh = get_system_chunk_thresh(root, type);
|
|
|
|
if (left < thresh && btrfs_test_opt(root, ENOSPC_DEBUG)) {
|
|
|
|
printk(KERN_INFO "left=%llu, need=%llu, flags=%llu\n",
|
|
|
|
left, thresh, type);
|
|
|
|
dump_space_info(info, 0, 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (left < thresh) {
|
|
|
|
u64 flags;
|
|
|
|
|
|
|
|
flags = btrfs_get_alloc_profile(root->fs_info->chunk_root, 0);
|
|
|
|
btrfs_alloc_chunk(trans, root, flags);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2008-03-25 03:01:59 +08:00
|
|
|
static int do_chunk_alloc(struct btrfs_trans_handle *trans,
|
2012-09-13 02:08:47 +08:00
|
|
|
struct btrfs_root *extent_root, u64 flags, int force)
|
2009-09-12 04:12:44 +08:00
|
|
|
{
|
2008-03-25 03:01:59 +08:00
|
|
|
struct btrfs_space_info *space_info;
|
2009-04-22 05:40:57 +08:00
|
|
|
struct btrfs_fs_info *fs_info = extent_root->fs_info;
|
2011-04-12 08:20:11 +08:00
|
|
|
int wait_for_alloc = 0;
|
2009-09-12 04:12:44 +08:00
|
|
|
int ret = 0;
|
|
|
|
|
2012-12-18 22:16:16 +08:00
|
|
|
/* Don't re-enter if we're already allocating a chunk */
|
|
|
|
if (trans->allocating_chunk)
|
|
|
|
return -ENOSPC;
|
|
|
|
|
2008-03-25 03:01:59 +08:00
|
|
|
space_info = __find_space_info(extent_root->fs_info, flags);
|
2008-03-26 04:50:33 +08:00
|
|
|
if (!space_info) {
|
|
|
|
ret = update_space_info(extent_root->fs_info, flags,
|
|
|
|
0, 0, &space_info);
|
2012-03-12 23:03:00 +08:00
|
|
|
BUG_ON(ret); /* -ENOMEM */
|
2009-09-12 04:12:44 +08:00
|
|
|
}
|
2012-03-12 23:03:00 +08:00
|
|
|
BUG_ON(!space_info); /* Logic error */
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2011-04-12 08:20:11 +08:00
|
|
|
again:
|
Btrfs: nuke fs wide allocation mutex V2
This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch
of little locks.
There is now a pinned_mutex, which is used when messing with the pinned_extents
extent io tree, and the extent_ins_mutex which is used with the pending_del and
extent_ins extent io trees.
The locking for the extent tree stuff was inspired by a patch that Yan Zheng
wrote to fix a race condition, I cleaned it up some and changed the locking
around a little bit, but the idea remains the same. Basically instead of
holding the extent_ins_mutex throughout the processing of an extent on the
extent_ins or pending_del trees, we just hold it while we're searching and when
we clear the bits on those trees, and lock the extent for the duration of the
operations on the extent.
Also to keep from getting hung up waiting to lock an extent, I've added a
try_lock_extent so if we cannot lock the extent, move on to the next one in the
tree and we'll come back to that one. I have tested this heavily and it does
not appear to break anything. This has to be applied on top of my
find_free_extent redo patch.
I tested this patch on top of Yan's space reblancing code and it worked fine.
The only thing that has changed since the last version is I pulled out all my
debugging stuff, apparently I forgot to run guilt refresh before I sent the
last patch out. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
spin_lock(&space_info->lock);
|
Btrfs: fix enospc error caused by wrong checks of the chunk
When we did sysbench test for inline files, enospc error happened easily though
there was lots of free disk space which could be allocated for new chunks.
Reproduce steps:
# mkfs.btrfs -b $((2 * 1024 * 1024 * 1024)) <test partition>
# mount <test partition> /mnt
# ulimit -n 102400
# cd /mnt
# sysbench --num-threads=1 --test=fileio --file-num=81920 \
> --file-total-size=80M --file-block-size=1K --file-io-mode=sync \
> --file-test-mode=seqwr prepare
# sysbench --num-threads=1 --test=fileio --file-num=81920 \
> --file-total-size=80M --file-block-size=1K --file-io-mode=sync \
> --file-test-mode=seqwr run
<soon later, BUG_ON() was triggered by enospc error>
The reason of this bug is:
Now, we can reserve space which is larger than the free space in the chunks if
we have enough free disk space which can be used for new chunks. By this way,
the space allocator should allocate a new chunk by force if there is no free
space in the free space cache. But there are two wrong checks which break this
operation.
One is
if (ret == -ENOSPC && num_bytes > min_alloc_size)
in btrfs_reserve_extent(), it is wrong, we should try to allocate a new chunk
even we fail to allocate free space by minimum allocable size.
The other is
if (space_info->force_alloc)
force = space_info->force_alloc;
in do_chunk_alloc(). It makes the allocator ignore CHUNK_ALLOC_FORCE If someone
sets ->force_alloc to CHUNK_ALLOC_LIMITED, and makes the enospc error happen.
Fix these two wrong checks. Especially the second one, we fix it by changing
the value of CHUNK_ALLOC_LIMITED and CHUNK_ALLOC_FORCE, and make
CHUNK_ALLOC_FORCE greater than CHUNK_ALLOC_LIMITED since CHUNK_ALLOC_FORCE has
higher priority. And if the value which is passed in by the caller is greater
than ->force_alloc, use the passed value.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-01-27 04:01:12 +08:00
|
|
|
if (force < space_info->force_alloc)
|
2011-04-16 04:05:44 +08:00
|
|
|
force = space_info->force_alloc;
|
Btrfs: nuke fs wide allocation mutex V2
This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch
of little locks.
There is now a pinned_mutex, which is used when messing with the pinned_extents
extent io tree, and the extent_ins_mutex which is used with the pending_del and
extent_ins extent io trees.
The locking for the extent tree stuff was inspired by a patch that Yan Zheng
wrote to fix a race condition, I cleaned it up some and changed the locking
around a little bit, but the idea remains the same. Basically instead of
holding the extent_ins_mutex throughout the processing of an extent on the
extent_ins or pending_del trees, we just hold it while we're searching and when
we clear the bits on those trees, and lock the extent for the duration of the
operations on the extent.
Also to keep from getting hung up waiting to lock an extent, I've added a
try_lock_extent so if we cannot lock the extent, move on to the next one in the
tree and we'll come back to that one. I have tested this heavily and it does
not appear to break anything. This has to be applied on top of my
find_free_extent redo patch.
I tested this patch on top of Yan's space reblancing code and it worked fine.
The only thing that has changed since the last version is I pulled out all my
debugging stuff, apparently I forgot to run guilt refresh before I sent the
last patch out. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
if (space_info->full) {
|
|
|
|
spin_unlock(&space_info->lock);
|
2011-04-12 08:20:11 +08:00
|
|
|
return 0;
|
2009-09-12 04:12:44 +08:00
|
|
|
}
|
|
|
|
|
2012-09-13 02:08:47 +08:00
|
|
|
if (!should_alloc_chunk(extent_root, space_info, force)) {
|
Btrfs: nuke fs wide allocation mutex V2
This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch
of little locks.
There is now a pinned_mutex, which is used when messing with the pinned_extents
extent io tree, and the extent_ins_mutex which is used with the pending_del and
extent_ins extent io trees.
The locking for the extent tree stuff was inspired by a patch that Yan Zheng
wrote to fix a race condition, I cleaned it up some and changed the locking
around a little bit, but the idea remains the same. Basically instead of
holding the extent_ins_mutex throughout the processing of an extent on the
extent_ins or pending_del trees, we just hold it while we're searching and when
we clear the bits on those trees, and lock the extent for the duration of the
operations on the extent.
Also to keep from getting hung up waiting to lock an extent, I've added a
try_lock_extent so if we cannot lock the extent, move on to the next one in the
tree and we'll come back to that one. I have tested this heavily and it does
not appear to break anything. This has to be applied on top of my
find_free_extent redo patch.
I tested this patch on top of Yan's space reblancing code and it worked fine.
The only thing that has changed since the last version is I pulled out all my
debugging stuff, apparently I forgot to run guilt refresh before I sent the
last patch out. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
spin_unlock(&space_info->lock);
|
2011-04-12 08:20:11 +08:00
|
|
|
return 0;
|
|
|
|
} else if (space_info->chunk_alloc) {
|
|
|
|
wait_for_alloc = 1;
|
|
|
|
} else {
|
|
|
|
space_info->chunk_alloc = 1;
|
2009-09-12 04:12:44 +08:00
|
|
|
}
|
2011-04-16 04:05:44 +08:00
|
|
|
|
Btrfs: nuke fs wide allocation mutex V2
This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch
of little locks.
There is now a pinned_mutex, which is used when messing with the pinned_extents
extent io tree, and the extent_ins_mutex which is used with the pending_del and
extent_ins extent io trees.
The locking for the extent tree stuff was inspired by a patch that Yan Zheng
wrote to fix a race condition, I cleaned it up some and changed the locking
around a little bit, but the idea remains the same. Basically instead of
holding the extent_ins_mutex throughout the processing of an extent on the
extent_ins or pending_del trees, we just hold it while we're searching and when
we clear the bits on those trees, and lock the extent for the duration of the
operations on the extent.
Also to keep from getting hung up waiting to lock an extent, I've added a
try_lock_extent so if we cannot lock the extent, move on to the next one in the
tree and we'll come back to that one. I have tested this heavily and it does
not appear to break anything. This has to be applied on top of my
find_free_extent redo patch.
I tested this patch on top of Yan's space reblancing code and it worked fine.
The only thing that has changed since the last version is I pulled out all my
debugging stuff, apparently I forgot to run guilt refresh before I sent the
last patch out. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
spin_unlock(&space_info->lock);
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2011-04-12 08:20:11 +08:00
|
|
|
mutex_lock(&fs_info->chunk_mutex);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The chunk_mutex is held throughout the entirety of a chunk
|
|
|
|
* allocation, so once we've acquired the chunk_mutex we know that the
|
|
|
|
* other guy is done and we need to recheck and see if we should
|
|
|
|
* allocate.
|
|
|
|
*/
|
|
|
|
if (wait_for_alloc) {
|
|
|
|
mutex_unlock(&fs_info->chunk_mutex);
|
|
|
|
wait_for_alloc = 0;
|
|
|
|
goto again;
|
|
|
|
}
|
|
|
|
|
2012-12-18 22:16:16 +08:00
|
|
|
trans->allocating_chunk = true;
|
|
|
|
|
2010-09-17 04:19:09 +08:00
|
|
|
/*
|
|
|
|
* If we have mixed data/metadata chunks we want to make sure we keep
|
|
|
|
* allocating mixed chunks instead of individual chunks.
|
|
|
|
*/
|
|
|
|
if (btrfs_mixed_space_info(space_info))
|
|
|
|
flags |= (BTRFS_BLOCK_GROUP_DATA | BTRFS_BLOCK_GROUP_METADATA);
|
|
|
|
|
2009-04-22 05:40:57 +08:00
|
|
|
/*
|
|
|
|
* if we're doing a data chunk, go ahead and make sure that
|
|
|
|
* we keep a reasonable number of metadata chunks allocated in the
|
|
|
|
* FS as well.
|
|
|
|
*/
|
2009-09-12 04:12:44 +08:00
|
|
|
if (flags & BTRFS_BLOCK_GROUP_DATA && fs_info->metadata_ratio) {
|
2009-04-22 05:40:57 +08:00
|
|
|
fs_info->data_chunk_allocations++;
|
|
|
|
if (!(fs_info->data_chunk_allocations %
|
|
|
|
fs_info->metadata_ratio))
|
|
|
|
force_metadata_allocation(fs_info);
|
2009-09-12 04:12:44 +08:00
|
|
|
}
|
|
|
|
|
2012-03-29 21:57:44 +08:00
|
|
|
/*
|
|
|
|
* Check if we have enough space in SYSTEM chunk because we may need
|
|
|
|
* to update devices.
|
|
|
|
*/
|
|
|
|
check_system_chunk(trans, extent_root, flags);
|
|
|
|
|
2008-11-18 10:11:30 +08:00
|
|
|
ret = btrfs_alloc_chunk(trans, extent_root, flags);
|
2012-12-18 22:16:16 +08:00
|
|
|
trans->allocating_chunk = false;
|
2011-07-13 01:57:59 +08:00
|
|
|
|
2009-09-12 04:12:44 +08:00
|
|
|
spin_lock(&space_info->lock);
|
clear chunk_alloc flag on retryable failure
I've experienced filesystem freezes with permanent spikes in the active
process count for quite a while, particularly on filesystems whose
available raw space has already been fully allocated to chunks.
While looking into this, I found a pretty obvious error in
do_chunk_alloc: it sets space_info->chunk_alloc, but if
btrfs_alloc_chunk returns an error other than ENOSPC, it returns leaving
that flag set, which causes any other threads waiting for
space_info->chunk_alloc to become zero to spin indefinitely.
I haven't double-checked that this patch fixes the failure I've observed
fully (it's not exactly trivial to trigger), but it surely is a bug and
the fix is trivial, so... Please put it in :-)
What I saw in that function also happens to explain why in some cases I
see filesystems allocate a huge number of chunks that remain unused
(leading to the scenario above, of not having more chunks to allocate).
It happens for data and metadata, but not necessarily both. I'm
guessing some thread sets the force_alloc flag on the corresponding
space_info, and then several threads trying to get disk space end up
attempting to allocate a new chunk concurrently. All of them will see
the force_alloc flag and bump their local copy of force up to the level
they see first, and they won't clear it even if another thread succeeds
in allocating a chunk, thus clearing the force flag. Then each thread
that observed the force flag will, on its turn, force the allocation of
a new chunk. And any threads that come in while it does that will see
the force flag still set and pick it up, and so on. This sounds like a
problem to me, but... what should the correct behavior be? Clear
force_flag once we copy it to a local force? Reset force to the
incoming value on every loop? Set the flag to our incoming force if we
have it at first, clear our local flag, and move it from the space_info
when we determined that we are the thread that's going to perform the
allocation?
btrfs: clear chunk_alloc flag on retryable failure
From: Alexandre Oliva <oliva@gnu.org>
If btrfs_alloc_chunk fails with e.g. ENOMEM, we exit do_chunk_alloc
without clearing chunk_alloc in space_info. As a result, any further
calls to do_chunk_alloc on that filesystem will start busy-waiting for
chunk_alloc to be cleared, but it never will be. This patch adjusts
do_chunk_alloc so that it clears this flag in case of an error.
Signed-off-by: Alexandre Oliva <oliva@gnu.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-22 05:15:14 +08:00
|
|
|
if (ret < 0 && ret != -ENOSPC)
|
|
|
|
goto out;
|
2009-09-12 04:12:44 +08:00
|
|
|
if (ret)
|
2008-03-25 03:01:59 +08:00
|
|
|
space_info->full = 1;
|
2010-05-16 22:46:25 +08:00
|
|
|
else
|
|
|
|
ret = 1;
|
2011-04-12 08:20:11 +08:00
|
|
|
|
2011-04-16 04:05:44 +08:00
|
|
|
space_info->force_alloc = CHUNK_ALLOC_NO_FORCE;
|
clear chunk_alloc flag on retryable failure
I've experienced filesystem freezes with permanent spikes in the active
process count for quite a while, particularly on filesystems whose
available raw space has already been fully allocated to chunks.
While looking into this, I found a pretty obvious error in
do_chunk_alloc: it sets space_info->chunk_alloc, but if
btrfs_alloc_chunk returns an error other than ENOSPC, it returns leaving
that flag set, which causes any other threads waiting for
space_info->chunk_alloc to become zero to spin indefinitely.
I haven't double-checked that this patch fixes the failure I've observed
fully (it's not exactly trivial to trigger), but it surely is a bug and
the fix is trivial, so... Please put it in :-)
What I saw in that function also happens to explain why in some cases I
see filesystems allocate a huge number of chunks that remain unused
(leading to the scenario above, of not having more chunks to allocate).
It happens for data and metadata, but not necessarily both. I'm
guessing some thread sets the force_alloc flag on the corresponding
space_info, and then several threads trying to get disk space end up
attempting to allocate a new chunk concurrently. All of them will see
the force_alloc flag and bump their local copy of force up to the level
they see first, and they won't clear it even if another thread succeeds
in allocating a chunk, thus clearing the force flag. Then each thread
that observed the force flag will, on its turn, force the allocation of
a new chunk. And any threads that come in while it does that will see
the force flag still set and pick it up, and so on. This sounds like a
problem to me, but... what should the correct behavior be? Clear
force_flag once we copy it to a local force? Reset force to the
incoming value on every loop? Set the flag to our incoming force if we
have it at first, clear our local flag, and move it from the space_info
when we determined that we are the thread that's going to perform the
allocation?
btrfs: clear chunk_alloc flag on retryable failure
From: Alexandre Oliva <oliva@gnu.org>
If btrfs_alloc_chunk fails with e.g. ENOMEM, we exit do_chunk_alloc
without clearing chunk_alloc in space_info. As a result, any further
calls to do_chunk_alloc on that filesystem will start busy-waiting for
chunk_alloc to be cleared, but it never will be. This patch adjusts
do_chunk_alloc so that it clears this flag in case of an error.
Signed-off-by: Alexandre Oliva <oliva@gnu.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-22 05:15:14 +08:00
|
|
|
out:
|
2011-04-12 08:20:11 +08:00
|
|
|
space_info->chunk_alloc = 0;
|
2009-09-12 04:12:44 +08:00
|
|
|
spin_unlock(&space_info->lock);
|
2012-04-18 14:59:29 +08:00
|
|
|
mutex_unlock(&fs_info->chunk_mutex);
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
return ret;
|
2008-03-25 03:01:59 +08:00
|
|
|
}
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2012-09-07 04:59:33 +08:00
|
|
|
static int can_overcommit(struct btrfs_root *root,
|
|
|
|
struct btrfs_space_info *space_info, u64 bytes,
|
Btrfs: improve the noflush reservation
In some places(such as: evicting inode), we just can not flush the reserved
space of delalloc, flushing the delayed directory index and delayed inode
is OK, but we don't try to flush those things and just go back when there is
no enough space to be reserved. This patch fixes this problem.
We defined 3 types of the flush operations: NO_FLUSH, FLUSH_LIMIT and FLUSH_ALL.
If we can in the transaction, we should not flush anything, or the deadlock
would happen, so use NO_FLUSH. If we flushing the reserved space of delalloc
would cause deadlock, use FLUSH_LIMIT. In the other cases, FLUSH_ALL is used,
and we will flush all things.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-10-16 19:33:38 +08:00
|
|
|
enum btrfs_reserve_flush_enum flush)
|
2012-09-07 04:59:33 +08:00
|
|
|
{
|
2013-01-31 06:02:51 +08:00
|
|
|
struct btrfs_block_rsv *global_rsv = &root->fs_info->global_block_rsv;
|
2012-09-07 04:59:33 +08:00
|
|
|
u64 profile = btrfs_get_alloc_profile(root, 0);
|
2013-01-31 06:02:51 +08:00
|
|
|
u64 rsv_size = 0;
|
2012-09-07 04:59:33 +08:00
|
|
|
u64 avail;
|
|
|
|
u64 used;
|
2013-02-07 02:53:19 +08:00
|
|
|
u64 to_add;
|
2012-09-07 04:59:33 +08:00
|
|
|
|
|
|
|
used = space_info->bytes_used + space_info->bytes_reserved +
|
2013-01-31 06:02:51 +08:00
|
|
|
space_info->bytes_pinned + space_info->bytes_readonly;
|
|
|
|
|
|
|
|
spin_lock(&global_rsv->lock);
|
|
|
|
rsv_size = global_rsv->size;
|
|
|
|
spin_unlock(&global_rsv->lock);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We only want to allow over committing if we have lots of actual space
|
|
|
|
* free, but if we don't have enough space to handle the global reserve
|
|
|
|
* space then we could end up having a real enospc problem when trying
|
|
|
|
* to allocate a chunk or some other such important allocation.
|
|
|
|
*/
|
|
|
|
rsv_size <<= 1;
|
|
|
|
if (used + rsv_size >= space_info->total_bytes)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
used += space_info->bytes_may_use;
|
2012-09-07 04:59:33 +08:00
|
|
|
|
|
|
|
spin_lock(&root->fs_info->free_chunk_lock);
|
|
|
|
avail = root->fs_info->free_chunk_space;
|
|
|
|
spin_unlock(&root->fs_info->free_chunk_lock);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If we have dup, raid1 or raid10 then only half of the free
|
2013-01-30 07:40:14 +08:00
|
|
|
* space is actually useable. For raid56, the space info used
|
|
|
|
* doesn't include the parity drive, so we don't have to
|
|
|
|
* change the math
|
2012-09-07 04:59:33 +08:00
|
|
|
*/
|
|
|
|
if (profile & (BTRFS_BLOCK_GROUP_DUP |
|
|
|
|
BTRFS_BLOCK_GROUP_RAID1 |
|
|
|
|
BTRFS_BLOCK_GROUP_RAID10))
|
|
|
|
avail >>= 1;
|
|
|
|
|
2013-02-07 02:53:19 +08:00
|
|
|
to_add = space_info->total_bytes;
|
|
|
|
|
2012-09-07 04:59:33 +08:00
|
|
|
/*
|
2012-10-16 19:32:18 +08:00
|
|
|
* If we aren't flushing all things, let us overcommit up to
|
|
|
|
* 1/2th of the space. If we can flush, don't let us overcommit
|
|
|
|
* too much, let it overcommit up to 1/8 of the space.
|
2012-09-07 04:59:33 +08:00
|
|
|
*/
|
Btrfs: improve the noflush reservation
In some places(such as: evicting inode), we just can not flush the reserved
space of delalloc, flushing the delayed directory index and delayed inode
is OK, but we don't try to flush those things and just go back when there is
no enough space to be reserved. This patch fixes this problem.
We defined 3 types of the flush operations: NO_FLUSH, FLUSH_LIMIT and FLUSH_ALL.
If we can in the transaction, we should not flush anything, or the deadlock
would happen, so use NO_FLUSH. If we flushing the reserved space of delalloc
would cause deadlock, use FLUSH_LIMIT. In the other cases, FLUSH_ALL is used,
and we will flush all things.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-10-16 19:33:38 +08:00
|
|
|
if (flush == BTRFS_RESERVE_FLUSH_ALL)
|
2013-02-07 02:53:19 +08:00
|
|
|
to_add >>= 3;
|
2012-09-07 04:59:33 +08:00
|
|
|
else
|
2013-02-07 02:53:19 +08:00
|
|
|
to_add >>= 1;
|
2012-09-07 04:59:33 +08:00
|
|
|
|
2013-02-07 02:53:19 +08:00
|
|
|
/*
|
|
|
|
* Limit the overcommit to the amount of free space we could possibly
|
|
|
|
* allocate for chunks.
|
|
|
|
*/
|
|
|
|
to_add = min(avail, to_add);
|
2012-09-07 04:59:33 +08:00
|
|
|
|
2013-02-07 02:53:19 +08:00
|
|
|
if (used + bytes < space_info->total_bytes + to_add)
|
2012-09-07 04:59:33 +08:00
|
|
|
return 1;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2012-12-20 19:19:09 +08:00
|
|
|
static inline int writeback_inodes_sb_nr_if_idle_safe(struct super_block *sb,
|
|
|
|
unsigned long nr_pages,
|
|
|
|
enum wb_reason reason)
|
2012-12-15 02:46:43 +08:00
|
|
|
{
|
2012-12-20 19:19:09 +08:00
|
|
|
/* the flusher is dealing with the dirty inodes now. */
|
|
|
|
if (writeback_in_progress(sb->s_bdi))
|
|
|
|
return 1;
|
|
|
|
|
|
|
|
if (down_read_trylock(&sb->s_umount)) {
|
2012-12-15 02:46:43 +08:00
|
|
|
writeback_inodes_sb_nr(sb, nr_pages, reason);
|
|
|
|
up_read(&sb->s_umount);
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2012-12-20 19:19:09 +08:00
|
|
|
void btrfs_writeback_inodes_sb_nr(struct btrfs_root *root,
|
|
|
|
unsigned long nr_pages)
|
|
|
|
{
|
|
|
|
struct super_block *sb = root->fs_info->sb;
|
|
|
|
int started;
|
|
|
|
|
|
|
|
/* If we can not start writeback, just sync all the delalloc file. */
|
|
|
|
started = writeback_inodes_sb_nr_if_idle_safe(sb, nr_pages,
|
|
|
|
WB_REASON_FS_FREE_SPACE);
|
|
|
|
if (!started) {
|
|
|
|
/*
|
|
|
|
* We needn't worry the filesystem going from r/w to r/o though
|
|
|
|
* we don't acquire ->s_umount mutex, because the filesystem
|
|
|
|
* should guarantee the delalloc inodes list be empty after
|
|
|
|
* the filesystem is readonly(all dirty pages are written to
|
|
|
|
* the disk).
|
|
|
|
*/
|
|
|
|
btrfs_start_delalloc_inodes(root, 0);
|
|
|
|
btrfs_wait_ordered_extents(root, 0);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2009-09-12 04:12:44 +08:00
|
|
|
/*
|
2010-05-16 22:46:25 +08:00
|
|
|
* shrink metadata reservation for delalloc
|
2009-09-12 04:12:44 +08:00
|
|
|
*/
|
2012-07-03 05:10:51 +08:00
|
|
|
static void shrink_delalloc(struct btrfs_root *root, u64 to_reclaim, u64 orig,
|
|
|
|
bool wait_ordered)
|
2010-05-16 22:46:25 +08:00
|
|
|
{
|
2010-05-16 22:48:47 +08:00
|
|
|
struct btrfs_block_rsv *block_rsv;
|
2010-10-16 03:18:40 +08:00
|
|
|
struct btrfs_space_info *space_info;
|
2011-11-04 10:54:25 +08:00
|
|
|
struct btrfs_trans_handle *trans;
|
2012-07-03 05:10:51 +08:00
|
|
|
u64 delalloc_bytes;
|
2010-05-16 22:46:25 +08:00
|
|
|
u64 max_reclaim;
|
2011-01-22 05:10:01 +08:00
|
|
|
long time_left;
|
2011-10-15 02:02:10 +08:00
|
|
|
unsigned long nr_pages = (2 * 1024 * 1024) >> PAGE_CACHE_SHIFT;
|
2011-01-22 05:10:01 +08:00
|
|
|
int loops = 0;
|
Btrfs: improve the noflush reservation
In some places(such as: evicting inode), we just can not flush the reserved
space of delalloc, flushing the delayed directory index and delayed inode
is OK, but we don't try to flush those things and just go back when there is
no enough space to be reserved. This patch fixes this problem.
We defined 3 types of the flush operations: NO_FLUSH, FLUSH_LIMIT and FLUSH_ALL.
If we can in the transaction, we should not flush anything, or the deadlock
would happen, so use NO_FLUSH. If we flushing the reserved space of delalloc
would cause deadlock, use FLUSH_LIMIT. In the other cases, FLUSH_ALL is used,
and we will flush all things.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-10-16 19:33:38 +08:00
|
|
|
enum btrfs_reserve_flush_enum flush;
|
2010-05-16 22:46:25 +08:00
|
|
|
|
2011-11-04 10:54:25 +08:00
|
|
|
trans = (struct btrfs_trans_handle *)current->journal_info;
|
2010-05-16 22:48:47 +08:00
|
|
|
block_rsv = &root->fs_info->delalloc_block_rsv;
|
2010-10-16 03:18:40 +08:00
|
|
|
space_info = block_rsv->space_info;
|
2010-10-27 01:40:45 +08:00
|
|
|
|
|
|
|
smp_mb();
|
2013-01-29 18:10:51 +08:00
|
|
|
delalloc_bytes = percpu_counter_sum_positive(
|
|
|
|
&root->fs_info->delalloc_bytes);
|
2012-07-03 05:10:51 +08:00
|
|
|
if (delalloc_bytes == 0) {
|
2011-06-08 04:07:44 +08:00
|
|
|
if (trans)
|
2012-07-03 05:10:51 +08:00
|
|
|
return;
|
2012-09-14 16:58:07 +08:00
|
|
|
btrfs_wait_ordered_extents(root, 0);
|
2012-07-03 05:10:51 +08:00
|
|
|
return;
|
2011-06-08 04:07:44 +08:00
|
|
|
}
|
|
|
|
|
2012-07-03 05:10:51 +08:00
|
|
|
while (delalloc_bytes && loops < 3) {
|
|
|
|
max_reclaim = min(delalloc_bytes, to_reclaim);
|
|
|
|
nr_pages = max_reclaim >> PAGE_CACHE_SHIFT;
|
2012-12-20 19:19:09 +08:00
|
|
|
btrfs_writeback_inodes_sb_nr(root, nr_pages);
|
2012-09-07 04:47:00 +08:00
|
|
|
/*
|
|
|
|
* We need to wait for the async pages to actually start before
|
|
|
|
* we do anything.
|
|
|
|
*/
|
|
|
|
wait_event(root->fs_info->async_submit_wait,
|
|
|
|
!atomic_read(&root->fs_info->async_delalloc_pages));
|
|
|
|
|
Btrfs: improve the noflush reservation
In some places(such as: evicting inode), we just can not flush the reserved
space of delalloc, flushing the delayed directory index and delayed inode
is OK, but we don't try to flush those things and just go back when there is
no enough space to be reserved. This patch fixes this problem.
We defined 3 types of the flush operations: NO_FLUSH, FLUSH_LIMIT and FLUSH_ALL.
If we can in the transaction, we should not flush anything, or the deadlock
would happen, so use NO_FLUSH. If we flushing the reserved space of delalloc
would cause deadlock, use FLUSH_LIMIT. In the other cases, FLUSH_ALL is used,
and we will flush all things.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-10-16 19:33:38 +08:00
|
|
|
if (!trans)
|
|
|
|
flush = BTRFS_RESERVE_FLUSH_ALL;
|
|
|
|
else
|
|
|
|
flush = BTRFS_RESERVE_NO_FLUSH;
|
2010-10-16 03:18:40 +08:00
|
|
|
spin_lock(&space_info->lock);
|
Btrfs: improve the noflush reservation
In some places(such as: evicting inode), we just can not flush the reserved
space of delalloc, flushing the delayed directory index and delayed inode
is OK, but we don't try to flush those things and just go back when there is
no enough space to be reserved. This patch fixes this problem.
We defined 3 types of the flush operations: NO_FLUSH, FLUSH_LIMIT and FLUSH_ALL.
If we can in the transaction, we should not flush anything, or the deadlock
would happen, so use NO_FLUSH. If we flushing the reserved space of delalloc
would cause deadlock, use FLUSH_LIMIT. In the other cases, FLUSH_ALL is used,
and we will flush all things.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-10-16 19:33:38 +08:00
|
|
|
if (can_overcommit(root, space_info, orig, flush)) {
|
2012-07-03 05:10:51 +08:00
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
break;
|
|
|
|
}
|
2010-10-16 03:18:40 +08:00
|
|
|
spin_unlock(&space_info->lock);
|
2010-05-16 22:46:25 +08:00
|
|
|
|
2011-03-12 20:08:42 +08:00
|
|
|
loops++;
|
2011-10-15 01:56:58 +08:00
|
|
|
if (wait_ordered && !trans) {
|
2012-09-14 16:58:07 +08:00
|
|
|
btrfs_wait_ordered_extents(root, 0);
|
2011-10-15 01:56:58 +08:00
|
|
|
} else {
|
2012-07-03 05:10:51 +08:00
|
|
|
time_left = schedule_timeout_killable(1);
|
2011-10-15 01:56:58 +08:00
|
|
|
if (time_left)
|
|
|
|
break;
|
|
|
|
}
|
2012-07-03 05:10:51 +08:00
|
|
|
smp_mb();
|
2013-01-29 18:10:51 +08:00
|
|
|
delalloc_bytes = percpu_counter_sum_positive(
|
|
|
|
&root->fs_info->delalloc_bytes);
|
2010-05-16 22:46:25 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2011-11-04 10:54:25 +08:00
|
|
|
/**
|
|
|
|
* maybe_commit_transaction - possibly commit the transaction if its ok to
|
|
|
|
* @root - the root we're allocating for
|
|
|
|
* @bytes - the number of bytes we want to reserve
|
|
|
|
* @force - force the commit
|
2010-10-16 04:52:49 +08:00
|
|
|
*
|
2011-11-04 10:54:25 +08:00
|
|
|
* This will check to make sure that committing the transaction will actually
|
|
|
|
* get us somewhere and then commit the transaction if it does. Otherwise it
|
|
|
|
* will return -ENOSPC.
|
2010-10-16 04:52:49 +08:00
|
|
|
*/
|
2011-11-04 10:54:25 +08:00
|
|
|
static int may_commit_transaction(struct btrfs_root *root,
|
|
|
|
struct btrfs_space_info *space_info,
|
|
|
|
u64 bytes, int force)
|
|
|
|
{
|
|
|
|
struct btrfs_block_rsv *delayed_rsv = &root->fs_info->delayed_block_rsv;
|
|
|
|
struct btrfs_trans_handle *trans;
|
|
|
|
|
|
|
|
trans = (struct btrfs_trans_handle *)current->journal_info;
|
|
|
|
if (trans)
|
|
|
|
return -EAGAIN;
|
|
|
|
|
|
|
|
if (force)
|
|
|
|
goto commit;
|
|
|
|
|
|
|
|
/* See if there is enough pinned space to make this reservation */
|
|
|
|
spin_lock(&space_info->lock);
|
|
|
|
if (space_info->bytes_pinned >= bytes) {
|
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
goto commit;
|
|
|
|
}
|
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* See if there is some space in the delayed insertion reservation for
|
|
|
|
* this reservation.
|
|
|
|
*/
|
|
|
|
if (space_info != delayed_rsv->space_info)
|
|
|
|
return -ENOSPC;
|
|
|
|
|
2012-02-16 18:34:39 +08:00
|
|
|
spin_lock(&space_info->lock);
|
2011-11-04 10:54:25 +08:00
|
|
|
spin_lock(&delayed_rsv->lock);
|
2012-02-16 18:34:39 +08:00
|
|
|
if (space_info->bytes_pinned + delayed_rsv->size < bytes) {
|
2011-11-04 10:54:25 +08:00
|
|
|
spin_unlock(&delayed_rsv->lock);
|
2012-02-16 18:34:39 +08:00
|
|
|
spin_unlock(&space_info->lock);
|
2011-11-04 10:54:25 +08:00
|
|
|
return -ENOSPC;
|
|
|
|
}
|
|
|
|
spin_unlock(&delayed_rsv->lock);
|
2012-02-16 18:34:39 +08:00
|
|
|
spin_unlock(&space_info->lock);
|
2011-11-04 10:54:25 +08:00
|
|
|
|
|
|
|
commit:
|
|
|
|
trans = btrfs_join_transaction(root);
|
|
|
|
if (IS_ERR(trans))
|
|
|
|
return -ENOSPC;
|
|
|
|
|
|
|
|
return btrfs_commit_transaction(trans, root);
|
|
|
|
}
|
|
|
|
|
2012-06-22 02:05:49 +08:00
|
|
|
enum flush_state {
|
2012-09-25 01:42:00 +08:00
|
|
|
FLUSH_DELAYED_ITEMS_NR = 1,
|
|
|
|
FLUSH_DELAYED_ITEMS = 2,
|
|
|
|
FLUSH_DELALLOC = 3,
|
|
|
|
FLUSH_DELALLOC_WAIT = 4,
|
2012-09-12 04:57:25 +08:00
|
|
|
ALLOC_CHUNK = 5,
|
|
|
|
COMMIT_TRANS = 6,
|
2012-06-22 02:05:49 +08:00
|
|
|
};
|
|
|
|
|
|
|
|
static int flush_space(struct btrfs_root *root,
|
|
|
|
struct btrfs_space_info *space_info, u64 num_bytes,
|
|
|
|
u64 orig_bytes, int state)
|
|
|
|
{
|
|
|
|
struct btrfs_trans_handle *trans;
|
|
|
|
int nr;
|
2012-07-03 05:10:51 +08:00
|
|
|
int ret = 0;
|
2012-06-22 02:05:49 +08:00
|
|
|
|
|
|
|
switch (state) {
|
|
|
|
case FLUSH_DELAYED_ITEMS_NR:
|
|
|
|
case FLUSH_DELAYED_ITEMS:
|
|
|
|
if (state == FLUSH_DELAYED_ITEMS_NR) {
|
|
|
|
u64 bytes = btrfs_calc_trans_metadata_size(root, 1);
|
|
|
|
|
|
|
|
nr = (int)div64_u64(num_bytes, bytes);
|
|
|
|
if (!nr)
|
|
|
|
nr = 1;
|
|
|
|
nr *= 2;
|
|
|
|
} else {
|
|
|
|
nr = -1;
|
|
|
|
}
|
|
|
|
trans = btrfs_join_transaction(root);
|
|
|
|
if (IS_ERR(trans)) {
|
|
|
|
ret = PTR_ERR(trans);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
ret = btrfs_run_delayed_items_nr(trans, root, nr);
|
|
|
|
btrfs_end_transaction(trans, root);
|
|
|
|
break;
|
2012-09-25 01:42:00 +08:00
|
|
|
case FLUSH_DELALLOC:
|
|
|
|
case FLUSH_DELALLOC_WAIT:
|
|
|
|
shrink_delalloc(root, num_bytes, orig_bytes,
|
|
|
|
state == FLUSH_DELALLOC_WAIT);
|
|
|
|
break;
|
2012-09-12 04:57:25 +08:00
|
|
|
case ALLOC_CHUNK:
|
|
|
|
trans = btrfs_join_transaction(root);
|
|
|
|
if (IS_ERR(trans)) {
|
|
|
|
ret = PTR_ERR(trans);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
ret = do_chunk_alloc(trans, root->fs_info->extent_root,
|
|
|
|
btrfs_get_alloc_profile(root, 0),
|
|
|
|
CHUNK_ALLOC_NO_FORCE);
|
|
|
|
btrfs_end_transaction(trans, root);
|
|
|
|
if (ret == -ENOSPC)
|
|
|
|
ret = 0;
|
|
|
|
break;
|
2012-06-22 02:05:49 +08:00
|
|
|
case COMMIT_TRANS:
|
|
|
|
ret = may_commit_transaction(root, space_info, orig_bytes, 0);
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
ret = -ENOSPC;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
2011-08-31 00:34:28 +08:00
|
|
|
/**
|
|
|
|
* reserve_metadata_bytes - try to reserve bytes from the block_rsv's space
|
|
|
|
* @root - the root we're allocating for
|
|
|
|
* @block_rsv - the block_rsv we're allocating for
|
|
|
|
* @orig_bytes - the number of bytes we want
|
2012-09-20 09:48:00 +08:00
|
|
|
* @flush - whether or not we can flush to make our reservation
|
2010-10-16 04:52:49 +08:00
|
|
|
*
|
2011-08-31 00:34:28 +08:00
|
|
|
* This will reserve orgi_bytes number of bytes from the space info associated
|
|
|
|
* with the block_rsv. If there is not enough space it will make an attempt to
|
|
|
|
* flush out space to make room. It will do this by flushing delalloc if
|
|
|
|
* possible or committing the transaction. If flush is 0 then no attempts to
|
|
|
|
* regain reservations will be made and this will fail if there is not enough
|
|
|
|
* space already.
|
2010-10-16 04:52:49 +08:00
|
|
|
*/
|
2011-08-31 00:34:28 +08:00
|
|
|
static int reserve_metadata_bytes(struct btrfs_root *root,
|
2010-10-16 04:52:49 +08:00
|
|
|
struct btrfs_block_rsv *block_rsv,
|
Btrfs: improve the noflush reservation
In some places(such as: evicting inode), we just can not flush the reserved
space of delalloc, flushing the delayed directory index and delayed inode
is OK, but we don't try to flush those things and just go back when there is
no enough space to be reserved. This patch fixes this problem.
We defined 3 types of the flush operations: NO_FLUSH, FLUSH_LIMIT and FLUSH_ALL.
If we can in the transaction, we should not flush anything, or the deadlock
would happen, so use NO_FLUSH. If we flushing the reserved space of delalloc
would cause deadlock, use FLUSH_LIMIT. In the other cases, FLUSH_ALL is used,
and we will flush all things.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-10-16 19:33:38 +08:00
|
|
|
u64 orig_bytes,
|
|
|
|
enum btrfs_reserve_flush_enum flush)
|
2009-09-12 04:12:44 +08:00
|
|
|
{
|
2010-05-16 22:46:25 +08:00
|
|
|
struct btrfs_space_info *space_info = block_rsv->space_info;
|
2011-09-27 05:12:22 +08:00
|
|
|
u64 used;
|
2010-10-16 04:52:49 +08:00
|
|
|
u64 num_bytes = orig_bytes;
|
2012-09-25 01:42:00 +08:00
|
|
|
int flush_state = FLUSH_DELAYED_ITEMS_NR;
|
2010-10-16 04:52:49 +08:00
|
|
|
int ret = 0;
|
2011-06-08 04:07:44 +08:00
|
|
|
bool flushing = false;
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2010-10-16 04:52:49 +08:00
|
|
|
again:
|
2011-06-08 04:07:44 +08:00
|
|
|
ret = 0;
|
2010-10-16 04:52:49 +08:00
|
|
|
spin_lock(&space_info->lock);
|
2011-06-08 04:07:44 +08:00
|
|
|
/*
|
Btrfs: improve the noflush reservation
In some places(such as: evicting inode), we just can not flush the reserved
space of delalloc, flushing the delayed directory index and delayed inode
is OK, but we don't try to flush those things and just go back when there is
no enough space to be reserved. This patch fixes this problem.
We defined 3 types of the flush operations: NO_FLUSH, FLUSH_LIMIT and FLUSH_ALL.
If we can in the transaction, we should not flush anything, or the deadlock
would happen, so use NO_FLUSH. If we flushing the reserved space of delalloc
would cause deadlock, use FLUSH_LIMIT. In the other cases, FLUSH_ALL is used,
and we will flush all things.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-10-16 19:33:38 +08:00
|
|
|
* We only want to wait if somebody other than us is flushing and we
|
|
|
|
* are actually allowed to flush all things.
|
2011-06-08 04:07:44 +08:00
|
|
|
*/
|
Btrfs: improve the noflush reservation
In some places(such as: evicting inode), we just can not flush the reserved
space of delalloc, flushing the delayed directory index and delayed inode
is OK, but we don't try to flush those things and just go back when there is
no enough space to be reserved. This patch fixes this problem.
We defined 3 types of the flush operations: NO_FLUSH, FLUSH_LIMIT and FLUSH_ALL.
If we can in the transaction, we should not flush anything, or the deadlock
would happen, so use NO_FLUSH. If we flushing the reserved space of delalloc
would cause deadlock, use FLUSH_LIMIT. In the other cases, FLUSH_ALL is used,
and we will flush all things.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-10-16 19:33:38 +08:00
|
|
|
while (flush == BTRFS_RESERVE_FLUSH_ALL && !flushing &&
|
|
|
|
space_info->flush) {
|
2011-06-08 04:07:44 +08:00
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
/*
|
|
|
|
* If we have a trans handle we can't wait because the flusher
|
|
|
|
* may have to commit the transaction, which would mean we would
|
|
|
|
* deadlock since we are waiting for the flusher to finish, but
|
|
|
|
* hold the current transaction open.
|
|
|
|
*/
|
2011-11-04 10:54:25 +08:00
|
|
|
if (current->journal_info)
|
2011-06-08 04:07:44 +08:00
|
|
|
return -EAGAIN;
|
2012-04-18 16:27:16 +08:00
|
|
|
ret = wait_event_killable(space_info->wait, !space_info->flush);
|
|
|
|
/* Must have been killed, return */
|
|
|
|
if (ret)
|
2011-06-08 04:07:44 +08:00
|
|
|
return -EINTR;
|
|
|
|
|
|
|
|
spin_lock(&space_info->lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = -ENOSPC;
|
2011-09-27 05:12:22 +08:00
|
|
|
used = space_info->bytes_used + space_info->bytes_reserved +
|
|
|
|
space_info->bytes_pinned + space_info->bytes_readonly +
|
|
|
|
space_info->bytes_may_use;
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2010-10-16 04:52:49 +08:00
|
|
|
/*
|
|
|
|
* The idea here is that we've not already over-reserved the block group
|
|
|
|
* then we can go ahead and save our reservation first and then start
|
|
|
|
* flushing if we need to. Otherwise if we've already overcommitted
|
|
|
|
* lets start flushing stuff first and then come back and try to make
|
|
|
|
* our reservation.
|
|
|
|
*/
|
2011-09-27 05:12:22 +08:00
|
|
|
if (used <= space_info->total_bytes) {
|
|
|
|
if (used + orig_bytes <= space_info->total_bytes) {
|
2011-07-27 05:00:46 +08:00
|
|
|
space_info->bytes_may_use += orig_bytes;
|
2012-01-10 23:31:31 +08:00
|
|
|
trace_btrfs_space_reservation(root->fs_info,
|
2012-03-29 21:57:44 +08:00
|
|
|
"space_info", space_info->flags, orig_bytes, 1);
|
2010-10-16 04:52:49 +08:00
|
|
|
ret = 0;
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* Ok set num_bytes to orig_bytes since we aren't
|
|
|
|
* overocmmitted, this way we only try and reclaim what
|
|
|
|
* we need.
|
|
|
|
*/
|
|
|
|
num_bytes = orig_bytes;
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* Ok we're over committed, set num_bytes to the overcommitted
|
|
|
|
* amount plus the amount of bytes that we need for this
|
|
|
|
* reservation.
|
|
|
|
*/
|
2011-09-27 05:12:22 +08:00
|
|
|
num_bytes = used - space_info->total_bytes +
|
2012-06-22 02:05:49 +08:00
|
|
|
(orig_bytes * 2);
|
2010-10-16 04:52:49 +08:00
|
|
|
}
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2012-09-29 04:04:19 +08:00
|
|
|
if (ret && can_overcommit(root, space_info, orig_bytes, flush)) {
|
|
|
|
space_info->bytes_may_use += orig_bytes;
|
|
|
|
trace_btrfs_space_reservation(root->fs_info, "space_info",
|
|
|
|
space_info->flags, orig_bytes,
|
|
|
|
1);
|
|
|
|
ret = 0;
|
2011-09-27 05:12:22 +08:00
|
|
|
}
|
|
|
|
|
2010-10-16 04:52:49 +08:00
|
|
|
/*
|
|
|
|
* Couldn't make our reservation, save our place so while we're trying
|
|
|
|
* to reclaim space we can actually use it instead of somebody else
|
|
|
|
* stealing it from us.
|
Btrfs: improve the noflush reservation
In some places(such as: evicting inode), we just can not flush the reserved
space of delalloc, flushing the delayed directory index and delayed inode
is OK, but we don't try to flush those things and just go back when there is
no enough space to be reserved. This patch fixes this problem.
We defined 3 types of the flush operations: NO_FLUSH, FLUSH_LIMIT and FLUSH_ALL.
If we can in the transaction, we should not flush anything, or the deadlock
would happen, so use NO_FLUSH. If we flushing the reserved space of delalloc
would cause deadlock, use FLUSH_LIMIT. In the other cases, FLUSH_ALL is used,
and we will flush all things.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-10-16 19:33:38 +08:00
|
|
|
*
|
|
|
|
* We make the other tasks wait for the flush only when we can flush
|
|
|
|
* all things.
|
2010-10-16 04:52:49 +08:00
|
|
|
*/
|
2012-12-19 04:16:34 +08:00
|
|
|
if (ret && flush != BTRFS_RESERVE_NO_FLUSH) {
|
2011-06-08 04:07:44 +08:00
|
|
|
flushing = true;
|
|
|
|
space_info->flush = 1;
|
2010-10-16 04:52:49 +08:00
|
|
|
}
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
spin_unlock(&space_info->lock);
|
2009-09-12 04:12:44 +08:00
|
|
|
|
Btrfs: improve the noflush reservation
In some places(such as: evicting inode), we just can not flush the reserved
space of delalloc, flushing the delayed directory index and delayed inode
is OK, but we don't try to flush those things and just go back when there is
no enough space to be reserved. This patch fixes this problem.
We defined 3 types of the flush operations: NO_FLUSH, FLUSH_LIMIT and FLUSH_ALL.
If we can in the transaction, we should not flush anything, or the deadlock
would happen, so use NO_FLUSH. If we flushing the reserved space of delalloc
would cause deadlock, use FLUSH_LIMIT. In the other cases, FLUSH_ALL is used,
and we will flush all things.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-10-16 19:33:38 +08:00
|
|
|
if (!ret || flush == BTRFS_RESERVE_NO_FLUSH)
|
2010-10-16 04:52:49 +08:00
|
|
|
goto out;
|
2010-05-16 22:46:25 +08:00
|
|
|
|
2012-06-22 02:05:49 +08:00
|
|
|
ret = flush_space(root, space_info, num_bytes, orig_bytes,
|
|
|
|
flush_state);
|
|
|
|
flush_state++;
|
Btrfs: improve the noflush reservation
In some places(such as: evicting inode), we just can not flush the reserved
space of delalloc, flushing the delayed directory index and delayed inode
is OK, but we don't try to flush those things and just go back when there is
no enough space to be reserved. This patch fixes this problem.
We defined 3 types of the flush operations: NO_FLUSH, FLUSH_LIMIT and FLUSH_ALL.
If we can in the transaction, we should not flush anything, or the deadlock
would happen, so use NO_FLUSH. If we flushing the reserved space of delalloc
would cause deadlock, use FLUSH_LIMIT. In the other cases, FLUSH_ALL is used,
and we will flush all things.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-10-16 19:33:38 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If we are FLUSH_LIMIT, we can not flush delalloc, or the deadlock
|
|
|
|
* would happen. So skip delalloc flush.
|
|
|
|
*/
|
|
|
|
if (flush == BTRFS_RESERVE_FLUSH_LIMIT &&
|
|
|
|
(flush_state == FLUSH_DELALLOC ||
|
|
|
|
flush_state == FLUSH_DELALLOC_WAIT))
|
|
|
|
flush_state = ALLOC_CHUNK;
|
|
|
|
|
2012-06-22 02:05:49 +08:00
|
|
|
if (!ret)
|
2010-10-16 04:52:49 +08:00
|
|
|
goto again;
|
Btrfs: improve the noflush reservation
In some places(such as: evicting inode), we just can not flush the reserved
space of delalloc, flushing the delayed directory index and delayed inode
is OK, but we don't try to flush those things and just go back when there is
no enough space to be reserved. This patch fixes this problem.
We defined 3 types of the flush operations: NO_FLUSH, FLUSH_LIMIT and FLUSH_ALL.
If we can in the transaction, we should not flush anything, or the deadlock
would happen, so use NO_FLUSH. If we flushing the reserved space of delalloc
would cause deadlock, use FLUSH_LIMIT. In the other cases, FLUSH_ALL is used,
and we will flush all things.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-10-16 19:33:38 +08:00
|
|
|
else if (flush == BTRFS_RESERVE_FLUSH_LIMIT &&
|
|
|
|
flush_state < COMMIT_TRANS)
|
|
|
|
goto again;
|
|
|
|
else if (flush == BTRFS_RESERVE_FLUSH_ALL &&
|
|
|
|
flush_state <= COMMIT_TRANS)
|
2010-10-16 04:52:49 +08:00
|
|
|
goto again;
|
|
|
|
|
|
|
|
out:
|
2013-02-08 05:06:02 +08:00
|
|
|
if (ret == -ENOSPC &&
|
|
|
|
unlikely(root->orphan_cleanup_state == ORPHAN_CLEANUP_STARTED)) {
|
|
|
|
struct btrfs_block_rsv *global_rsv =
|
|
|
|
&root->fs_info->global_block_rsv;
|
|
|
|
|
|
|
|
if (block_rsv != global_rsv &&
|
|
|
|
!block_rsv_use_bytes(global_rsv, orig_bytes))
|
|
|
|
ret = 0;
|
|
|
|
}
|
2011-06-08 04:07:44 +08:00
|
|
|
if (flushing) {
|
2010-10-16 04:52:49 +08:00
|
|
|
spin_lock(&space_info->lock);
|
2011-06-08 04:07:44 +08:00
|
|
|
space_info->flush = 0;
|
|
|
|
wake_up_all(&space_info->wait);
|
2010-10-16 04:52:49 +08:00
|
|
|
spin_unlock(&space_info->lock);
|
2010-05-16 22:46:25 +08:00
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2012-03-12 23:03:00 +08:00
|
|
|
static struct btrfs_block_rsv *get_block_rsv(
|
|
|
|
const struct btrfs_trans_handle *trans,
|
|
|
|
const struct btrfs_root *root)
|
2010-05-16 22:46:25 +08:00
|
|
|
{
|
2011-08-30 23:31:29 +08:00
|
|
|
struct btrfs_block_rsv *block_rsv = NULL;
|
|
|
|
|
2012-06-27 04:13:18 +08:00
|
|
|
if (root->ref_cows)
|
|
|
|
block_rsv = trans->block_rsv;
|
|
|
|
|
|
|
|
if (root == root->fs_info->csum_root && trans->adding_csums)
|
2010-05-16 22:46:25 +08:00
|
|
|
block_rsv = trans->block_rsv;
|
2011-08-30 23:31:29 +08:00
|
|
|
|
|
|
|
if (!block_rsv)
|
2010-05-16 22:46:25 +08:00
|
|
|
block_rsv = root->block_rsv;
|
|
|
|
|
|
|
|
if (!block_rsv)
|
|
|
|
block_rsv = &root->fs_info->empty_block_rsv;
|
|
|
|
|
|
|
|
return block_rsv;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int block_rsv_use_bytes(struct btrfs_block_rsv *block_rsv,
|
|
|
|
u64 num_bytes)
|
|
|
|
{
|
|
|
|
int ret = -ENOSPC;
|
|
|
|
spin_lock(&block_rsv->lock);
|
|
|
|
if (block_rsv->reserved >= num_bytes) {
|
|
|
|
block_rsv->reserved -= num_bytes;
|
|
|
|
if (block_rsv->reserved < block_rsv->size)
|
|
|
|
block_rsv->full = 0;
|
|
|
|
ret = 0;
|
|
|
|
}
|
|
|
|
spin_unlock(&block_rsv->lock);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void block_rsv_add_bytes(struct btrfs_block_rsv *block_rsv,
|
|
|
|
u64 num_bytes, int update_size)
|
|
|
|
{
|
|
|
|
spin_lock(&block_rsv->lock);
|
|
|
|
block_rsv->reserved += num_bytes;
|
|
|
|
if (update_size)
|
|
|
|
block_rsv->size += num_bytes;
|
|
|
|
else if (block_rsv->reserved >= block_rsv->size)
|
|
|
|
block_rsv->full = 1;
|
|
|
|
spin_unlock(&block_rsv->lock);
|
|
|
|
}
|
|
|
|
|
2012-01-10 23:31:31 +08:00
|
|
|
static void block_rsv_release_bytes(struct btrfs_fs_info *fs_info,
|
|
|
|
struct btrfs_block_rsv *block_rsv,
|
2011-04-20 21:52:26 +08:00
|
|
|
struct btrfs_block_rsv *dest, u64 num_bytes)
|
2010-05-16 22:46:25 +08:00
|
|
|
{
|
|
|
|
struct btrfs_space_info *space_info = block_rsv->space_info;
|
|
|
|
|
|
|
|
spin_lock(&block_rsv->lock);
|
|
|
|
if (num_bytes == (u64)-1)
|
|
|
|
num_bytes = block_rsv->size;
|
|
|
|
block_rsv->size -= num_bytes;
|
|
|
|
if (block_rsv->reserved >= block_rsv->size) {
|
|
|
|
num_bytes = block_rsv->reserved - block_rsv->size;
|
|
|
|
block_rsv->reserved = block_rsv->size;
|
|
|
|
block_rsv->full = 1;
|
|
|
|
} else {
|
|
|
|
num_bytes = 0;
|
|
|
|
}
|
|
|
|
spin_unlock(&block_rsv->lock);
|
|
|
|
|
|
|
|
if (num_bytes > 0) {
|
|
|
|
if (dest) {
|
2011-01-25 05:43:19 +08:00
|
|
|
spin_lock(&dest->lock);
|
|
|
|
if (!dest->full) {
|
|
|
|
u64 bytes_to_add;
|
|
|
|
|
|
|
|
bytes_to_add = dest->size - dest->reserved;
|
|
|
|
bytes_to_add = min(num_bytes, bytes_to_add);
|
|
|
|
dest->reserved += bytes_to_add;
|
|
|
|
if (dest->reserved >= dest->size)
|
|
|
|
dest->full = 1;
|
|
|
|
num_bytes -= bytes_to_add;
|
|
|
|
}
|
|
|
|
spin_unlock(&dest->lock);
|
|
|
|
}
|
|
|
|
if (num_bytes) {
|
2010-05-16 22:46:25 +08:00
|
|
|
spin_lock(&space_info->lock);
|
2011-07-27 05:00:46 +08:00
|
|
|
space_info->bytes_may_use -= num_bytes;
|
2012-01-10 23:31:31 +08:00
|
|
|
trace_btrfs_space_reservation(fs_info, "space_info",
|
2012-03-29 21:57:44 +08:00
|
|
|
space_info->flags, num_bytes, 0);
|
2011-03-12 20:08:42 +08:00
|
|
|
space_info->reservation_progress++;
|
2010-05-16 22:46:25 +08:00
|
|
|
spin_unlock(&space_info->lock);
|
2009-02-20 23:59:53 +08:00
|
|
|
}
|
2009-09-12 04:12:44 +08:00
|
|
|
}
|
2010-05-16 22:46:25 +08:00
|
|
|
}
|
2009-02-20 23:59:53 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
static int block_rsv_migrate_bytes(struct btrfs_block_rsv *src,
|
|
|
|
struct btrfs_block_rsv *dst, u64 num_bytes)
|
|
|
|
{
|
|
|
|
int ret;
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
ret = block_rsv_use_bytes(src, num_bytes);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
block_rsv_add_bytes(dst, num_bytes, 1);
|
2009-09-12 04:12:44 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2012-09-06 18:02:28 +08:00
|
|
|
void btrfs_init_block_rsv(struct btrfs_block_rsv *rsv, unsigned short type)
|
2009-09-12 04:12:44 +08:00
|
|
|
{
|
2010-05-16 22:46:25 +08:00
|
|
|
memset(rsv, 0, sizeof(*rsv));
|
|
|
|
spin_lock_init(&rsv->lock);
|
2012-09-06 18:02:28 +08:00
|
|
|
rsv->type = type;
|
2010-05-16 22:46:25 +08:00
|
|
|
}
|
|
|
|
|
2012-09-06 18:02:28 +08:00
|
|
|
struct btrfs_block_rsv *btrfs_alloc_block_rsv(struct btrfs_root *root,
|
|
|
|
unsigned short type)
|
2010-05-16 22:46:25 +08:00
|
|
|
{
|
|
|
|
struct btrfs_block_rsv *block_rsv;
|
|
|
|
struct btrfs_fs_info *fs_info = root->fs_info;
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
block_rsv = kmalloc(sizeof(*block_rsv), GFP_NOFS);
|
|
|
|
if (!block_rsv)
|
|
|
|
return NULL;
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2012-09-06 18:02:28 +08:00
|
|
|
btrfs_init_block_rsv(block_rsv, type);
|
2010-05-16 22:46:25 +08:00
|
|
|
block_rsv->space_info = __find_space_info(fs_info,
|
|
|
|
BTRFS_BLOCK_GROUP_METADATA);
|
|
|
|
return block_rsv;
|
|
|
|
}
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
void btrfs_free_block_rsv(struct btrfs_root *root,
|
|
|
|
struct btrfs_block_rsv *rsv)
|
|
|
|
{
|
2012-08-30 02:27:18 +08:00
|
|
|
if (!rsv)
|
|
|
|
return;
|
2011-08-09 00:50:18 +08:00
|
|
|
btrfs_block_rsv_release(root, rsv, (u64)-1);
|
|
|
|
kfree(rsv);
|
2009-09-12 04:12:44 +08:00
|
|
|
}
|
|
|
|
|
Btrfs: improve the noflush reservation
In some places(such as: evicting inode), we just can not flush the reserved
space of delalloc, flushing the delayed directory index and delayed inode
is OK, but we don't try to flush those things and just go back when there is
no enough space to be reserved. This patch fixes this problem.
We defined 3 types of the flush operations: NO_FLUSH, FLUSH_LIMIT and FLUSH_ALL.
If we can in the transaction, we should not flush anything, or the deadlock
would happen, so use NO_FLUSH. If we flushing the reserved space of delalloc
would cause deadlock, use FLUSH_LIMIT. In the other cases, FLUSH_ALL is used,
and we will flush all things.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-10-16 19:33:38 +08:00
|
|
|
int btrfs_block_rsv_add(struct btrfs_root *root,
|
|
|
|
struct btrfs_block_rsv *block_rsv, u64 num_bytes,
|
|
|
|
enum btrfs_reserve_flush_enum flush)
|
2009-09-12 04:12:44 +08:00
|
|
|
{
|
2010-05-16 22:46:25 +08:00
|
|
|
int ret;
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
if (num_bytes == 0)
|
|
|
|
return 0;
|
2010-10-16 04:52:49 +08:00
|
|
|
|
2011-11-11 09:45:05 +08:00
|
|
|
ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush);
|
2010-05-16 22:46:25 +08:00
|
|
|
if (!ret) {
|
|
|
|
block_rsv_add_bytes(block_rsv, num_bytes, 1);
|
|
|
|
return 0;
|
|
|
|
}
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
return ret;
|
|
|
|
}
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2011-08-31 00:34:28 +08:00
|
|
|
int btrfs_block_rsv_check(struct btrfs_root *root,
|
2011-10-19 00:15:48 +08:00
|
|
|
struct btrfs_block_rsv *block_rsv, int min_factor)
|
2010-05-16 22:46:25 +08:00
|
|
|
{
|
|
|
|
u64 num_bytes = 0;
|
|
|
|
int ret = -ENOSPC;
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
if (!block_rsv)
|
|
|
|
return 0;
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
spin_lock(&block_rsv->lock);
|
2011-10-19 00:15:48 +08:00
|
|
|
num_bytes = div_factor(block_rsv->size, min_factor);
|
|
|
|
if (block_rsv->reserved >= num_bytes)
|
|
|
|
ret = 0;
|
|
|
|
spin_unlock(&block_rsv->lock);
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2011-10-19 00:15:48 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
Btrfs: improve the noflush reservation
In some places(such as: evicting inode), we just can not flush the reserved
space of delalloc, flushing the delayed directory index and delayed inode
is OK, but we don't try to flush those things and just go back when there is
no enough space to be reserved. This patch fixes this problem.
We defined 3 types of the flush operations: NO_FLUSH, FLUSH_LIMIT and FLUSH_ALL.
If we can in the transaction, we should not flush anything, or the deadlock
would happen, so use NO_FLUSH. If we flushing the reserved space of delalloc
would cause deadlock, use FLUSH_LIMIT. In the other cases, FLUSH_ALL is used,
and we will flush all things.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-10-16 19:33:38 +08:00
|
|
|
int btrfs_block_rsv_refill(struct btrfs_root *root,
|
|
|
|
struct btrfs_block_rsv *block_rsv, u64 min_reserved,
|
|
|
|
enum btrfs_reserve_flush_enum flush)
|
2011-10-19 00:15:48 +08:00
|
|
|
{
|
|
|
|
u64 num_bytes = 0;
|
|
|
|
int ret = -ENOSPC;
|
|
|
|
|
|
|
|
if (!block_rsv)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
spin_lock(&block_rsv->lock);
|
|
|
|
num_bytes = min_reserved;
|
2011-08-09 01:33:21 +08:00
|
|
|
if (block_rsv->reserved >= num_bytes)
|
2010-05-16 22:46:25 +08:00
|
|
|
ret = 0;
|
2011-08-09 01:33:21 +08:00
|
|
|
else
|
2010-05-16 22:46:25 +08:00
|
|
|
num_bytes -= block_rsv->reserved;
|
|
|
|
spin_unlock(&block_rsv->lock);
|
2011-08-09 01:33:21 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
if (!ret)
|
|
|
|
return 0;
|
|
|
|
|
2011-11-18 17:43:00 +08:00
|
|
|
ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush);
|
2011-08-09 00:50:18 +08:00
|
|
|
if (!ret) {
|
|
|
|
block_rsv_add_bytes(block_rsv, num_bytes, 0);
|
2010-05-16 22:46:25 +08:00
|
|
|
return 0;
|
2009-02-21 00:00:09 +08:00
|
|
|
}
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2011-08-09 01:33:21 +08:00
|
|
|
return ret;
|
2010-05-16 22:46:25 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
int btrfs_block_rsv_migrate(struct btrfs_block_rsv *src_rsv,
|
|
|
|
struct btrfs_block_rsv *dst_rsv,
|
|
|
|
u64 num_bytes)
|
|
|
|
{
|
|
|
|
return block_rsv_migrate_bytes(src_rsv, dst_rsv, num_bytes);
|
|
|
|
}
|
|
|
|
|
|
|
|
void btrfs_block_rsv_release(struct btrfs_root *root,
|
|
|
|
struct btrfs_block_rsv *block_rsv,
|
|
|
|
u64 num_bytes)
|
|
|
|
{
|
|
|
|
struct btrfs_block_rsv *global_rsv = &root->fs_info->global_block_rsv;
|
|
|
|
if (global_rsv->full || global_rsv == block_rsv ||
|
|
|
|
block_rsv->space_info != global_rsv->space_info)
|
|
|
|
global_rsv = NULL;
|
2012-01-10 23:31:31 +08:00
|
|
|
block_rsv_release_bytes(root->fs_info, block_rsv, global_rsv,
|
|
|
|
num_bytes);
|
2009-02-21 00:00:09 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2010-05-16 22:49:58 +08:00
|
|
|
* helper to calculate size of global block reservation.
|
|
|
|
* the desired value is sum of space used by extent tree,
|
|
|
|
* checksum tree and root tree
|
2009-02-21 00:00:09 +08:00
|
|
|
*/
|
2010-05-16 22:49:58 +08:00
|
|
|
static u64 calc_global_metadata_size(struct btrfs_fs_info *fs_info)
|
2009-02-21 00:00:09 +08:00
|
|
|
{
|
2010-05-16 22:49:58 +08:00
|
|
|
struct btrfs_space_info *sinfo;
|
|
|
|
u64 num_bytes;
|
|
|
|
u64 meta_used;
|
|
|
|
u64 data_used;
|
2011-04-13 21:41:04 +08:00
|
|
|
int csum_size = btrfs_super_csum_size(fs_info->super_copy);
|
2009-02-21 00:00:09 +08:00
|
|
|
|
2010-05-16 22:49:58 +08:00
|
|
|
sinfo = __find_space_info(fs_info, BTRFS_BLOCK_GROUP_DATA);
|
|
|
|
spin_lock(&sinfo->lock);
|
|
|
|
data_used = sinfo->bytes_used;
|
|
|
|
spin_unlock(&sinfo->lock);
|
2009-09-23 02:45:50 +08:00
|
|
|
|
2010-05-16 22:49:58 +08:00
|
|
|
sinfo = __find_space_info(fs_info, BTRFS_BLOCK_GROUP_METADATA);
|
|
|
|
spin_lock(&sinfo->lock);
|
2010-10-16 03:13:32 +08:00
|
|
|
if (sinfo->flags & BTRFS_BLOCK_GROUP_DATA)
|
|
|
|
data_used = 0;
|
2010-05-16 22:49:58 +08:00
|
|
|
meta_used = sinfo->bytes_used;
|
|
|
|
spin_unlock(&sinfo->lock);
|
2010-03-19 22:38:13 +08:00
|
|
|
|
2010-05-16 22:49:58 +08:00
|
|
|
num_bytes = (data_used >> fs_info->sb->s_blocksize_bits) *
|
|
|
|
csum_size * 2;
|
|
|
|
num_bytes += div64_u64(data_used + meta_used, 50);
|
2009-02-20 23:59:53 +08:00
|
|
|
|
2010-05-16 22:49:58 +08:00
|
|
|
if (num_bytes * 3 > meta_used)
|
2012-04-13 01:46:48 +08:00
|
|
|
num_bytes = div64_u64(meta_used, 3);
|
2010-03-19 22:38:13 +08:00
|
|
|
|
2010-05-16 22:49:58 +08:00
|
|
|
return ALIGN(num_bytes, fs_info->extent_root->leafsize << 10);
|
|
|
|
}
|
2009-02-21 00:00:09 +08:00
|
|
|
|
2010-05-16 22:49:58 +08:00
|
|
|
static void update_global_block_rsv(struct btrfs_fs_info *fs_info)
|
|
|
|
{
|
|
|
|
struct btrfs_block_rsv *block_rsv = &fs_info->global_block_rsv;
|
|
|
|
struct btrfs_space_info *sinfo = block_rsv->space_info;
|
|
|
|
u64 num_bytes;
|
2009-02-21 00:00:09 +08:00
|
|
|
|
2010-05-16 22:49:58 +08:00
|
|
|
num_bytes = calc_global_metadata_size(fs_info);
|
2009-09-23 02:45:50 +08:00
|
|
|
|
2010-05-16 22:49:58 +08:00
|
|
|
spin_lock(&sinfo->lock);
|
2012-04-28 00:41:46 +08:00
|
|
|
spin_lock(&block_rsv->lock);
|
2009-02-20 23:59:53 +08:00
|
|
|
|
2010-05-16 22:49:58 +08:00
|
|
|
block_rsv->size = num_bytes;
|
2009-02-20 23:59:53 +08:00
|
|
|
|
2010-05-16 22:49:58 +08:00
|
|
|
num_bytes = sinfo->bytes_used + sinfo->bytes_pinned +
|
2010-10-16 03:13:32 +08:00
|
|
|
sinfo->bytes_reserved + sinfo->bytes_readonly +
|
|
|
|
sinfo->bytes_may_use;
|
2010-05-16 22:49:58 +08:00
|
|
|
|
|
|
|
if (sinfo->total_bytes > num_bytes) {
|
|
|
|
num_bytes = sinfo->total_bytes - num_bytes;
|
|
|
|
block_rsv->reserved += num_bytes;
|
2011-07-27 05:00:46 +08:00
|
|
|
sinfo->bytes_may_use += num_bytes;
|
2012-01-10 23:31:31 +08:00
|
|
|
trace_btrfs_space_reservation(fs_info, "space_info",
|
2012-03-29 21:57:44 +08:00
|
|
|
sinfo->flags, num_bytes, 1);
|
2009-02-21 00:00:09 +08:00
|
|
|
}
|
|
|
|
|
2010-05-16 22:49:58 +08:00
|
|
|
if (block_rsv->reserved >= block_rsv->size) {
|
|
|
|
num_bytes = block_rsv->reserved - block_rsv->size;
|
2011-07-27 05:00:46 +08:00
|
|
|
sinfo->bytes_may_use -= num_bytes;
|
2012-01-10 23:31:31 +08:00
|
|
|
trace_btrfs_space_reservation(fs_info, "space_info",
|
2012-03-29 21:57:44 +08:00
|
|
|
sinfo->flags, num_bytes, 0);
|
2011-03-12 20:08:42 +08:00
|
|
|
sinfo->reservation_progress++;
|
2010-05-16 22:49:58 +08:00
|
|
|
block_rsv->reserved = block_rsv->size;
|
|
|
|
block_rsv->full = 1;
|
|
|
|
}
|
2011-05-05 19:13:16 +08:00
|
|
|
|
2010-05-16 22:49:58 +08:00
|
|
|
spin_unlock(&block_rsv->lock);
|
2012-04-28 00:41:46 +08:00
|
|
|
spin_unlock(&sinfo->lock);
|
2009-02-21 00:00:09 +08:00
|
|
|
}
|
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
static void init_global_block_rsv(struct btrfs_fs_info *fs_info)
|
2009-02-21 00:00:09 +08:00
|
|
|
{
|
2010-05-16 22:46:25 +08:00
|
|
|
struct btrfs_space_info *space_info;
|
2009-02-21 00:00:09 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
space_info = __find_space_info(fs_info, BTRFS_BLOCK_GROUP_SYSTEM);
|
|
|
|
fs_info->chunk_block_rsv.space_info = space_info;
|
2009-02-21 00:00:09 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
space_info = __find_space_info(fs_info, BTRFS_BLOCK_GROUP_METADATA);
|
2010-05-16 22:49:58 +08:00
|
|
|
fs_info->global_block_rsv.space_info = space_info;
|
|
|
|
fs_info->delalloc_block_rsv.space_info = space_info;
|
2010-05-16 22:46:25 +08:00
|
|
|
fs_info->trans_block_rsv.space_info = space_info;
|
|
|
|
fs_info->empty_block_rsv.space_info = space_info;
|
2011-11-04 10:54:25 +08:00
|
|
|
fs_info->delayed_block_rsv.space_info = space_info;
|
2010-05-16 22:46:25 +08:00
|
|
|
|
2010-05-16 22:49:58 +08:00
|
|
|
fs_info->extent_root->block_rsv = &fs_info->global_block_rsv;
|
|
|
|
fs_info->csum_root->block_rsv = &fs_info->global_block_rsv;
|
|
|
|
fs_info->dev_root->block_rsv = &fs_info->global_block_rsv;
|
|
|
|
fs_info->tree_root->block_rsv = &fs_info->global_block_rsv;
|
2010-05-16 22:46:25 +08:00
|
|
|
fs_info->chunk_root->block_rsv = &fs_info->chunk_block_rsv;
|
2010-05-16 22:49:58 +08:00
|
|
|
|
|
|
|
update_global_block_rsv(fs_info);
|
2009-02-21 00:00:09 +08:00
|
|
|
}
|
|
|
|
|
2010-05-16 22:49:58 +08:00
|
|
|
static void release_global_block_rsv(struct btrfs_fs_info *fs_info)
|
2009-02-21 00:00:09 +08:00
|
|
|
{
|
2012-01-10 23:31:31 +08:00
|
|
|
block_rsv_release_bytes(fs_info, &fs_info->global_block_rsv, NULL,
|
|
|
|
(u64)-1);
|
2010-05-16 22:49:58 +08:00
|
|
|
WARN_ON(fs_info->delalloc_block_rsv.size > 0);
|
|
|
|
WARN_ON(fs_info->delalloc_block_rsv.reserved > 0);
|
|
|
|
WARN_ON(fs_info->trans_block_rsv.size > 0);
|
|
|
|
WARN_ON(fs_info->trans_block_rsv.reserved > 0);
|
|
|
|
WARN_ON(fs_info->chunk_block_rsv.size > 0);
|
|
|
|
WARN_ON(fs_info->chunk_block_rsv.reserved > 0);
|
2011-11-04 10:54:25 +08:00
|
|
|
WARN_ON(fs_info->delayed_block_rsv.size > 0);
|
|
|
|
WARN_ON(fs_info->delayed_block_rsv.reserved > 0);
|
2011-05-03 22:40:22 +08:00
|
|
|
}
|
|
|
|
|
2010-05-16 22:48:46 +08:00
|
|
|
void btrfs_trans_release_metadata(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root)
|
2009-02-21 00:00:09 +08:00
|
|
|
{
|
2012-06-27 04:13:18 +08:00
|
|
|
if (!trans->block_rsv)
|
|
|
|
return;
|
|
|
|
|
2010-05-16 22:48:46 +08:00
|
|
|
if (!trans->bytes_reserved)
|
|
|
|
return;
|
2009-02-21 00:00:09 +08:00
|
|
|
|
2012-02-24 23:39:05 +08:00
|
|
|
trace_btrfs_space_reservation(root->fs_info, "transaction",
|
2012-03-29 21:57:44 +08:00
|
|
|
trans->transid, trans->bytes_reserved, 0);
|
2011-10-15 02:40:17 +08:00
|
|
|
btrfs_block_rsv_release(root, trans->block_rsv, trans->bytes_reserved);
|
2010-05-16 22:48:46 +08:00
|
|
|
trans->bytes_reserved = 0;
|
|
|
|
}
|
2009-02-21 00:00:09 +08:00
|
|
|
|
2012-03-12 23:03:00 +08:00
|
|
|
/* Can only return 0 or -ENOSPC */
|
2010-05-16 22:49:58 +08:00
|
|
|
int btrfs_orphan_reserve_metadata(struct btrfs_trans_handle *trans,
|
|
|
|
struct inode *inode)
|
|
|
|
{
|
|
|
|
struct btrfs_root *root = BTRFS_I(inode)->root;
|
|
|
|
struct btrfs_block_rsv *src_rsv = get_block_rsv(trans, root);
|
|
|
|
struct btrfs_block_rsv *dst_rsv = root->orphan_block_rsv;
|
|
|
|
|
|
|
|
/*
|
2011-05-03 22:40:22 +08:00
|
|
|
* We need to hold space in order to delete our orphan item once we've
|
|
|
|
* added it, so this takes the reservation so we can release it later
|
|
|
|
* when we are truly done with the orphan item.
|
2010-05-16 22:49:58 +08:00
|
|
|
*/
|
2011-05-28 19:00:39 +08:00
|
|
|
u64 num_bytes = btrfs_calc_trans_metadata_size(root, 1);
|
2012-01-10 23:31:31 +08:00
|
|
|
trace_btrfs_space_reservation(root->fs_info, "orphan",
|
|
|
|
btrfs_ino(inode), num_bytes, 1);
|
2010-05-16 22:49:58 +08:00
|
|
|
return block_rsv_migrate_bytes(src_rsv, dst_rsv, num_bytes);
|
2009-02-21 00:00:09 +08:00
|
|
|
}
|
|
|
|
|
2010-05-16 22:49:58 +08:00
|
|
|
void btrfs_orphan_release_metadata(struct inode *inode)
|
2009-04-22 05:40:57 +08:00
|
|
|
{
|
2010-05-16 22:49:58 +08:00
|
|
|
struct btrfs_root *root = BTRFS_I(inode)->root;
|
2011-05-28 19:00:39 +08:00
|
|
|
u64 num_bytes = btrfs_calc_trans_metadata_size(root, 1);
|
2012-01-10 23:31:31 +08:00
|
|
|
trace_btrfs_space_reservation(root->fs_info, "orphan",
|
|
|
|
btrfs_ino(inode), num_bytes, 0);
|
2010-05-16 22:49:58 +08:00
|
|
|
btrfs_block_rsv_release(root, root->orphan_block_rsv, num_bytes);
|
|
|
|
}
|
2009-04-22 05:40:57 +08:00
|
|
|
|
2013-02-28 18:04:33 +08:00
|
|
|
/*
|
|
|
|
* btrfs_subvolume_reserve_metadata() - reserve space for subvolume operation
|
|
|
|
* root: the root of the parent directory
|
|
|
|
* rsv: block reservation
|
|
|
|
* items: the number of items that we need do reservation
|
|
|
|
* qgroup_reserved: used to return the reserved size in qgroup
|
|
|
|
*
|
|
|
|
* This function is used to reserve the space for snapshot/subvolume
|
|
|
|
* creation and deletion. Those operations are different with the
|
|
|
|
* common file/directory operations, they change two fs/file trees
|
|
|
|
* and root tree, the number of items that the qgroup reserves is
|
|
|
|
* different with the free space reservation. So we can not use
|
|
|
|
* the space reseravtion mechanism in start_transaction().
|
|
|
|
*/
|
|
|
|
int btrfs_subvolume_reserve_metadata(struct btrfs_root *root,
|
|
|
|
struct btrfs_block_rsv *rsv,
|
|
|
|
int items,
|
|
|
|
u64 *qgroup_reserved)
|
2010-05-16 22:48:46 +08:00
|
|
|
{
|
2013-02-28 18:04:33 +08:00
|
|
|
u64 num_bytes;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
if (root->fs_info->quota_enabled) {
|
|
|
|
/* One for parent inode, two for dir entries */
|
|
|
|
num_bytes = 3 * root->leafsize;
|
|
|
|
ret = btrfs_qgroup_reserve(root, num_bytes);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
} else {
|
|
|
|
num_bytes = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
*qgroup_reserved = num_bytes;
|
|
|
|
|
|
|
|
num_bytes = btrfs_calc_trans_metadata_size(root, items);
|
|
|
|
rsv->space_info = __find_space_info(root->fs_info,
|
|
|
|
BTRFS_BLOCK_GROUP_METADATA);
|
|
|
|
ret = btrfs_block_rsv_add(root, rsv, num_bytes,
|
|
|
|
BTRFS_RESERVE_FLUSH_ALL);
|
|
|
|
if (ret) {
|
|
|
|
if (*qgroup_reserved)
|
|
|
|
btrfs_qgroup_free(root, *qgroup_reserved);
|
|
|
|
}
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
void btrfs_subvolume_release_metadata(struct btrfs_root *root,
|
|
|
|
struct btrfs_block_rsv *rsv,
|
|
|
|
u64 qgroup_reserved)
|
|
|
|
{
|
|
|
|
btrfs_block_rsv_release(root, rsv, (u64)-1);
|
|
|
|
if (qgroup_reserved)
|
|
|
|
btrfs_qgroup_free(root, qgroup_reserved);
|
2009-04-22 05:40:57 +08:00
|
|
|
}
|
|
|
|
|
2011-08-04 22:25:02 +08:00
|
|
|
/**
|
|
|
|
* drop_outstanding_extent - drop an outstanding extent
|
|
|
|
* @inode: the inode we're dropping the extent for
|
|
|
|
*
|
|
|
|
* This is called when we are freeing up an outstanding extent, either called
|
|
|
|
* after an error or after an extent is written. This will return the number of
|
|
|
|
* reserved extents that need to be freed. This must be called with
|
|
|
|
* BTRFS_I(inode)->lock held.
|
|
|
|
*/
|
2011-07-15 23:16:44 +08:00
|
|
|
static unsigned drop_outstanding_extent(struct inode *inode)
|
|
|
|
{
|
2011-11-09 04:47:34 +08:00
|
|
|
unsigned drop_inode_space = 0;
|
2011-07-15 23:16:44 +08:00
|
|
|
unsigned dropped_extents = 0;
|
|
|
|
|
|
|
|
BUG_ON(!BTRFS_I(inode)->outstanding_extents);
|
|
|
|
BTRFS_I(inode)->outstanding_extents--;
|
|
|
|
|
2011-11-09 04:47:34 +08:00
|
|
|
if (BTRFS_I(inode)->outstanding_extents == 0 &&
|
2012-05-24 02:13:11 +08:00
|
|
|
test_and_clear_bit(BTRFS_INODE_DELALLOC_META_RESERVED,
|
|
|
|
&BTRFS_I(inode)->runtime_flags))
|
2011-11-09 04:47:34 +08:00
|
|
|
drop_inode_space = 1;
|
|
|
|
|
2011-07-15 23:16:44 +08:00
|
|
|
/*
|
|
|
|
* If we have more or the same amount of outsanding extents than we have
|
|
|
|
* reserved then we need to leave the reserved extents count alone.
|
|
|
|
*/
|
|
|
|
if (BTRFS_I(inode)->outstanding_extents >=
|
|
|
|
BTRFS_I(inode)->reserved_extents)
|
2011-11-09 04:47:34 +08:00
|
|
|
return drop_inode_space;
|
2011-07-15 23:16:44 +08:00
|
|
|
|
|
|
|
dropped_extents = BTRFS_I(inode)->reserved_extents -
|
|
|
|
BTRFS_I(inode)->outstanding_extents;
|
|
|
|
BTRFS_I(inode)->reserved_extents -= dropped_extents;
|
2011-11-09 04:47:34 +08:00
|
|
|
return dropped_extents + drop_inode_space;
|
2011-07-15 23:16:44 +08:00
|
|
|
}
|
|
|
|
|
2011-08-04 22:25:02 +08:00
|
|
|
/**
|
|
|
|
* calc_csum_metadata_size - return the amount of metada space that must be
|
|
|
|
* reserved/free'd for the given bytes.
|
|
|
|
* @inode: the inode we're manipulating
|
|
|
|
* @num_bytes: the number of bytes in question
|
|
|
|
* @reserve: 1 if we are reserving space, 0 if we are freeing space
|
|
|
|
*
|
|
|
|
* This adjusts the number of csum_bytes in the inode and then returns the
|
|
|
|
* correct amount of metadata that must either be reserved or freed. We
|
|
|
|
* calculate how many checksums we can fit into one leaf and then divide the
|
|
|
|
* number of bytes that will need to be checksumed by this value to figure out
|
|
|
|
* how many checksums will be required. If we are adding bytes then the number
|
|
|
|
* may go up and we will return the number of additional bytes that must be
|
|
|
|
* reserved. If it is going down we will return the number of bytes that must
|
|
|
|
* be freed.
|
|
|
|
*
|
|
|
|
* This must be called with BTRFS_I(inode)->lock held.
|
|
|
|
*/
|
|
|
|
static u64 calc_csum_metadata_size(struct inode *inode, u64 num_bytes,
|
|
|
|
int reserve)
|
2008-03-25 03:01:59 +08:00
|
|
|
{
|
2011-08-04 22:25:02 +08:00
|
|
|
struct btrfs_root *root = BTRFS_I(inode)->root;
|
|
|
|
u64 csum_size;
|
|
|
|
int num_csums_per_leaf;
|
|
|
|
int num_csums;
|
|
|
|
int old_csums;
|
|
|
|
|
|
|
|
if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM &&
|
|
|
|
BTRFS_I(inode)->csum_bytes == 0)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
old_csums = (int)div64_u64(BTRFS_I(inode)->csum_bytes, root->sectorsize);
|
|
|
|
if (reserve)
|
|
|
|
BTRFS_I(inode)->csum_bytes += num_bytes;
|
|
|
|
else
|
|
|
|
BTRFS_I(inode)->csum_bytes -= num_bytes;
|
|
|
|
csum_size = BTRFS_LEAF_DATA_SIZE(root) - sizeof(struct btrfs_item);
|
|
|
|
num_csums_per_leaf = (int)div64_u64(csum_size,
|
|
|
|
sizeof(struct btrfs_csum_item) +
|
|
|
|
sizeof(struct btrfs_disk_key));
|
|
|
|
num_csums = (int)div64_u64(BTRFS_I(inode)->csum_bytes, root->sectorsize);
|
|
|
|
num_csums = num_csums + num_csums_per_leaf - 1;
|
|
|
|
num_csums = num_csums / num_csums_per_leaf;
|
|
|
|
|
|
|
|
old_csums = old_csums + num_csums_per_leaf - 1;
|
|
|
|
old_csums = old_csums / num_csums_per_leaf;
|
|
|
|
|
|
|
|
/* No change, no need to reserve more */
|
|
|
|
if (old_csums == num_csums)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
if (reserve)
|
|
|
|
return btrfs_calc_trans_metadata_size(root,
|
|
|
|
num_csums - old_csums);
|
|
|
|
|
|
|
|
return btrfs_calc_trans_metadata_size(root, old_csums - num_csums);
|
2010-05-16 22:48:47 +08:00
|
|
|
}
|
2008-11-13 03:34:12 +08:00
|
|
|
|
2010-05-16 22:48:47 +08:00
|
|
|
int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes)
|
|
|
|
{
|
|
|
|
struct btrfs_root *root = BTRFS_I(inode)->root;
|
|
|
|
struct btrfs_block_rsv *block_rsv = &root->fs_info->delalloc_block_rsv;
|
2011-07-15 23:16:44 +08:00
|
|
|
u64 to_reserve = 0;
|
Btrfs: fix how we do delalloc reservations and how we free reservations on error
Running xfstests 269 with some tracing my scripts kept spitting out errors about
releasing bytes that we didn't actually have reserved. This took me down a huge
rabbit hole and it turns out the way we deal with reserved_extents is wrong,
we need to only be setting it if the reservation succeeds, otherwise the free()
method will come in and unreserve space that isn't actually reserved yet, which
can lead to other warnings and such. The math was all working out right in the
end, but it caused all sorts of other issues in addition to making my scripts
yell and scream and generally make it impossible for me to track down the
original issue I was looking for. The other problem is with our error handling
in the reservation code. There are two cases that we need to deal with
1) We raced with free. In this case free won't free anything because csum_bytes
is modified before we dro the lock in our reservation path, so free rightly
doesn't release any space because the reservation code may be depending on that
reservation. However if we fail, we need the reservation side to do the free at
that point since that space is no longer in use. So as it stands the code was
doing this fine and it worked out, except in case #2
2) We don't race with free. Nobody comes in and changes anything, and our
reservation fails. In this case we didn't reserve anything anyway and we just
need to clean up csum_bytes but not free anything. So we keep track of
csum_bytes before we drop the lock and if it hasn't changed we know we can just
decrement csum_bytes and carry on.
Because of the case where we can race with free()'s since we have to drop our
spin_lock to do the reservation, I'm going to serialize all reservations with
the i_mutex. We already get this for free in the heavy use paths, truncate and
file write all hold the i_mutex, just needed to add it to page_mkwrite and
various ioctl/balance things. With this patch my space leak scripts no longer
scream bloody murder. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
2011-12-10 00:18:51 +08:00
|
|
|
u64 csum_bytes;
|
2011-07-15 23:16:44 +08:00
|
|
|
unsigned nr_extents = 0;
|
Btrfs: fix how we do delalloc reservations and how we free reservations on error
Running xfstests 269 with some tracing my scripts kept spitting out errors about
releasing bytes that we didn't actually have reserved. This took me down a huge
rabbit hole and it turns out the way we deal with reserved_extents is wrong,
we need to only be setting it if the reservation succeeds, otherwise the free()
method will come in and unreserve space that isn't actually reserved yet, which
can lead to other warnings and such. The math was all working out right in the
end, but it caused all sorts of other issues in addition to making my scripts
yell and scream and generally make it impossible for me to track down the
original issue I was looking for. The other problem is with our error handling
in the reservation code. There are two cases that we need to deal with
1) We raced with free. In this case free won't free anything because csum_bytes
is modified before we dro the lock in our reservation path, so free rightly
doesn't release any space because the reservation code may be depending on that
reservation. However if we fail, we need the reservation side to do the free at
that point since that space is no longer in use. So as it stands the code was
doing this fine and it worked out, except in case #2
2) We don't race with free. Nobody comes in and changes anything, and our
reservation fails. In this case we didn't reserve anything anyway and we just
need to clean up csum_bytes but not free anything. So we keep track of
csum_bytes before we drop the lock and if it hasn't changed we know we can just
decrement csum_bytes and carry on.
Because of the case where we can race with free()'s since we have to drop our
spin_lock to do the reservation, I'm going to serialize all reservations with
the i_mutex. We already get this for free in the heavy use paths, truncate and
file write all hold the i_mutex, just needed to add it to page_mkwrite and
various ioctl/balance things. With this patch my space leak scripts no longer
scream bloody murder. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
2011-12-10 00:18:51 +08:00
|
|
|
int extra_reserve = 0;
|
Btrfs: improve the noflush reservation
In some places(such as: evicting inode), we just can not flush the reserved
space of delalloc, flushing the delayed directory index and delayed inode
is OK, but we don't try to flush those things and just go back when there is
no enough space to be reserved. This patch fixes this problem.
We defined 3 types of the flush operations: NO_FLUSH, FLUSH_LIMIT and FLUSH_ALL.
If we can in the transaction, we should not flush anything, or the deadlock
would happen, so use NO_FLUSH. If we flushing the reserved space of delalloc
would cause deadlock, use FLUSH_LIMIT. In the other cases, FLUSH_ALL is used,
and we will flush all things.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-10-16 19:33:38 +08:00
|
|
|
enum btrfs_reserve_flush_enum flush = BTRFS_RESERVE_FLUSH_ALL;
|
2013-01-28 14:26:00 +08:00
|
|
|
int ret = 0;
|
2012-12-15 02:48:14 +08:00
|
|
|
bool delalloc_lock = true;
|
2013-03-01 19:36:01 +08:00
|
|
|
u64 to_free = 0;
|
|
|
|
unsigned dropped;
|
2008-03-25 03:01:59 +08:00
|
|
|
|
2012-12-15 02:48:14 +08:00
|
|
|
/* If we are a free space inode we need to not flush since we will be in
|
|
|
|
* the middle of a transaction commit. We also don't need the delalloc
|
|
|
|
* mutex since we won't race with anybody. We need this mostly to make
|
|
|
|
* lockdep shut its filthy mouth.
|
|
|
|
*/
|
|
|
|
if (btrfs_is_free_space_inode(inode)) {
|
Btrfs: improve the noflush reservation
In some places(such as: evicting inode), we just can not flush the reserved
space of delalloc, flushing the delayed directory index and delayed inode
is OK, but we don't try to flush those things and just go back when there is
no enough space to be reserved. This patch fixes this problem.
We defined 3 types of the flush operations: NO_FLUSH, FLUSH_LIMIT and FLUSH_ALL.
If we can in the transaction, we should not flush anything, or the deadlock
would happen, so use NO_FLUSH. If we flushing the reserved space of delalloc
would cause deadlock, use FLUSH_LIMIT. In the other cases, FLUSH_ALL is used,
and we will flush all things.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-10-16 19:33:38 +08:00
|
|
|
flush = BTRFS_RESERVE_NO_FLUSH;
|
2012-12-15 02:48:14 +08:00
|
|
|
delalloc_lock = false;
|
|
|
|
}
|
2011-08-30 22:19:10 +08:00
|
|
|
|
Btrfs: improve the noflush reservation
In some places(such as: evicting inode), we just can not flush the reserved
space of delalloc, flushing the delayed directory index and delayed inode
is OK, but we don't try to flush those things and just go back when there is
no enough space to be reserved. This patch fixes this problem.
We defined 3 types of the flush operations: NO_FLUSH, FLUSH_LIMIT and FLUSH_ALL.
If we can in the transaction, we should not flush anything, or the deadlock
would happen, so use NO_FLUSH. If we flushing the reserved space of delalloc
would cause deadlock, use FLUSH_LIMIT. In the other cases, FLUSH_ALL is used,
and we will flush all things.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-10-16 19:33:38 +08:00
|
|
|
if (flush != BTRFS_RESERVE_NO_FLUSH &&
|
|
|
|
btrfs_transaction_in_commit(root->fs_info))
|
2010-05-16 22:48:47 +08:00
|
|
|
schedule_timeout(1);
|
2008-04-29 03:29:52 +08:00
|
|
|
|
2012-12-15 02:48:14 +08:00
|
|
|
if (delalloc_lock)
|
|
|
|
mutex_lock(&BTRFS_I(inode)->delalloc_mutex);
|
|
|
|
|
2010-05-16 22:48:47 +08:00
|
|
|
num_bytes = ALIGN(num_bytes, root->sectorsize);
|
2010-10-16 04:52:49 +08:00
|
|
|
|
2011-07-15 23:16:44 +08:00
|
|
|
spin_lock(&BTRFS_I(inode)->lock);
|
|
|
|
BTRFS_I(inode)->outstanding_extents++;
|
|
|
|
|
|
|
|
if (BTRFS_I(inode)->outstanding_extents >
|
Btrfs: fix how we do delalloc reservations and how we free reservations on error
Running xfstests 269 with some tracing my scripts kept spitting out errors about
releasing bytes that we didn't actually have reserved. This took me down a huge
rabbit hole and it turns out the way we deal with reserved_extents is wrong,
we need to only be setting it if the reservation succeeds, otherwise the free()
method will come in and unreserve space that isn't actually reserved yet, which
can lead to other warnings and such. The math was all working out right in the
end, but it caused all sorts of other issues in addition to making my scripts
yell and scream and generally make it impossible for me to track down the
original issue I was looking for. The other problem is with our error handling
in the reservation code. There are two cases that we need to deal with
1) We raced with free. In this case free won't free anything because csum_bytes
is modified before we dro the lock in our reservation path, so free rightly
doesn't release any space because the reservation code may be depending on that
reservation. However if we fail, we need the reservation side to do the free at
that point since that space is no longer in use. So as it stands the code was
doing this fine and it worked out, except in case #2
2) We don't race with free. Nobody comes in and changes anything, and our
reservation fails. In this case we didn't reserve anything anyway and we just
need to clean up csum_bytes but not free anything. So we keep track of
csum_bytes before we drop the lock and if it hasn't changed we know we can just
decrement csum_bytes and carry on.
Because of the case where we can race with free()'s since we have to drop our
spin_lock to do the reservation, I'm going to serialize all reservations with
the i_mutex. We already get this for free in the heavy use paths, truncate and
file write all hold the i_mutex, just needed to add it to page_mkwrite and
various ioctl/balance things. With this patch my space leak scripts no longer
scream bloody murder. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
2011-12-10 00:18:51 +08:00
|
|
|
BTRFS_I(inode)->reserved_extents)
|
2011-07-15 23:16:44 +08:00
|
|
|
nr_extents = BTRFS_I(inode)->outstanding_extents -
|
|
|
|
BTRFS_I(inode)->reserved_extents;
|
2011-01-26 05:30:38 +08:00
|
|
|
|
2011-11-09 04:47:34 +08:00
|
|
|
/*
|
|
|
|
* Add an item to reserve for updating the inode when we complete the
|
|
|
|
* delalloc io.
|
|
|
|
*/
|
2012-05-24 02:13:11 +08:00
|
|
|
if (!test_bit(BTRFS_INODE_DELALLOC_META_RESERVED,
|
|
|
|
&BTRFS_I(inode)->runtime_flags)) {
|
2011-11-09 04:47:34 +08:00
|
|
|
nr_extents++;
|
Btrfs: fix how we do delalloc reservations and how we free reservations on error
Running xfstests 269 with some tracing my scripts kept spitting out errors about
releasing bytes that we didn't actually have reserved. This took me down a huge
rabbit hole and it turns out the way we deal with reserved_extents is wrong,
we need to only be setting it if the reservation succeeds, otherwise the free()
method will come in and unreserve space that isn't actually reserved yet, which
can lead to other warnings and such. The math was all working out right in the
end, but it caused all sorts of other issues in addition to making my scripts
yell and scream and generally make it impossible for me to track down the
original issue I was looking for. The other problem is with our error handling
in the reservation code. There are two cases that we need to deal with
1) We raced with free. In this case free won't free anything because csum_bytes
is modified before we dro the lock in our reservation path, so free rightly
doesn't release any space because the reservation code may be depending on that
reservation. However if we fail, we need the reservation side to do the free at
that point since that space is no longer in use. So as it stands the code was
doing this fine and it worked out, except in case #2
2) We don't race with free. Nobody comes in and changes anything, and our
reservation fails. In this case we didn't reserve anything anyway and we just
need to clean up csum_bytes but not free anything. So we keep track of
csum_bytes before we drop the lock and if it hasn't changed we know we can just
decrement csum_bytes and carry on.
Because of the case where we can race with free()'s since we have to drop our
spin_lock to do the reservation, I'm going to serialize all reservations with
the i_mutex. We already get this for free in the heavy use paths, truncate and
file write all hold the i_mutex, just needed to add it to page_mkwrite and
various ioctl/balance things. With this patch my space leak scripts no longer
scream bloody murder. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
2011-12-10 00:18:51 +08:00
|
|
|
extra_reserve = 1;
|
2008-03-26 04:50:33 +08:00
|
|
|
}
|
2011-11-09 04:47:34 +08:00
|
|
|
|
|
|
|
to_reserve = btrfs_calc_trans_metadata_size(root, nr_extents);
|
2011-08-04 22:25:02 +08:00
|
|
|
to_reserve += calc_csum_metadata_size(inode, num_bytes, 1);
|
Btrfs: fix how we do delalloc reservations and how we free reservations on error
Running xfstests 269 with some tracing my scripts kept spitting out errors about
releasing bytes that we didn't actually have reserved. This took me down a huge
rabbit hole and it turns out the way we deal with reserved_extents is wrong,
we need to only be setting it if the reservation succeeds, otherwise the free()
method will come in and unreserve space that isn't actually reserved yet, which
can lead to other warnings and such. The math was all working out right in the
end, but it caused all sorts of other issues in addition to making my scripts
yell and scream and generally make it impossible for me to track down the
original issue I was looking for. The other problem is with our error handling
in the reservation code. There are two cases that we need to deal with
1) We raced with free. In this case free won't free anything because csum_bytes
is modified before we dro the lock in our reservation path, so free rightly
doesn't release any space because the reservation code may be depending on that
reservation. However if we fail, we need the reservation side to do the free at
that point since that space is no longer in use. So as it stands the code was
doing this fine and it worked out, except in case #2
2) We don't race with free. Nobody comes in and changes anything, and our
reservation fails. In this case we didn't reserve anything anyway and we just
need to clean up csum_bytes but not free anything. So we keep track of
csum_bytes before we drop the lock and if it hasn't changed we know we can just
decrement csum_bytes and carry on.
Because of the case where we can race with free()'s since we have to drop our
spin_lock to do the reservation, I'm going to serialize all reservations with
the i_mutex. We already get this for free in the heavy use paths, truncate and
file write all hold the i_mutex, just needed to add it to page_mkwrite and
various ioctl/balance things. With this patch my space leak scripts no longer
scream bloody murder. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
2011-12-10 00:18:51 +08:00
|
|
|
csum_bytes = BTRFS_I(inode)->csum_bytes;
|
2011-07-15 23:16:44 +08:00
|
|
|
spin_unlock(&BTRFS_I(inode)->lock);
|
2011-01-26 05:30:38 +08:00
|
|
|
|
2013-03-01 19:36:01 +08:00
|
|
|
if (root->fs_info->quota_enabled) {
|
2011-09-14 21:44:05 +08:00
|
|
|
ret = btrfs_qgroup_reserve(root, num_bytes +
|
|
|
|
nr_extents * root->leafsize);
|
2013-03-01 19:36:01 +08:00
|
|
|
if (ret)
|
|
|
|
goto out_fail;
|
|
|
|
}
|
2011-09-14 21:44:05 +08:00
|
|
|
|
2013-03-01 19:36:01 +08:00
|
|
|
ret = reserve_metadata_bytes(root, block_rsv, to_reserve, flush);
|
|
|
|
if (unlikely(ret)) {
|
|
|
|
if (root->fs_info->quota_enabled)
|
2013-03-01 19:33:01 +08:00
|
|
|
btrfs_qgroup_free(root, num_bytes +
|
|
|
|
nr_extents * root->leafsize);
|
2013-03-01 19:36:01 +08:00
|
|
|
goto out_fail;
|
2011-07-15 23:16:44 +08:00
|
|
|
}
|
Btrfs: nuke fs wide allocation mutex V2
This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch
of little locks.
There is now a pinned_mutex, which is used when messing with the pinned_extents
extent io tree, and the extent_ins_mutex which is used with the pending_del and
extent_ins extent io trees.
The locking for the extent tree stuff was inspired by a patch that Yan Zheng
wrote to fix a race condition, I cleaned it up some and changed the locking
around a little bit, but the idea remains the same. Basically instead of
holding the extent_ins_mutex throughout the processing of an extent on the
extent_ins or pending_del trees, we just hold it while we're searching and when
we clear the bits on those trees, and lock the extent for the duration of the
operations on the extent.
Also to keep from getting hung up waiting to lock an extent, I've added a
try_lock_extent so if we cannot lock the extent, move on to the next one in the
tree and we'll come back to that one. I have tested this heavily and it does
not appear to break anything. This has to be applied on top of my
find_free_extent redo patch.
I tested this patch on top of Yan's space reblancing code and it worked fine.
The only thing that has changed since the last version is I pulled out all my
debugging stuff, apparently I forgot to run guilt refresh before I sent the
last patch out. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
|
Btrfs: fix how we do delalloc reservations and how we free reservations on error
Running xfstests 269 with some tracing my scripts kept spitting out errors about
releasing bytes that we didn't actually have reserved. This took me down a huge
rabbit hole and it turns out the way we deal with reserved_extents is wrong,
we need to only be setting it if the reservation succeeds, otherwise the free()
method will come in and unreserve space that isn't actually reserved yet, which
can lead to other warnings and such. The math was all working out right in the
end, but it caused all sorts of other issues in addition to making my scripts
yell and scream and generally make it impossible for me to track down the
original issue I was looking for. The other problem is with our error handling
in the reservation code. There are two cases that we need to deal with
1) We raced with free. In this case free won't free anything because csum_bytes
is modified before we dro the lock in our reservation path, so free rightly
doesn't release any space because the reservation code may be depending on that
reservation. However if we fail, we need the reservation side to do the free at
that point since that space is no longer in use. So as it stands the code was
doing this fine and it worked out, except in case #2
2) We don't race with free. Nobody comes in and changes anything, and our
reservation fails. In this case we didn't reserve anything anyway and we just
need to clean up csum_bytes but not free anything. So we keep track of
csum_bytes before we drop the lock and if it hasn't changed we know we can just
decrement csum_bytes and carry on.
Because of the case where we can race with free()'s since we have to drop our
spin_lock to do the reservation, I'm going to serialize all reservations with
the i_mutex. We already get this for free in the heavy use paths, truncate and
file write all hold the i_mutex, just needed to add it to page_mkwrite and
various ioctl/balance things. With this patch my space leak scripts no longer
scream bloody murder. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
2011-12-10 00:18:51 +08:00
|
|
|
spin_lock(&BTRFS_I(inode)->lock);
|
|
|
|
if (extra_reserve) {
|
2012-05-24 02:13:11 +08:00
|
|
|
set_bit(BTRFS_INODE_DELALLOC_META_RESERVED,
|
|
|
|
&BTRFS_I(inode)->runtime_flags);
|
Btrfs: fix how we do delalloc reservations and how we free reservations on error
Running xfstests 269 with some tracing my scripts kept spitting out errors about
releasing bytes that we didn't actually have reserved. This took me down a huge
rabbit hole and it turns out the way we deal with reserved_extents is wrong,
we need to only be setting it if the reservation succeeds, otherwise the free()
method will come in and unreserve space that isn't actually reserved yet, which
can lead to other warnings and such. The math was all working out right in the
end, but it caused all sorts of other issues in addition to making my scripts
yell and scream and generally make it impossible for me to track down the
original issue I was looking for. The other problem is with our error handling
in the reservation code. There are two cases that we need to deal with
1) We raced with free. In this case free won't free anything because csum_bytes
is modified before we dro the lock in our reservation path, so free rightly
doesn't release any space because the reservation code may be depending on that
reservation. However if we fail, we need the reservation side to do the free at
that point since that space is no longer in use. So as it stands the code was
doing this fine and it worked out, except in case #2
2) We don't race with free. Nobody comes in and changes anything, and our
reservation fails. In this case we didn't reserve anything anyway and we just
need to clean up csum_bytes but not free anything. So we keep track of
csum_bytes before we drop the lock and if it hasn't changed we know we can just
decrement csum_bytes and carry on.
Because of the case where we can race with free()'s since we have to drop our
spin_lock to do the reservation, I'm going to serialize all reservations with
the i_mutex. We already get this for free in the heavy use paths, truncate and
file write all hold the i_mutex, just needed to add it to page_mkwrite and
various ioctl/balance things. With this patch my space leak scripts no longer
scream bloody murder. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
2011-12-10 00:18:51 +08:00
|
|
|
nr_extents--;
|
|
|
|
}
|
|
|
|
BTRFS_I(inode)->reserved_extents += nr_extents;
|
|
|
|
spin_unlock(&BTRFS_I(inode)->lock);
|
2012-12-15 02:48:14 +08:00
|
|
|
|
|
|
|
if (delalloc_lock)
|
|
|
|
mutex_unlock(&BTRFS_I(inode)->delalloc_mutex);
|
Btrfs: fix how we do delalloc reservations and how we free reservations on error
Running xfstests 269 with some tracing my scripts kept spitting out errors about
releasing bytes that we didn't actually have reserved. This took me down a huge
rabbit hole and it turns out the way we deal with reserved_extents is wrong,
we need to only be setting it if the reservation succeeds, otherwise the free()
method will come in and unreserve space that isn't actually reserved yet, which
can lead to other warnings and such. The math was all working out right in the
end, but it caused all sorts of other issues in addition to making my scripts
yell and scream and generally make it impossible for me to track down the
original issue I was looking for. The other problem is with our error handling
in the reservation code. There are two cases that we need to deal with
1) We raced with free. In this case free won't free anything because csum_bytes
is modified before we dro the lock in our reservation path, so free rightly
doesn't release any space because the reservation code may be depending on that
reservation. However if we fail, we need the reservation side to do the free at
that point since that space is no longer in use. So as it stands the code was
doing this fine and it worked out, except in case #2
2) We don't race with free. Nobody comes in and changes anything, and our
reservation fails. In this case we didn't reserve anything anyway and we just
need to clean up csum_bytes but not free anything. So we keep track of
csum_bytes before we drop the lock and if it hasn't changed we know we can just
decrement csum_bytes and carry on.
Because of the case where we can race with free()'s since we have to drop our
spin_lock to do the reservation, I'm going to serialize all reservations with
the i_mutex. We already get this for free in the heavy use paths, truncate and
file write all hold the i_mutex, just needed to add it to page_mkwrite and
various ioctl/balance things. With this patch my space leak scripts no longer
scream bloody murder. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
2011-12-10 00:18:51 +08:00
|
|
|
|
2012-01-10 23:31:31 +08:00
|
|
|
if (to_reserve)
|
|
|
|
trace_btrfs_space_reservation(root->fs_info,"delalloc",
|
|
|
|
btrfs_ino(inode), to_reserve, 1);
|
2010-05-16 22:48:47 +08:00
|
|
|
block_rsv_add_bytes(block_rsv, to_reserve, 1);
|
|
|
|
|
|
|
|
return 0;
|
2013-03-01 19:36:01 +08:00
|
|
|
|
|
|
|
out_fail:
|
|
|
|
spin_lock(&BTRFS_I(inode)->lock);
|
|
|
|
dropped = drop_outstanding_extent(inode);
|
|
|
|
/*
|
|
|
|
* If the inodes csum_bytes is the same as the original
|
|
|
|
* csum_bytes then we know we haven't raced with any free()ers
|
|
|
|
* so we can just reduce our inodes csum bytes and carry on.
|
|
|
|
*/
|
2013-03-26 04:03:35 +08:00
|
|
|
if (BTRFS_I(inode)->csum_bytes == csum_bytes) {
|
2013-03-01 19:36:01 +08:00
|
|
|
calc_csum_metadata_size(inode, num_bytes, 0);
|
2013-03-26 04:03:35 +08:00
|
|
|
} else {
|
|
|
|
u64 orig_csum_bytes = BTRFS_I(inode)->csum_bytes;
|
|
|
|
u64 bytes;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This is tricky, but first we need to figure out how much we
|
|
|
|
* free'd from any free-ers that occured during this
|
|
|
|
* reservation, so we reset ->csum_bytes to the csum_bytes
|
|
|
|
* before we dropped our lock, and then call the free for the
|
|
|
|
* number of bytes that were freed while we were trying our
|
|
|
|
* reservation.
|
|
|
|
*/
|
|
|
|
bytes = csum_bytes - BTRFS_I(inode)->csum_bytes;
|
|
|
|
BTRFS_I(inode)->csum_bytes = csum_bytes;
|
|
|
|
to_free = calc_csum_metadata_size(inode, bytes, 0);
|
|
|
|
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Now we need to see how much we would have freed had we not
|
|
|
|
* been making this reservation and our ->csum_bytes were not
|
|
|
|
* artificially inflated.
|
|
|
|
*/
|
|
|
|
BTRFS_I(inode)->csum_bytes = csum_bytes - num_bytes;
|
|
|
|
bytes = csum_bytes - orig_csum_bytes;
|
|
|
|
bytes = calc_csum_metadata_size(inode, bytes, 0);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Now reset ->csum_bytes to what it should be. If bytes is
|
|
|
|
* more than to_free then we would have free'd more space had we
|
|
|
|
* not had an artificially high ->csum_bytes, so we need to free
|
|
|
|
* the remainder. If bytes is the same or less then we don't
|
|
|
|
* need to do anything, the other free-ers did the correct
|
|
|
|
* thing.
|
|
|
|
*/
|
|
|
|
BTRFS_I(inode)->csum_bytes = orig_csum_bytes - num_bytes;
|
|
|
|
if (bytes > to_free)
|
|
|
|
to_free = bytes - to_free;
|
|
|
|
else
|
|
|
|
to_free = 0;
|
|
|
|
}
|
2013-03-01 19:36:01 +08:00
|
|
|
spin_unlock(&BTRFS_I(inode)->lock);
|
|
|
|
if (dropped)
|
|
|
|
to_free += btrfs_calc_trans_metadata_size(root, dropped);
|
|
|
|
|
|
|
|
if (to_free) {
|
|
|
|
btrfs_block_rsv_release(root, block_rsv, to_free);
|
|
|
|
trace_btrfs_space_reservation(root->fs_info, "delalloc",
|
|
|
|
btrfs_ino(inode), to_free, 0);
|
|
|
|
}
|
|
|
|
if (delalloc_lock)
|
|
|
|
mutex_unlock(&BTRFS_I(inode)->delalloc_mutex);
|
|
|
|
return ret;
|
2010-05-16 22:48:47 +08:00
|
|
|
}
|
|
|
|
|
2011-08-04 22:25:02 +08:00
|
|
|
/**
|
|
|
|
* btrfs_delalloc_release_metadata - release a metadata reservation for an inode
|
|
|
|
* @inode: the inode to release the reservation for
|
|
|
|
* @num_bytes: the number of bytes we're releasing
|
|
|
|
*
|
|
|
|
* This will release the metadata reservation for an inode. This can be called
|
|
|
|
* once we complete IO for a given set of bytes to release their metadata
|
|
|
|
* reservations.
|
|
|
|
*/
|
2010-05-16 22:48:47 +08:00
|
|
|
void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes)
|
|
|
|
{
|
|
|
|
struct btrfs_root *root = BTRFS_I(inode)->root;
|
2011-07-15 23:16:44 +08:00
|
|
|
u64 to_free = 0;
|
|
|
|
unsigned dropped;
|
2010-05-16 22:48:47 +08:00
|
|
|
|
|
|
|
num_bytes = ALIGN(num_bytes, root->sectorsize);
|
2011-08-04 22:25:02 +08:00
|
|
|
spin_lock(&BTRFS_I(inode)->lock);
|
2011-07-15 23:16:44 +08:00
|
|
|
dropped = drop_outstanding_extent(inode);
|
2009-04-22 05:40:57 +08:00
|
|
|
|
2013-02-07 18:12:07 +08:00
|
|
|
if (num_bytes)
|
|
|
|
to_free = calc_csum_metadata_size(inode, num_bytes, 0);
|
2011-08-04 22:25:02 +08:00
|
|
|
spin_unlock(&BTRFS_I(inode)->lock);
|
2011-07-15 23:16:44 +08:00
|
|
|
if (dropped > 0)
|
|
|
|
to_free += btrfs_calc_trans_metadata_size(root, dropped);
|
2010-05-16 22:48:47 +08:00
|
|
|
|
2012-01-10 23:31:31 +08:00
|
|
|
trace_btrfs_space_reservation(root->fs_info, "delalloc",
|
|
|
|
btrfs_ino(inode), to_free, 0);
|
2011-09-14 21:44:05 +08:00
|
|
|
if (root->fs_info->quota_enabled) {
|
|
|
|
btrfs_qgroup_free(root, num_bytes +
|
|
|
|
dropped * root->leafsize);
|
|
|
|
}
|
|
|
|
|
2010-05-16 22:48:47 +08:00
|
|
|
btrfs_block_rsv_release(root, &root->fs_info->delalloc_block_rsv,
|
|
|
|
to_free);
|
|
|
|
}
|
|
|
|
|
2011-08-04 22:25:02 +08:00
|
|
|
/**
|
|
|
|
* btrfs_delalloc_reserve_space - reserve data and metadata space for delalloc
|
|
|
|
* @inode: inode we're writing to
|
|
|
|
* @num_bytes: the number of bytes we want to allocate
|
|
|
|
*
|
|
|
|
* This will do the following things
|
|
|
|
*
|
|
|
|
* o reserve space in the data space info for num_bytes
|
|
|
|
* o reserve space in the metadata space info based on number of outstanding
|
|
|
|
* extents and how much csums will be needed
|
|
|
|
* o add to the inodes ->delalloc_bytes
|
|
|
|
* o add it to the fs_info's delalloc inodes list.
|
|
|
|
*
|
|
|
|
* This will return 0 for success and -ENOSPC if there is no space left.
|
|
|
|
*/
|
2010-05-16 22:48:47 +08:00
|
|
|
int btrfs_delalloc_reserve_space(struct inode *inode, u64 num_bytes)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
ret = btrfs_check_data_free_space(inode, num_bytes);
|
2009-01-06 10:25:51 +08:00
|
|
|
if (ret)
|
2010-05-16 22:48:47 +08:00
|
|
|
return ret;
|
|
|
|
|
|
|
|
ret = btrfs_delalloc_reserve_metadata(inode, num_bytes);
|
|
|
|
if (ret) {
|
|
|
|
btrfs_free_reserved_data_space(inode, num_bytes);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2011-08-04 22:25:02 +08:00
|
|
|
/**
|
|
|
|
* btrfs_delalloc_release_space - release data and metadata space for delalloc
|
|
|
|
* @inode: inode we're releasing space for
|
|
|
|
* @num_bytes: the number of bytes we want to free up
|
|
|
|
*
|
|
|
|
* This must be matched with a call to btrfs_delalloc_reserve_space. This is
|
|
|
|
* called in the case that we don't need the metadata AND data reservations
|
|
|
|
* anymore. So if there is an error or we insert an inline extent.
|
|
|
|
*
|
|
|
|
* This function will release the metadata space that was not used and will
|
|
|
|
* decrement ->delalloc_bytes and remove it from the fs_info delalloc_inodes
|
|
|
|
* list if there are no delalloc bytes left.
|
|
|
|
*/
|
2010-05-16 22:48:47 +08:00
|
|
|
void btrfs_delalloc_release_space(struct inode *inode, u64 num_bytes)
|
|
|
|
{
|
|
|
|
btrfs_delalloc_release_metadata(inode, num_bytes);
|
|
|
|
btrfs_free_reserved_data_space(inode, num_bytes);
|
2008-03-25 03:01:59 +08:00
|
|
|
}
|
|
|
|
|
2012-12-27 17:01:19 +08:00
|
|
|
static int update_block_group(struct btrfs_root *root,
|
2010-05-16 22:46:25 +08:00
|
|
|
u64 bytenr, u64 num_bytes, int alloc)
|
2007-04-27 04:46:15 +08:00
|
|
|
{
|
2010-06-22 02:48:16 +08:00
|
|
|
struct btrfs_block_group_cache *cache = NULL;
|
2007-04-27 04:46:15 +08:00
|
|
|
struct btrfs_fs_info *info = root->fs_info;
|
2007-10-16 04:15:53 +08:00
|
|
|
u64 total = num_bytes;
|
2007-04-27 04:46:15 +08:00
|
|
|
u64 old_val;
|
2007-10-16 04:15:53 +08:00
|
|
|
u64 byte_in_group;
|
2010-06-22 02:48:16 +08:00
|
|
|
int factor;
|
2007-05-08 08:03:49 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
/* block accounting for super block */
|
|
|
|
spin_lock(&info->delalloc_lock);
|
2011-04-13 21:41:04 +08:00
|
|
|
old_val = btrfs_super_bytes_used(info->super_copy);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (alloc)
|
|
|
|
old_val += num_bytes;
|
|
|
|
else
|
|
|
|
old_val -= num_bytes;
|
2011-04-13 21:41:04 +08:00
|
|
|
btrfs_set_super_bytes_used(info->super_copy, old_val);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
spin_unlock(&info->delalloc_lock);
|
|
|
|
|
2009-01-06 10:25:51 +08:00
|
|
|
while (total) {
|
2007-10-16 04:15:53 +08:00
|
|
|
cache = btrfs_lookup_block_group(info, bytenr);
|
2008-11-13 03:19:50 +08:00
|
|
|
if (!cache)
|
2012-03-12 23:03:00 +08:00
|
|
|
return -ENOENT;
|
2010-05-16 22:46:24 +08:00
|
|
|
if (cache->flags & (BTRFS_BLOCK_GROUP_DUP |
|
|
|
|
BTRFS_BLOCK_GROUP_RAID1 |
|
|
|
|
BTRFS_BLOCK_GROUP_RAID10))
|
|
|
|
factor = 2;
|
|
|
|
else
|
|
|
|
factor = 1;
|
2010-08-26 04:54:15 +08:00
|
|
|
/*
|
|
|
|
* If this block group has free space cache written out, we
|
|
|
|
* need to make sure to load it if we are removing space. This
|
|
|
|
* is because we need the unpinning stage to actually add the
|
|
|
|
* space back to the block group, otherwise we will leak space.
|
|
|
|
*/
|
|
|
|
if (!alloc && cache->cached == BTRFS_CACHE_NO)
|
2012-12-27 17:01:18 +08:00
|
|
|
cache_block_group(cache, 1);
|
2010-06-22 02:48:16 +08:00
|
|
|
|
2007-10-16 04:15:53 +08:00
|
|
|
byte_in_group = bytenr - cache->key.objectid;
|
|
|
|
WARN_ON(byte_in_group > cache->key.offset);
|
2007-04-27 04:46:15 +08:00
|
|
|
|
Btrfs: nuke fs wide allocation mutex V2
This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch
of little locks.
There is now a pinned_mutex, which is used when messing with the pinned_extents
extent io tree, and the extent_ins_mutex which is used with the pending_del and
extent_ins extent io trees.
The locking for the extent tree stuff was inspired by a patch that Yan Zheng
wrote to fix a race condition, I cleaned it up some and changed the locking
around a little bit, but the idea remains the same. Basically instead of
holding the extent_ins_mutex throughout the processing of an extent on the
extent_ins or pending_del trees, we just hold it while we're searching and when
we clear the bits on those trees, and lock the extent for the duration of the
operations on the extent.
Also to keep from getting hung up waiting to lock an extent, I've added a
try_lock_extent so if we cannot lock the extent, move on to the next one in the
tree and we'll come back to that one. I have tested this heavily and it does
not appear to break anything. This has to be applied on top of my
find_free_extent redo patch.
I tested this patch on top of Yan's space reblancing code and it worked fine.
The only thing that has changed since the last version is I pulled out all my
debugging stuff, apparently I forgot to run guilt refresh before I sent the
last patch out. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
spin_lock(&cache->space_info->lock);
|
2008-07-23 11:06:41 +08:00
|
|
|
spin_lock(&cache->lock);
|
2010-06-22 02:48:16 +08:00
|
|
|
|
2011-10-04 02:07:49 +08:00
|
|
|
if (btrfs_test_opt(root, SPACE_CACHE) &&
|
2010-06-22 02:48:16 +08:00
|
|
|
cache->disk_cache_state < BTRFS_DC_CLEAR)
|
|
|
|
cache->disk_cache_state = BTRFS_DC_CLEAR;
|
|
|
|
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
cache->dirty = 1;
|
2007-04-27 04:46:15 +08:00
|
|
|
old_val = btrfs_block_group_used(&cache->item);
|
2007-10-16 04:15:53 +08:00
|
|
|
num_bytes = min(total, cache->key.offset - byte_in_group);
|
2007-04-27 22:08:34 +08:00
|
|
|
if (alloc) {
|
2007-10-16 04:15:53 +08:00
|
|
|
old_val += num_bytes;
|
2009-09-12 04:11:19 +08:00
|
|
|
btrfs_set_block_group_used(&cache->item, old_val);
|
|
|
|
cache->reserved -= num_bytes;
|
|
|
|
cache->space_info->bytes_reserved -= num_bytes;
|
2010-05-16 22:46:24 +08:00
|
|
|
cache->space_info->bytes_used += num_bytes;
|
|
|
|
cache->space_info->disk_used += num_bytes * factor;
|
2008-07-23 11:06:41 +08:00
|
|
|
spin_unlock(&cache->lock);
|
Btrfs: nuke fs wide allocation mutex V2
This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch
of little locks.
There is now a pinned_mutex, which is used when messing with the pinned_extents
extent io tree, and the extent_ins_mutex which is used with the pending_del and
extent_ins extent io trees.
The locking for the extent tree stuff was inspired by a patch that Yan Zheng
wrote to fix a race condition, I cleaned it up some and changed the locking
around a little bit, but the idea remains the same. Basically instead of
holding the extent_ins_mutex throughout the processing of an extent on the
extent_ins or pending_del trees, we just hold it while we're searching and when
we clear the bits on those trees, and lock the extent for the duration of the
operations on the extent.
Also to keep from getting hung up waiting to lock an extent, I've added a
try_lock_extent so if we cannot lock the extent, move on to the next one in the
tree and we'll come back to that one. I have tested this heavily and it does
not appear to break anything. This has to be applied on top of my
find_free_extent redo patch.
I tested this patch on top of Yan's space reblancing code and it worked fine.
The only thing that has changed since the last version is I pulled out all my
debugging stuff, apparently I forgot to run guilt refresh before I sent the
last patch out. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
spin_unlock(&cache->space_info->lock);
|
2007-04-27 22:08:34 +08:00
|
|
|
} else {
|
2007-10-16 04:15:53 +08:00
|
|
|
old_val -= num_bytes;
|
2008-07-23 11:06:41 +08:00
|
|
|
btrfs_set_block_group_used(&cache->item, old_val);
|
2010-05-16 22:46:25 +08:00
|
|
|
cache->pinned += num_bytes;
|
|
|
|
cache->space_info->bytes_pinned += num_bytes;
|
2008-03-25 03:01:59 +08:00
|
|
|
cache->space_info->bytes_used -= num_bytes;
|
2010-05-16 22:46:24 +08:00
|
|
|
cache->space_info->disk_used -= num_bytes * factor;
|
2008-07-23 11:06:41 +08:00
|
|
|
spin_unlock(&cache->lock);
|
Btrfs: nuke fs wide allocation mutex V2
This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch
of little locks.
There is now a pinned_mutex, which is used when messing with the pinned_extents
extent io tree, and the extent_ins_mutex which is used with the pending_del and
extent_ins extent io trees.
The locking for the extent tree stuff was inspired by a patch that Yan Zheng
wrote to fix a race condition, I cleaned it up some and changed the locking
around a little bit, but the idea remains the same. Basically instead of
holding the extent_ins_mutex throughout the processing of an extent on the
extent_ins or pending_del trees, we just hold it while we're searching and when
we clear the bits on those trees, and lock the extent for the duration of the
operations on the extent.
Also to keep from getting hung up waiting to lock an extent, I've added a
try_lock_extent so if we cannot lock the extent, move on to the next one in the
tree and we'll come back to that one. I have tested this heavily and it does
not appear to break anything. This has to be applied on top of my
find_free_extent redo patch.
I tested this patch on top of Yan's space reblancing code and it worked fine.
The only thing that has changed since the last version is I pulled out all my
debugging stuff, apparently I forgot to run guilt refresh before I sent the
last patch out. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
spin_unlock(&cache->space_info->lock);
|
Btrfs: Fix free block discard calls down to the block layer
This is a patch to fix discard semantic to make Btrfs work with FTL and SSD.
We can improve FTL's performance by telling it which sectors are freed by file
system. But if we don't tell FTL the information of free sectors in proper
time, the transaction mechanism of Btrfs will be destroyed and Btrfs could not
roll back the previous transaction under the power loss condition.
There are some problems in the old implementation:
1, In __free_extent(), the pinned down extents should not be discarded.
2, In free_extents(), the free extents are all pinned, so they need to
be discarded in transaction committing time instead of free_extents().
3, The reserved extent used by log tree should be discard too.
This patch change discard behavior as follows:
1, For the extents which need to be free at once,
we discard them in update_block_group().
2, Delay discarding the pinned extent in btrfs_finish_extent_commit()
when committing transaction.
3, Remove discarding from free_extents() and __free_extent()
4, Add discard interface into btrfs_free_reserved_extent()
5, Discard sectors before updating the free space cache, otherwise,
FTL will destroy file system data.
2009-01-06 04:57:51 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
set_extent_dirty(info->pinned_extents,
|
|
|
|
bytenr, bytenr + num_bytes - 1,
|
|
|
|
GFP_NOFS | __GFP_NOFAIL);
|
2007-04-27 22:08:34 +08:00
|
|
|
}
|
2009-04-03 21:47:43 +08:00
|
|
|
btrfs_put_block_group(cache);
|
2007-10-16 04:15:53 +08:00
|
|
|
total -= num_bytes;
|
|
|
|
bytenr += num_bytes;
|
2007-04-27 04:46:15 +08:00
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
2008-03-25 03:01:59 +08:00
|
|
|
|
2008-05-07 23:43:44 +08:00
|
|
|
static u64 first_logical_byte(struct btrfs_root *root, u64 search_start)
|
|
|
|
{
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
struct btrfs_block_group_cache *cache;
|
2008-12-12 05:30:39 +08:00
|
|
|
u64 bytenr;
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
|
2012-12-27 17:01:23 +08:00
|
|
|
spin_lock(&root->fs_info->block_group_cache_lock);
|
|
|
|
bytenr = root->fs_info->first_logical_byte;
|
|
|
|
spin_unlock(&root->fs_info->block_group_cache_lock);
|
|
|
|
|
|
|
|
if (bytenr < (u64)-1)
|
|
|
|
return bytenr;
|
|
|
|
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
cache = btrfs_lookup_first_block_group(root->fs_info, search_start);
|
|
|
|
if (!cache)
|
2008-05-07 23:43:44 +08:00
|
|
|
return 0;
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
|
2008-12-12 05:30:39 +08:00
|
|
|
bytenr = cache->key.objectid;
|
2009-04-03 21:47:43 +08:00
|
|
|
btrfs_put_block_group(cache);
|
2008-12-12 05:30:39 +08:00
|
|
|
|
|
|
|
return bytenr;
|
2008-05-07 23:43:44 +08:00
|
|
|
}
|
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
static int pin_down_extent(struct btrfs_root *root,
|
|
|
|
struct btrfs_block_group_cache *cache,
|
|
|
|
u64 bytenr, u64 num_bytes, int reserved)
|
2007-11-17 03:57:08 +08:00
|
|
|
{
|
2009-09-12 04:11:19 +08:00
|
|
|
spin_lock(&cache->space_info->lock);
|
|
|
|
spin_lock(&cache->lock);
|
|
|
|
cache->pinned += num_bytes;
|
|
|
|
cache->space_info->bytes_pinned += num_bytes;
|
|
|
|
if (reserved) {
|
|
|
|
cache->reserved -= num_bytes;
|
|
|
|
cache->space_info->bytes_reserved -= num_bytes;
|
|
|
|
}
|
|
|
|
spin_unlock(&cache->lock);
|
|
|
|
spin_unlock(&cache->space_info->lock);
|
Btrfs: change how we unpin extents
We are racy with async block caching and unpinning extents. This patch makes
things much less complicated by only unpinning the extent if the block group is
cached. We check the block_group->cached var under the block_group->lock spin
lock. If it is set to BTRFS_CACHE_FINISHED then we update the pinned counters,
and unpin the extent and add the free space back. If it is not set to this, we
start the caching of the block group so the next time we unpin extents we can
unpin the extent. This keeps us from racing with the async caching threads,
lets us kill the fs wide async thread counter, and keeps us from having to set
DELALLOC bits for every extent we hit if there are caching kthreads going.
One thing that needed to be changed was btrfs_free_super_mirror_extents. Now
instead of just looking for LOCKED extents, we also look for DIRTY extents,
since we could have left some extents pinned in the previous transaction that
will never get freed now that we are unmounting, which would cause us to leak
memory. So btrfs_free_super_mirror_extents has been changed to
btrfs_free_pinned_extents, and it will clear the extents locked for the super
mirror, and any remaining pinned extents that may be present. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-28 01:57:01 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
set_extent_dirty(root->fs_info->pinned_extents, bytenr,
|
|
|
|
bytenr + num_bytes - 1, GFP_NOFS | __GFP_NOFAIL);
|
|
|
|
return 0;
|
|
|
|
}
|
Btrfs: change how we unpin extents
We are racy with async block caching and unpinning extents. This patch makes
things much less complicated by only unpinning the extent if the block group is
cached. We check the block_group->cached var under the block_group->lock spin
lock. If it is set to BTRFS_CACHE_FINISHED then we update the pinned counters,
and unpin the extent and add the free space back. If it is not set to this, we
start the caching of the block group so the next time we unpin extents we can
unpin the extent. This keeps us from racing with the async caching threads,
lets us kill the fs wide async thread counter, and keeps us from having to set
DELALLOC bits for every extent we hit if there are caching kthreads going.
One thing that needed to be changed was btrfs_free_super_mirror_extents. Now
instead of just looking for LOCKED extents, we also look for DIRTY extents,
since we could have left some extents pinned in the previous transaction that
will never get freed now that we are unmounting, which would cause us to leak
memory. So btrfs_free_super_mirror_extents has been changed to
btrfs_free_pinned_extents, and it will clear the extents locked for the super
mirror, and any remaining pinned extents that may be present. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-28 01:57:01 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
/*
|
|
|
|
* this function must be called within transaction
|
|
|
|
*/
|
|
|
|
int btrfs_pin_extent(struct btrfs_root *root,
|
|
|
|
u64 bytenr, u64 num_bytes, int reserved)
|
|
|
|
{
|
|
|
|
struct btrfs_block_group_cache *cache;
|
Btrfs: change how we unpin extents
We are racy with async block caching and unpinning extents. This patch makes
things much less complicated by only unpinning the extent if the block group is
cached. We check the block_group->cached var under the block_group->lock spin
lock. If it is set to BTRFS_CACHE_FINISHED then we update the pinned counters,
and unpin the extent and add the free space back. If it is not set to this, we
start the caching of the block group so the next time we unpin extents we can
unpin the extent. This keeps us from racing with the async caching threads,
lets us kill the fs wide async thread counter, and keeps us from having to set
DELALLOC bits for every extent we hit if there are caching kthreads going.
One thing that needed to be changed was btrfs_free_super_mirror_extents. Now
instead of just looking for LOCKED extents, we also look for DIRTY extents,
since we could have left some extents pinned in the previous transaction that
will never get freed now that we are unmounting, which would cause us to leak
memory. So btrfs_free_super_mirror_extents has been changed to
btrfs_free_pinned_extents, and it will clear the extents locked for the super
mirror, and any remaining pinned extents that may be present. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-28 01:57:01 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
cache = btrfs_lookup_block_group(root->fs_info, bytenr);
|
2012-03-12 23:03:00 +08:00
|
|
|
BUG_ON(!cache); /* Logic error */
|
2010-05-16 22:46:25 +08:00
|
|
|
|
|
|
|
pin_down_extent(root, cache, bytenr, num_bytes, reserved);
|
|
|
|
|
|
|
|
btrfs_put_block_group(cache);
|
2009-09-12 04:11:19 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
/*
|
2011-11-01 08:52:39 +08:00
|
|
|
* this function must be called within transaction
|
|
|
|
*/
|
2012-12-27 17:01:20 +08:00
|
|
|
int btrfs_pin_extent_for_log_replay(struct btrfs_root *root,
|
2011-11-01 08:52:39 +08:00
|
|
|
u64 bytenr, u64 num_bytes)
|
|
|
|
{
|
|
|
|
struct btrfs_block_group_cache *cache;
|
|
|
|
|
|
|
|
cache = btrfs_lookup_block_group(root->fs_info, bytenr);
|
2012-03-12 23:03:00 +08:00
|
|
|
BUG_ON(!cache); /* Logic error */
|
2011-11-01 08:52:39 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* pull in the free space cache (if any) so that our pin
|
|
|
|
* removes the free space from the cache. We have load_only set
|
|
|
|
* to one because the slow code to read in the free extents does check
|
|
|
|
* the pinned extents.
|
|
|
|
*/
|
2012-12-27 17:01:18 +08:00
|
|
|
cache_block_group(cache, 1);
|
2011-11-01 08:52:39 +08:00
|
|
|
|
|
|
|
pin_down_extent(root, cache, bytenr, num_bytes, 0);
|
|
|
|
|
|
|
|
/* remove us from the free space cache (if we're there at all) */
|
|
|
|
btrfs_remove_free_space(cache, bytenr, num_bytes);
|
|
|
|
btrfs_put_block_group(cache);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2011-07-27 05:00:46 +08:00
|
|
|
/**
|
|
|
|
* btrfs_update_reserved_bytes - update the block_group and space info counters
|
|
|
|
* @cache: The cache we are manipulating
|
|
|
|
* @num_bytes: The number of bytes in question
|
|
|
|
* @reserve: One of the reservation enums
|
|
|
|
*
|
|
|
|
* This is called by the allocator when it reserves space, or by somebody who is
|
|
|
|
* freeing space that was never actually used on disk. For example if you
|
|
|
|
* reserve some space for a new leaf in transaction A and before transaction A
|
|
|
|
* commits you free that leaf, you call this with reserve set to 0 in order to
|
|
|
|
* clear the reservation.
|
|
|
|
*
|
|
|
|
* Metadata reservations should be called with RESERVE_ALLOC so we do the proper
|
|
|
|
* ENOSPC accounting. For data we handle the reservation through clearing the
|
|
|
|
* delalloc bits in the io_tree. We have to do this since we could end up
|
|
|
|
* allocating less disk space for the amount of data we have reserved in the
|
|
|
|
* case of compression.
|
|
|
|
*
|
|
|
|
* If this is a reservation and the block group has become read only we cannot
|
|
|
|
* make the reservation and return -EAGAIN, otherwise this function always
|
|
|
|
* succeeds.
|
2010-05-16 22:46:25 +08:00
|
|
|
*/
|
2011-07-27 05:00:46 +08:00
|
|
|
static int btrfs_update_reserved_bytes(struct btrfs_block_group_cache *cache,
|
|
|
|
u64 num_bytes, int reserve)
|
2009-09-12 04:11:19 +08:00
|
|
|
{
|
2011-07-27 05:00:46 +08:00
|
|
|
struct btrfs_space_info *space_info = cache->space_info;
|
2010-05-16 22:46:25 +08:00
|
|
|
int ret = 0;
|
2012-03-12 23:03:00 +08:00
|
|
|
|
2011-07-27 05:00:46 +08:00
|
|
|
spin_lock(&space_info->lock);
|
|
|
|
spin_lock(&cache->lock);
|
|
|
|
if (reserve != RESERVE_FREE) {
|
2010-05-16 22:46:25 +08:00
|
|
|
if (cache->ro) {
|
|
|
|
ret = -EAGAIN;
|
|
|
|
} else {
|
2011-07-27 05:00:46 +08:00
|
|
|
cache->reserved += num_bytes;
|
|
|
|
space_info->bytes_reserved += num_bytes;
|
|
|
|
if (reserve == RESERVE_ALLOC) {
|
2012-01-10 23:31:31 +08:00
|
|
|
trace_btrfs_space_reservation(cache->fs_info,
|
2012-03-29 21:57:44 +08:00
|
|
|
"space_info", space_info->flags,
|
|
|
|
num_bytes, 0);
|
2011-07-27 05:00:46 +08:00
|
|
|
space_info->bytes_may_use -= num_bytes;
|
|
|
|
}
|
2010-05-16 22:46:25 +08:00
|
|
|
}
|
2011-07-27 05:00:46 +08:00
|
|
|
} else {
|
|
|
|
if (cache->ro)
|
|
|
|
space_info->bytes_readonly += num_bytes;
|
|
|
|
cache->reserved -= num_bytes;
|
|
|
|
space_info->bytes_reserved -= num_bytes;
|
|
|
|
space_info->reservation_progress++;
|
2007-11-17 03:57:08 +08:00
|
|
|
}
|
2011-07-27 05:00:46 +08:00
|
|
|
spin_unlock(&cache->lock);
|
|
|
|
spin_unlock(&space_info->lock);
|
2010-05-16 22:46:25 +08:00
|
|
|
return ret;
|
2007-11-17 03:57:08 +08:00
|
|
|
}
|
2007-04-27 04:46:15 +08:00
|
|
|
|
2012-03-01 21:56:26 +08:00
|
|
|
void btrfs_prepare_extent_commit(struct btrfs_trans_handle *trans,
|
2009-09-12 04:11:19 +08:00
|
|
|
struct btrfs_root *root)
|
2008-09-26 22:05:48 +08:00
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = root->fs_info;
|
2009-09-12 04:11:19 +08:00
|
|
|
struct btrfs_caching_control *next;
|
|
|
|
struct btrfs_caching_control *caching_ctl;
|
|
|
|
struct btrfs_block_group_cache *cache;
|
2008-09-26 22:05:48 +08:00
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
down_write(&fs_info->extent_commit_sem);
|
Btrfs: nuke fs wide allocation mutex V2
This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch
of little locks.
There is now a pinned_mutex, which is used when messing with the pinned_extents
extent io tree, and the extent_ins_mutex which is used with the pending_del and
extent_ins extent io trees.
The locking for the extent tree stuff was inspired by a patch that Yan Zheng
wrote to fix a race condition, I cleaned it up some and changed the locking
around a little bit, but the idea remains the same. Basically instead of
holding the extent_ins_mutex throughout the processing of an extent on the
extent_ins or pending_del trees, we just hold it while we're searching and when
we clear the bits on those trees, and lock the extent for the duration of the
operations on the extent.
Also to keep from getting hung up waiting to lock an extent, I've added a
try_lock_extent so if we cannot lock the extent, move on to the next one in the
tree and we'll come back to that one. I have tested this heavily and it does
not appear to break anything. This has to be applied on top of my
find_free_extent redo patch.
I tested this patch on top of Yan's space reblancing code and it worked fine.
The only thing that has changed since the last version is I pulled out all my
debugging stuff, apparently I forgot to run guilt refresh before I sent the
last patch out. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
list_for_each_entry_safe(caching_ctl, next,
|
|
|
|
&fs_info->caching_block_groups, list) {
|
|
|
|
cache = caching_ctl->block_group;
|
|
|
|
if (block_group_cache_done(cache)) {
|
|
|
|
cache->last_byte_to_unpin = (u64)-1;
|
|
|
|
list_del_init(&caching_ctl->list);
|
|
|
|
put_caching_control(caching_ctl);
|
2008-09-26 22:05:48 +08:00
|
|
|
} else {
|
2009-09-12 04:11:19 +08:00
|
|
|
cache->last_byte_to_unpin = caching_ctl->progress;
|
2008-09-26 22:05:48 +08:00
|
|
|
}
|
|
|
|
}
|
2009-09-12 04:11:19 +08:00
|
|
|
|
|
|
|
if (fs_info->pinned_extents == &fs_info->freed_extents[0])
|
|
|
|
fs_info->pinned_extents = &fs_info->freed_extents[1];
|
|
|
|
else
|
|
|
|
fs_info->pinned_extents = &fs_info->freed_extents[0];
|
|
|
|
|
|
|
|
up_write(&fs_info->extent_commit_sem);
|
2010-05-16 22:49:58 +08:00
|
|
|
|
|
|
|
update_global_block_rsv(fs_info);
|
2008-09-26 22:05:48 +08:00
|
|
|
}
|
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
static int unpin_extent_range(struct btrfs_root *root, u64 start, u64 end)
|
2007-06-29 03:57:36 +08:00
|
|
|
{
|
2009-09-12 04:11:19 +08:00
|
|
|
struct btrfs_fs_info *fs_info = root->fs_info;
|
|
|
|
struct btrfs_block_group_cache *cache = NULL;
|
2012-10-23 03:52:28 +08:00
|
|
|
struct btrfs_space_info *space_info;
|
|
|
|
struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
|
2009-09-12 04:11:19 +08:00
|
|
|
u64 len;
|
2012-10-23 03:52:28 +08:00
|
|
|
bool readonly;
|
2007-06-29 03:57:36 +08:00
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
while (start <= end) {
|
2012-10-23 03:52:28 +08:00
|
|
|
readonly = false;
|
2009-09-12 04:11:19 +08:00
|
|
|
if (!cache ||
|
|
|
|
start >= cache->key.objectid + cache->key.offset) {
|
|
|
|
if (cache)
|
|
|
|
btrfs_put_block_group(cache);
|
|
|
|
cache = btrfs_lookup_block_group(fs_info, start);
|
2012-03-12 23:03:00 +08:00
|
|
|
BUG_ON(!cache); /* Logic error */
|
2009-09-12 04:11:19 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
len = cache->key.objectid + cache->key.offset - start;
|
|
|
|
len = min(len, end + 1 - start);
|
|
|
|
|
|
|
|
if (start < cache->last_byte_to_unpin) {
|
|
|
|
len = min(len, cache->last_byte_to_unpin - start);
|
|
|
|
btrfs_add_free_space(cache, start, len);
|
|
|
|
}
|
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
start += len;
|
2012-10-23 03:52:28 +08:00
|
|
|
space_info = cache->space_info;
|
2010-05-16 22:46:25 +08:00
|
|
|
|
2012-10-23 03:52:28 +08:00
|
|
|
spin_lock(&space_info->lock);
|
2009-09-12 04:11:19 +08:00
|
|
|
spin_lock(&cache->lock);
|
|
|
|
cache->pinned -= len;
|
2012-10-23 03:52:28 +08:00
|
|
|
space_info->bytes_pinned -= len;
|
|
|
|
if (cache->ro) {
|
|
|
|
space_info->bytes_readonly += len;
|
|
|
|
readonly = true;
|
|
|
|
}
|
2009-09-12 04:11:19 +08:00
|
|
|
spin_unlock(&cache->lock);
|
2012-10-23 03:52:28 +08:00
|
|
|
if (!readonly && global_rsv->space_info == space_info) {
|
|
|
|
spin_lock(&global_rsv->lock);
|
|
|
|
if (!global_rsv->full) {
|
|
|
|
len = min(len, global_rsv->size -
|
|
|
|
global_rsv->reserved);
|
|
|
|
global_rsv->reserved += len;
|
|
|
|
space_info->bytes_may_use += len;
|
|
|
|
if (global_rsv->reserved >= global_rsv->size)
|
|
|
|
global_rsv->full = 1;
|
|
|
|
}
|
|
|
|
spin_unlock(&global_rsv->lock);
|
|
|
|
}
|
|
|
|
spin_unlock(&space_info->lock);
|
2007-06-29 03:57:36 +08:00
|
|
|
}
|
2009-09-12 04:11:19 +08:00
|
|
|
|
|
|
|
if (cache)
|
|
|
|
btrfs_put_block_group(cache);
|
2007-06-29 03:57:36 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans,
|
2009-09-12 04:11:19 +08:00
|
|
|
struct btrfs_root *root)
|
2007-03-07 09:08:01 +08:00
|
|
|
{
|
2009-09-12 04:11:19 +08:00
|
|
|
struct btrfs_fs_info *fs_info = root->fs_info;
|
|
|
|
struct extent_io_tree *unpin;
|
2007-10-16 04:15:26 +08:00
|
|
|
u64 start;
|
|
|
|
u64 end;
|
2007-03-07 09:08:01 +08:00
|
|
|
int ret;
|
|
|
|
|
2012-03-12 23:03:00 +08:00
|
|
|
if (trans->aborted)
|
|
|
|
return 0;
|
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
if (fs_info->pinned_extents == &fs_info->freed_extents[0])
|
|
|
|
unpin = &fs_info->freed_extents[1];
|
|
|
|
else
|
|
|
|
unpin = &fs_info->freed_extents[0];
|
|
|
|
|
2009-01-06 10:25:51 +08:00
|
|
|
while (1) {
|
2007-10-16 04:15:26 +08:00
|
|
|
ret = find_first_extent_bit(unpin, 0, &start, &end,
|
2012-09-28 05:07:30 +08:00
|
|
|
EXTENT_DIRTY, NULL);
|
2007-10-16 04:15:26 +08:00
|
|
|
if (ret)
|
2007-03-07 09:08:01 +08:00
|
|
|
break;
|
Btrfs: Fix free block discard calls down to the block layer
This is a patch to fix discard semantic to make Btrfs work with FTL and SSD.
We can improve FTL's performance by telling it which sectors are freed by file
system. But if we don't tell FTL the information of free sectors in proper
time, the transaction mechanism of Btrfs will be destroyed and Btrfs could not
roll back the previous transaction under the power loss condition.
There are some problems in the old implementation:
1, In __free_extent(), the pinned down extents should not be discarded.
2, In free_extents(), the free extents are all pinned, so they need to
be discarded in transaction committing time instead of free_extents().
3, The reserved extent used by log tree should be discard too.
This patch change discard behavior as follows:
1, For the extents which need to be free at once,
we discard them in update_block_group().
2, Delay discarding the pinned extent in btrfs_finish_extent_commit()
when committing transaction.
3, Remove discarding from free_extents() and __free_extent()
4, Add discard interface into btrfs_free_reserved_extent()
5, Discard sectors before updating the free space cache, otherwise,
FTL will destroy file system data.
2009-01-06 04:57:51 +08:00
|
|
|
|
2011-03-24 18:24:27 +08:00
|
|
|
if (btrfs_test_opt(root, DISCARD))
|
|
|
|
ret = btrfs_discard_extent(root, start,
|
|
|
|
end + 1 - start, NULL);
|
Btrfs: Fix free block discard calls down to the block layer
This is a patch to fix discard semantic to make Btrfs work with FTL and SSD.
We can improve FTL's performance by telling it which sectors are freed by file
system. But if we don't tell FTL the information of free sectors in proper
time, the transaction mechanism of Btrfs will be destroyed and Btrfs could not
roll back the previous transaction under the power loss condition.
There are some problems in the old implementation:
1, In __free_extent(), the pinned down extents should not be discarded.
2, In free_extents(), the free extents are all pinned, so they need to
be discarded in transaction committing time instead of free_extents().
3, The reserved extent used by log tree should be discard too.
This patch change discard behavior as follows:
1, For the extents which need to be free at once,
we discard them in update_block_group().
2, Delay discarding the pinned extent in btrfs_finish_extent_commit()
when committing transaction.
3, Remove discarding from free_extents() and __free_extent()
4, Add discard interface into btrfs_free_reserved_extent()
5, Discard sectors before updating the free space cache, otherwise,
FTL will destroy file system data.
2009-01-06 04:57:51 +08:00
|
|
|
|
2007-10-16 04:15:26 +08:00
|
|
|
clear_extent_dirty(unpin, start, end, GFP_NOFS);
|
2009-09-12 04:11:19 +08:00
|
|
|
unpin_extent_range(root, start, end);
|
2009-03-13 23:00:37 +08:00
|
|
|
cond_resched();
|
2007-03-07 09:08:01 +08:00
|
|
|
}
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
|
2007-03-23 00:13:20 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
u64 bytenr, u64 num_bytes, u64 parent,
|
|
|
|
u64 root_objectid, u64 owner_objectid,
|
|
|
|
u64 owner_offset, int refs_to_drop,
|
|
|
|
struct btrfs_delayed_extent_op *extent_op)
|
2007-03-07 09:08:01 +08:00
|
|
|
{
|
2007-03-13 04:22:34 +08:00
|
|
|
struct btrfs_key key;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
struct btrfs_path *path;
|
2007-03-21 08:35:03 +08:00
|
|
|
struct btrfs_fs_info *info = root->fs_info;
|
|
|
|
struct btrfs_root *extent_root = info->extent_root;
|
2007-10-16 04:14:19 +08:00
|
|
|
struct extent_buffer *leaf;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
struct btrfs_extent_item *ei;
|
|
|
|
struct btrfs_extent_inline_ref *iref;
|
2007-03-07 09:08:01 +08:00
|
|
|
int ret;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
int is_data;
|
2008-02-19 05:33:44 +08:00
|
|
|
int extent_slot = 0;
|
|
|
|
int found_extent = 0;
|
|
|
|
int num_to_del = 1;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
u32 item_size;
|
|
|
|
u64 refs;
|
2007-03-08 00:50:24 +08:00
|
|
|
|
2007-04-02 23:20:42 +08:00
|
|
|
path = btrfs_alloc_path();
|
2007-06-23 02:16:25 +08:00
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
2007-04-05 22:38:44 +08:00
|
|
|
|
2008-04-22 00:01:38 +08:00
|
|
|
path->reada = 1;
|
2009-03-13 23:00:37 +08:00
|
|
|
path->leave_spinning = 1;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
|
|
|
|
is_data = owner_objectid >= BTRFS_FIRST_FREE_OBJECTID;
|
|
|
|
BUG_ON(!is_data && refs_to_drop != 1);
|
|
|
|
|
|
|
|
ret = lookup_extent_backref(trans, extent_root, path, &iref,
|
|
|
|
bytenr, num_bytes, parent,
|
|
|
|
root_objectid, owner_objectid,
|
|
|
|
owner_offset);
|
2007-12-11 22:25:06 +08:00
|
|
|
if (ret == 0) {
|
2008-02-19 05:33:44 +08:00
|
|
|
extent_slot = path->slots[0];
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
while (extent_slot >= 0) {
|
|
|
|
btrfs_item_key_to_cpu(path->nodes[0], &key,
|
2008-02-19 05:33:44 +08:00
|
|
|
extent_slot);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (key.objectid != bytenr)
|
2008-02-19 05:33:44 +08:00
|
|
|
break;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (key.type == BTRFS_EXTENT_ITEM_KEY &&
|
|
|
|
key.offset == num_bytes) {
|
2008-02-19 05:33:44 +08:00
|
|
|
found_extent = 1;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (path->slots[0] - extent_slot > 5)
|
|
|
|
break;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
extent_slot--;
|
2008-02-19 05:33:44 +08:00
|
|
|
}
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
#ifdef BTRFS_COMPAT_EXTENT_TREE_V0
|
|
|
|
item_size = btrfs_item_size_nr(path->nodes[0], extent_slot);
|
|
|
|
if (found_extent && item_size < sizeof(*ei))
|
|
|
|
found_extent = 0;
|
|
|
|
#endif
|
2008-09-24 01:14:14 +08:00
|
|
|
if (!found_extent) {
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
BUG_ON(iref);
|
2009-03-13 22:10:06 +08:00
|
|
|
ret = remove_extent_backref(trans, extent_root, path,
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
NULL, refs_to_drop,
|
|
|
|
is_data);
|
2012-09-18 21:52:32 +08:00
|
|
|
if (ret) {
|
|
|
|
btrfs_abort_transaction(trans, extent_root, ret);
|
|
|
|
goto out;
|
|
|
|
}
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
2009-03-13 23:00:37 +08:00
|
|
|
path->leave_spinning = 1;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
|
|
|
|
key.objectid = bytenr;
|
|
|
|
key.type = BTRFS_EXTENT_ITEM_KEY;
|
|
|
|
key.offset = num_bytes;
|
|
|
|
|
2008-09-24 01:14:14 +08:00
|
|
|
ret = btrfs_search_slot(trans, extent_root,
|
|
|
|
&key, path, -1, 1);
|
2008-11-13 03:19:50 +08:00
|
|
|
if (ret) {
|
|
|
|
printk(KERN_ERR "umm, got %d back from search"
|
2009-01-06 10:25:51 +08:00
|
|
|
", was looking for %llu\n", ret,
|
|
|
|
(unsigned long long)bytenr);
|
2011-07-13 23:03:50 +08:00
|
|
|
if (ret > 0)
|
|
|
|
btrfs_print_leaf(extent_root,
|
|
|
|
path->nodes[0]);
|
2008-11-13 03:19:50 +08:00
|
|
|
}
|
2012-09-18 21:52:32 +08:00
|
|
|
if (ret < 0) {
|
|
|
|
btrfs_abort_transaction(trans, extent_root, ret);
|
|
|
|
goto out;
|
|
|
|
}
|
2008-09-24 01:14:14 +08:00
|
|
|
extent_slot = path->slots[0];
|
|
|
|
}
|
2012-03-12 23:03:00 +08:00
|
|
|
} else if (ret == -ENOENT) {
|
2007-12-11 22:25:06 +08:00
|
|
|
btrfs_print_leaf(extent_root, path->nodes[0]);
|
|
|
|
WARN_ON(1);
|
2009-01-06 10:25:51 +08:00
|
|
|
printk(KERN_ERR "btrfs unable to find ref byte nr %llu "
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
"parent %llu root %llu owner %llu offset %llu\n",
|
2009-01-06 10:25:51 +08:00
|
|
|
(unsigned long long)bytenr,
|
2009-03-13 22:10:06 +08:00
|
|
|
(unsigned long long)parent,
|
2009-01-06 10:25:51 +08:00
|
|
|
(unsigned long long)root_objectid,
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
(unsigned long long)owner_objectid,
|
|
|
|
(unsigned long long)owner_offset);
|
2012-03-12 23:03:00 +08:00
|
|
|
} else {
|
2012-09-18 21:52:32 +08:00
|
|
|
btrfs_abort_transaction(trans, extent_root, ret);
|
|
|
|
goto out;
|
2007-12-11 22:25:06 +08:00
|
|
|
}
|
2007-10-16 04:14:19 +08:00
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
item_size = btrfs_item_size_nr(leaf, extent_slot);
|
|
|
|
#ifdef BTRFS_COMPAT_EXTENT_TREE_V0
|
|
|
|
if (item_size < sizeof(*ei)) {
|
|
|
|
BUG_ON(found_extent || extent_slot != path->slots[0]);
|
|
|
|
ret = convert_extent_item_v0(trans, extent_root, path,
|
|
|
|
owner_objectid, 0);
|
2012-09-18 21:52:32 +08:00
|
|
|
if (ret < 0) {
|
|
|
|
btrfs_abort_transaction(trans, extent_root, ret);
|
|
|
|
goto out;
|
|
|
|
}
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
path->leave_spinning = 1;
|
|
|
|
|
|
|
|
key.objectid = bytenr;
|
|
|
|
key.type = BTRFS_EXTENT_ITEM_KEY;
|
|
|
|
key.offset = num_bytes;
|
|
|
|
|
|
|
|
ret = btrfs_search_slot(trans, extent_root, &key, path,
|
|
|
|
-1, 1);
|
|
|
|
if (ret) {
|
|
|
|
printk(KERN_ERR "umm, got %d back from search"
|
|
|
|
", was looking for %llu\n", ret,
|
|
|
|
(unsigned long long)bytenr);
|
|
|
|
btrfs_print_leaf(extent_root, path->nodes[0]);
|
|
|
|
}
|
2012-09-18 21:52:32 +08:00
|
|
|
if (ret < 0) {
|
|
|
|
btrfs_abort_transaction(trans, extent_root, ret);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
extent_slot = path->slots[0];
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
item_size = btrfs_item_size_nr(leaf, extent_slot);
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
BUG_ON(item_size < sizeof(*ei));
|
2008-02-19 05:33:44 +08:00
|
|
|
ei = btrfs_item_ptr(leaf, extent_slot,
|
2007-03-15 02:14:43 +08:00
|
|
|
struct btrfs_extent_item);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (owner_objectid < BTRFS_FIRST_FREE_OBJECTID) {
|
|
|
|
struct btrfs_tree_block_info *bi;
|
|
|
|
BUG_ON(item_size < sizeof(*ei) + sizeof(*bi));
|
|
|
|
bi = (struct btrfs_tree_block_info *)(ei + 1);
|
|
|
|
WARN_ON(owner_objectid != btrfs_tree_block_level(leaf, bi));
|
|
|
|
}
|
2009-03-13 22:10:06 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
refs = btrfs_extent_refs(leaf, ei);
|
2009-03-13 22:10:06 +08:00
|
|
|
BUG_ON(refs < refs_to_drop);
|
|
|
|
refs -= refs_to_drop;
|
2007-10-16 04:14:19 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (refs > 0) {
|
|
|
|
if (extent_op)
|
|
|
|
__run_delayed_extent_op(extent_op, leaf, ei);
|
|
|
|
/*
|
|
|
|
* In the case of inline back ref, reference count will
|
|
|
|
* be updated by remove_extent_backref
|
2008-02-19 05:33:44 +08:00
|
|
|
*/
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (iref) {
|
|
|
|
BUG_ON(!found_extent);
|
|
|
|
} else {
|
|
|
|
btrfs_set_extent_refs(leaf, ei, refs);
|
|
|
|
btrfs_mark_buffer_dirty(leaf);
|
|
|
|
}
|
|
|
|
if (found_extent) {
|
|
|
|
ret = remove_extent_backref(trans, extent_root, path,
|
|
|
|
iref, refs_to_drop,
|
|
|
|
is_data);
|
2012-09-18 21:52:32 +08:00
|
|
|
if (ret) {
|
|
|
|
btrfs_abort_transaction(trans, extent_root, ret);
|
|
|
|
goto out;
|
|
|
|
}
|
2008-02-19 05:33:44 +08:00
|
|
|
}
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
} else {
|
|
|
|
if (found_extent) {
|
|
|
|
BUG_ON(is_data && refs_to_drop !=
|
|
|
|
extent_data_ref_count(root, path, iref));
|
|
|
|
if (iref) {
|
|
|
|
BUG_ON(path->slots[0] != extent_slot);
|
|
|
|
} else {
|
|
|
|
BUG_ON(path->slots[0] != extent_slot + 1);
|
|
|
|
path->slots[0] = extent_slot;
|
|
|
|
num_to_del = 2;
|
|
|
|
}
|
2007-03-25 23:35:08 +08:00
|
|
|
}
|
2009-03-13 23:00:37 +08:00
|
|
|
|
2008-02-19 05:33:44 +08:00
|
|
|
ret = btrfs_del_items(trans, extent_root, path, path->slots[0],
|
|
|
|
num_to_del);
|
2012-09-18 21:52:32 +08:00
|
|
|
if (ret) {
|
|
|
|
btrfs_abort_transaction(trans, extent_root, ret);
|
|
|
|
goto out;
|
|
|
|
}
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
2008-08-12 21:13:26 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (is_data) {
|
2008-12-10 22:10:46 +08:00
|
|
|
ret = btrfs_del_csums(trans, root, bytenr, num_bytes);
|
2012-09-18 21:52:32 +08:00
|
|
|
if (ret) {
|
|
|
|
btrfs_abort_transaction(trans, extent_root, ret);
|
|
|
|
goto out;
|
|
|
|
}
|
2008-12-10 22:10:46 +08:00
|
|
|
}
|
|
|
|
|
2012-12-27 17:01:19 +08:00
|
|
|
ret = update_block_group(root, bytenr, num_bytes, 0);
|
2012-09-18 21:52:32 +08:00
|
|
|
if (ret) {
|
|
|
|
btrfs_abort_transaction(trans, extent_root, ret);
|
|
|
|
goto out;
|
|
|
|
}
|
2007-03-07 09:08:01 +08:00
|
|
|
}
|
2012-03-12 23:03:00 +08:00
|
|
|
out:
|
2007-04-02 23:20:42 +08:00
|
|
|
btrfs_free_path(path);
|
2007-03-07 09:08:01 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2009-03-13 22:11:24 +08:00
|
|
|
/*
|
2010-05-16 22:46:25 +08:00
|
|
|
* when we free an block, it is possible (and likely) that we free the last
|
2009-03-13 22:11:24 +08:00
|
|
|
* delayed ref for that extent as well. This searches the delayed ref tree for
|
|
|
|
* a given extent, and if there are no other delayed refs to be processed, it
|
|
|
|
* removes it from the tree.
|
|
|
|
*/
|
|
|
|
static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root, u64 bytenr)
|
|
|
|
{
|
|
|
|
struct btrfs_delayed_ref_head *head;
|
|
|
|
struct btrfs_delayed_ref_root *delayed_refs;
|
|
|
|
struct btrfs_delayed_ref_node *ref;
|
|
|
|
struct rb_node *node;
|
2010-05-16 22:46:25 +08:00
|
|
|
int ret = 0;
|
2009-03-13 22:11:24 +08:00
|
|
|
|
|
|
|
delayed_refs = &trans->transaction->delayed_refs;
|
|
|
|
spin_lock(&delayed_refs->lock);
|
|
|
|
head = btrfs_find_delayed_ref_head(trans, bytenr);
|
|
|
|
if (!head)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
node = rb_prev(&head->node.rb_node);
|
|
|
|
if (!node)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
ref = rb_entry(node, struct btrfs_delayed_ref_node, rb_node);
|
|
|
|
|
|
|
|
/* there are still entries for this ref, we can't drop it */
|
|
|
|
if (ref->bytenr == bytenr)
|
|
|
|
goto out;
|
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (head->extent_op) {
|
|
|
|
if (!head->must_insert_reserved)
|
|
|
|
goto out;
|
2012-11-21 10:21:28 +08:00
|
|
|
btrfs_free_delayed_extent_op(head->extent_op);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
head->extent_op = NULL;
|
|
|
|
}
|
|
|
|
|
2009-03-13 22:11:24 +08:00
|
|
|
/*
|
|
|
|
* waiting for the lock here would deadlock. If someone else has it
|
|
|
|
* locked they are already in the process of dropping it anyway
|
|
|
|
*/
|
|
|
|
if (!mutex_trylock(&head->mutex))
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* at this point we have a head with no other entries. Go
|
|
|
|
* ahead and process it.
|
|
|
|
*/
|
|
|
|
head->node.in_tree = 0;
|
|
|
|
rb_erase(&head->node.rb_node, &delayed_refs->root);
|
2009-03-13 22:17:05 +08:00
|
|
|
|
2009-03-13 22:11:24 +08:00
|
|
|
delayed_refs->num_entries--;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* we don't take a ref on the node because we're removing it from the
|
|
|
|
* tree, so we just steal the ref the tree was holding.
|
|
|
|
*/
|
2009-03-13 22:17:05 +08:00
|
|
|
delayed_refs->num_heads--;
|
|
|
|
if (list_empty(&head->cluster))
|
|
|
|
delayed_refs->num_heads_ready--;
|
|
|
|
|
|
|
|
list_del_init(&head->cluster);
|
2009-03-13 22:11:24 +08:00
|
|
|
spin_unlock(&delayed_refs->lock);
|
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
BUG_ON(head->extent_op);
|
|
|
|
if (head->must_insert_reserved)
|
|
|
|
ret = 1;
|
|
|
|
|
|
|
|
mutex_unlock(&head->mutex);
|
2009-03-13 22:11:24 +08:00
|
|
|
btrfs_put_delayed_ref(&head->node);
|
2010-05-16 22:46:25 +08:00
|
|
|
return ret;
|
2009-03-13 22:11:24 +08:00
|
|
|
out:
|
|
|
|
spin_unlock(&delayed_refs->lock);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct extent_buffer *buf,
|
2012-05-16 23:04:52 +08:00
|
|
|
u64 parent, int last_ref)
|
2010-05-16 22:46:25 +08:00
|
|
|
{
|
|
|
|
struct btrfs_block_group_cache *cache = NULL;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID) {
|
2011-09-12 21:26:38 +08:00
|
|
|
ret = btrfs_add_delayed_tree_ref(root->fs_info, trans,
|
|
|
|
buf->start, buf->len,
|
|
|
|
parent, root->root_key.objectid,
|
|
|
|
btrfs_header_level(buf),
|
2012-05-16 23:04:52 +08:00
|
|
|
BTRFS_DROP_DELAYED_REF, NULL, 0);
|
2012-03-12 23:03:00 +08:00
|
|
|
BUG_ON(ret); /* -ENOMEM */
|
2010-05-16 22:46:25 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
if (!last_ref)
|
|
|
|
return;
|
|
|
|
|
|
|
|
cache = btrfs_lookup_block_group(root->fs_info, buf->start);
|
|
|
|
|
|
|
|
if (btrfs_header_generation(buf) == trans->transid) {
|
|
|
|
if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID) {
|
|
|
|
ret = check_ref_cleanup(trans, root, buf->start);
|
|
|
|
if (!ret)
|
2011-08-05 22:25:38 +08:00
|
|
|
goto out;
|
2010-05-16 22:46:25 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN)) {
|
|
|
|
pin_down_extent(root, cache, buf->start, buf->len, 1);
|
2011-08-05 22:25:38 +08:00
|
|
|
goto out;
|
2010-05-16 22:46:25 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
WARN_ON(test_bit(EXTENT_BUFFER_DIRTY, &buf->bflags));
|
|
|
|
|
|
|
|
btrfs_add_free_space(cache, buf->start, buf->len);
|
2011-07-27 05:00:46 +08:00
|
|
|
btrfs_update_reserved_bytes(cache, buf->len, RESERVE_FREE);
|
2010-05-16 22:46:25 +08:00
|
|
|
}
|
|
|
|
out:
|
2011-03-17 01:42:43 +08:00
|
|
|
/*
|
|
|
|
* Deleting the buffer, clear the corrupt flag since it doesn't matter
|
|
|
|
* anymore.
|
|
|
|
*/
|
|
|
|
clear_bit(EXTENT_BUFFER_CORRUPT, &buf->bflags);
|
2010-05-16 22:46:25 +08:00
|
|
|
btrfs_put_block_group(cache);
|
|
|
|
}
|
|
|
|
|
2012-03-12 23:03:00 +08:00
|
|
|
/* Can return -ENOMEM */
|
2011-09-12 21:26:38 +08:00
|
|
|
int btrfs_free_extent(struct btrfs_trans_handle *trans, struct btrfs_root *root,
|
|
|
|
u64 bytenr, u64 num_bytes, u64 parent, u64 root_objectid,
|
|
|
|
u64 owner, u64 offset, int for_cow)
|
2008-06-26 04:01:30 +08:00
|
|
|
{
|
|
|
|
int ret;
|
2011-09-12 21:26:38 +08:00
|
|
|
struct btrfs_fs_info *fs_info = root->fs_info;
|
2008-06-26 04:01:30 +08:00
|
|
|
|
2009-03-13 22:10:06 +08:00
|
|
|
/*
|
|
|
|
* tree log blocks never actually go into the extent allocation
|
|
|
|
* tree, just update pinning info and exit early.
|
|
|
|
*/
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (root_objectid == BTRFS_TREE_LOG_OBJECTID) {
|
|
|
|
WARN_ON(owner >= BTRFS_FIRST_FREE_OBJECTID);
|
2009-03-13 23:00:37 +08:00
|
|
|
/* unlocks the pinned mutex */
|
2009-09-12 04:11:19 +08:00
|
|
|
btrfs_pin_extent(root, bytenr, num_bytes, 1);
|
2009-03-13 22:10:06 +08:00
|
|
|
ret = 0;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
} else if (owner < BTRFS_FIRST_FREE_OBJECTID) {
|
2011-09-12 21:26:38 +08:00
|
|
|
ret = btrfs_add_delayed_tree_ref(fs_info, trans, bytenr,
|
|
|
|
num_bytes,
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
parent, root_objectid, (int)owner,
|
2011-09-12 21:26:38 +08:00
|
|
|
BTRFS_DROP_DELAYED_REF, NULL, for_cow);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
} else {
|
2011-09-12 21:26:38 +08:00
|
|
|
ret = btrfs_add_delayed_data_ref(fs_info, trans, bytenr,
|
|
|
|
num_bytes,
|
|
|
|
parent, root_objectid, owner,
|
|
|
|
offset, BTRFS_DROP_DELAYED_REF,
|
|
|
|
NULL, for_cow);
|
2009-03-13 22:10:06 +08:00
|
|
|
}
|
2008-06-26 04:01:30 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2013-01-30 07:40:14 +08:00
|
|
|
static u64 stripe_align(struct btrfs_root *root,
|
|
|
|
struct btrfs_block_group_cache *cache,
|
|
|
|
u64 val, u64 num_bytes)
|
2007-12-01 00:30:34 +08:00
|
|
|
{
|
2013-02-26 16:10:22 +08:00
|
|
|
u64 ret = ALIGN(val, root->stripesize);
|
2007-12-01 00:30:34 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
/*
|
|
|
|
* when we wait for progress in the block group caching, its because
|
|
|
|
* our allocation attempt failed at least once. So, we must sleep
|
|
|
|
* and let some progress happen before we try again.
|
|
|
|
*
|
|
|
|
* This function will sleep at least once waiting for new free space to
|
|
|
|
* show up, and then it will check the block group free space numbers
|
|
|
|
* for our min num_bytes. Another option is to have it go ahead
|
|
|
|
* and look in the rbtree for a free extent of a given size, but this
|
|
|
|
* is a good start.
|
|
|
|
*/
|
|
|
|
static noinline int
|
|
|
|
wait_block_group_cache_progress(struct btrfs_block_group_cache *cache,
|
|
|
|
u64 num_bytes)
|
|
|
|
{
|
2009-09-12 04:11:19 +08:00
|
|
|
struct btrfs_caching_control *caching_ctl;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
caching_ctl = get_caching_control(cache);
|
|
|
|
if (!caching_ctl)
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
return 0;
|
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
wait_event(caching_ctl->wait, block_group_cache_done(cache) ||
|
2011-03-29 13:46:06 +08:00
|
|
|
(cache->free_space_ctl->free_space >= num_bytes));
|
2009-09-12 04:11:19 +08:00
|
|
|
|
|
|
|
put_caching_control(caching_ctl);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static noinline int
|
|
|
|
wait_block_group_cache_done(struct btrfs_block_group_cache *cache)
|
|
|
|
{
|
|
|
|
struct btrfs_caching_control *caching_ctl;
|
|
|
|
|
|
|
|
caching_ctl = get_caching_control(cache);
|
|
|
|
if (!caching_ctl)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
wait_event(caching_ctl->wait, block_group_cache_done(cache));
|
|
|
|
|
|
|
|
put_caching_control(caching_ctl);
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2012-11-21 22:18:10 +08:00
|
|
|
int __get_raid_index(u64 flags)
|
2010-05-16 22:46:24 +08:00
|
|
|
{
|
2012-03-27 22:09:17 +08:00
|
|
|
if (flags & BTRFS_BLOCK_GROUP_RAID10)
|
2013-01-17 13:38:51 +08:00
|
|
|
return BTRFS_RAID_RAID10;
|
2012-03-27 22:09:17 +08:00
|
|
|
else if (flags & BTRFS_BLOCK_GROUP_RAID1)
|
2013-01-17 13:38:51 +08:00
|
|
|
return BTRFS_RAID_RAID1;
|
2012-03-27 22:09:17 +08:00
|
|
|
else if (flags & BTRFS_BLOCK_GROUP_DUP)
|
2013-01-17 13:38:51 +08:00
|
|
|
return BTRFS_RAID_DUP;
|
2012-03-27 22:09:17 +08:00
|
|
|
else if (flags & BTRFS_BLOCK_GROUP_RAID0)
|
2013-01-17 13:38:51 +08:00
|
|
|
return BTRFS_RAID_RAID0;
|
2013-01-30 07:40:14 +08:00
|
|
|
else if (flags & BTRFS_BLOCK_GROUP_RAID5)
|
2013-02-21 03:06:05 +08:00
|
|
|
return BTRFS_RAID_RAID5;
|
2013-01-30 07:40:14 +08:00
|
|
|
else if (flags & BTRFS_BLOCK_GROUP_RAID6)
|
2013-02-21 03:06:05 +08:00
|
|
|
return BTRFS_RAID_RAID6;
|
|
|
|
|
|
|
|
return BTRFS_RAID_SINGLE; /* BTRFS_BLOCK_GROUP_SINGLE */
|
2010-05-16 22:46:24 +08:00
|
|
|
}
|
|
|
|
|
2012-03-27 22:09:17 +08:00
|
|
|
static int get_block_group_index(struct btrfs_block_group_cache *cache)
|
|
|
|
{
|
2012-11-21 22:18:10 +08:00
|
|
|
return __get_raid_index(cache->flags);
|
2012-03-27 22:09:17 +08:00
|
|
|
}
|
|
|
|
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
enum btrfs_loop_type {
|
2012-01-14 04:27:45 +08:00
|
|
|
LOOP_CACHING_NOWAIT = 0,
|
|
|
|
LOOP_CACHING_WAIT = 1,
|
|
|
|
LOOP_ALLOC_CHUNK = 2,
|
|
|
|
LOOP_NO_EMPTY_SIZE = 3,
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
};
|
|
|
|
|
2007-02-26 23:40:21 +08:00
|
|
|
/*
|
|
|
|
* walks the btree of allocated extents and find a hole of a given size.
|
|
|
|
* The key ins is changed to record the hole:
|
|
|
|
* ins->objectid == block start
|
2007-03-16 00:56:47 +08:00
|
|
|
* ins->flags = BTRFS_EXTENT_ITEM_KEY
|
2007-02-26 23:40:21 +08:00
|
|
|
* ins->offset == number of blocks
|
|
|
|
* Any available blocks before search_start are skipped.
|
|
|
|
*/
|
2009-01-06 10:25:51 +08:00
|
|
|
static noinline int find_free_extent(struct btrfs_trans_handle *trans,
|
2008-01-03 23:01:48 +08:00
|
|
|
struct btrfs_root *orig_root,
|
|
|
|
u64 num_bytes, u64 empty_size,
|
|
|
|
u64 hint_byte, struct btrfs_key *ins,
|
2011-06-19 04:26:38 +08:00
|
|
|
u64 data)
|
2007-02-26 23:40:21 +08:00
|
|
|
{
|
Btrfs: fix enospc when there is plenty of space
So there is an odd case where we can possibly return -ENOSPC when there is in
fact space to be had. It only happens with Metadata writes, and happens _very_
infrequently. What has to happen is we have to allocate have allocated out of
the first logical byte on the disk, which would set last_alloc to
first_logical_byte(root, 0), so search_start == orig_search_start. We then
need to allocate for normal metadata, so BTRFS_BLOCK_GROUP_METADATA |
BTRFS_BLOCK_GROUP_DUP. We will do a block lookup for the given search_start,
block_group_bits() won't match and we'll go to choose another block group.
However because search_start matches orig_search_start we go to see if we can
allocate a chunk.
If we are in the situation that we cannot allocate a chunk, we fail and ENOSPC.
This is kind of a big flaw of the way find_free_extent works, as it along with
find_free_space loop through _all_ of the block groups, not just the ones that
we want to allocate out of. This patch completely kills find_free_space and
rolls it into find_free_extent. I've introduced a sort of state machine into
this, which will make it easier to get cache miss information out of the
allocator, and will work well with my locking changes.
The basic flow is this: We have the variable loop which is 0, meaning we are
in the hint phase. We lookup the block group for the hint, and lookup the
space_info for what we want to allocate out of. If the block group we were
pointed at by the hint either isn't of the correct type, or just doesn't have
the space we need, we set head to space_info->block_groups, so we start at the
beginning of the block groups for this particular space info, and loop through.
This is also where we add the empty_cluster to total_needed. At this point
loop is set to 1 and we just loop through all of the block groups for this
particular space_info looking for the space we need, just as find_free_space
would have done, except we only hit the block groups we want and not _all_ of
the block groups. If we come full circle we see if we can allocate a chunk.
If we cannot of course we exit with -ENOSPC and we are good. If not we start
over at space_info->block_groups and loop through again, with loop == 2. If we
come full circle and haven't found what we need then we exit with -ENOSPC.
I've been running this for a couple of days now and it seems stable, and I
haven't yet hit a -ENOSPC when there was plenty of space left.
Also I've made a groups_sem to handle the group list for the space_info. This
is part of my locking changes, but is relatively safe and seems better than
holding the space_info spinlock over that entire search time. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
int ret = 0;
|
2009-01-06 10:25:51 +08:00
|
|
|
struct btrfs_root *root = orig_root->fs_info->extent_root;
|
2009-04-03 21:47:43 +08:00
|
|
|
struct btrfs_free_cluster *last_ptr = NULL;
|
Btrfs: fix enospc when there is plenty of space
So there is an odd case where we can possibly return -ENOSPC when there is in
fact space to be had. It only happens with Metadata writes, and happens _very_
infrequently. What has to happen is we have to allocate have allocated out of
the first logical byte on the disk, which would set last_alloc to
first_logical_byte(root, 0), so search_start == orig_search_start. We then
need to allocate for normal metadata, so BTRFS_BLOCK_GROUP_METADATA |
BTRFS_BLOCK_GROUP_DUP. We will do a block lookup for the given search_start,
block_group_bits() won't match and we'll go to choose another block group.
However because search_start matches orig_search_start we go to see if we can
allocate a chunk.
If we are in the situation that we cannot allocate a chunk, we fail and ENOSPC.
This is kind of a big flaw of the way find_free_extent works, as it along with
find_free_space loop through _all_ of the block groups, not just the ones that
we want to allocate out of. This patch completely kills find_free_space and
rolls it into find_free_extent. I've introduced a sort of state machine into
this, which will make it easier to get cache miss information out of the
allocator, and will work well with my locking changes.
The basic flow is this: We have the variable loop which is 0, meaning we are
in the hint phase. We lookup the block group for the hint, and lookup the
space_info for what we want to allocate out of. If the block group we were
pointed at by the hint either isn't of the correct type, or just doesn't have
the space we need, we set head to space_info->block_groups, so we start at the
beginning of the block groups for this particular space info, and loop through.
This is also where we add the empty_cluster to total_needed. At this point
loop is set to 1 and we just loop through all of the block groups for this
particular space_info looking for the space we need, just as find_free_space
would have done, except we only hit the block groups we want and not _all_ of
the block groups. If we come full circle we see if we can allocate a chunk.
If we cannot of course we exit with -ENOSPC and we are good. If not we start
over at space_info->block_groups and loop through again, with loop == 2. If we
come full circle and haven't found what we need then we exit with -ENOSPC.
I've been running this for a couple of days now and it seems stable, and I
haven't yet hit a -ENOSPC when there was plenty of space left.
Also I've made a groups_sem to handle the group list for the space_info. This
is part of my locking changes, but is relatively safe and seems better than
holding the space_info spinlock over that entire search time. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
struct btrfs_block_group_cache *block_group = NULL;
|
2011-12-08 09:08:40 +08:00
|
|
|
struct btrfs_block_group_cache *used_block_group;
|
2012-01-18 23:56:06 +08:00
|
|
|
u64 search_start = 0;
|
2008-03-25 03:02:07 +08:00
|
|
|
int empty_cluster = 2 * 1024 * 1024;
|
Btrfs: fix enospc when there is plenty of space
So there is an odd case where we can possibly return -ENOSPC when there is in
fact space to be had. It only happens with Metadata writes, and happens _very_
infrequently. What has to happen is we have to allocate have allocated out of
the first logical byte on the disk, which would set last_alloc to
first_logical_byte(root, 0), so search_start == orig_search_start. We then
need to allocate for normal metadata, so BTRFS_BLOCK_GROUP_METADATA |
BTRFS_BLOCK_GROUP_DUP. We will do a block lookup for the given search_start,
block_group_bits() won't match and we'll go to choose another block group.
However because search_start matches orig_search_start we go to see if we can
allocate a chunk.
If we are in the situation that we cannot allocate a chunk, we fail and ENOSPC.
This is kind of a big flaw of the way find_free_extent works, as it along with
find_free_space loop through _all_ of the block groups, not just the ones that
we want to allocate out of. This patch completely kills find_free_space and
rolls it into find_free_extent. I've introduced a sort of state machine into
this, which will make it easier to get cache miss information out of the
allocator, and will work well with my locking changes.
The basic flow is this: We have the variable loop which is 0, meaning we are
in the hint phase. We lookup the block group for the hint, and lookup the
space_info for what we want to allocate out of. If the block group we were
pointed at by the hint either isn't of the correct type, or just doesn't have
the space we need, we set head to space_info->block_groups, so we start at the
beginning of the block groups for this particular space info, and loop through.
This is also where we add the empty_cluster to total_needed. At this point
loop is set to 1 and we just loop through all of the block groups for this
particular space_info looking for the space we need, just as find_free_space
would have done, except we only hit the block groups we want and not _all_ of
the block groups. If we come full circle we see if we can allocate a chunk.
If we cannot of course we exit with -ENOSPC and we are good. If not we start
over at space_info->block_groups and loop through again, with loop == 2. If we
come full circle and haven't found what we need then we exit with -ENOSPC.
I've been running this for a couple of days now and it seems stable, and I
haven't yet hit a -ENOSPC when there was plenty of space left.
Also I've made a groups_sem to handle the group list for the space_info. This
is part of my locking changes, but is relatively safe and seems better than
holding the space_info spinlock over that entire search time. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
struct btrfs_space_info *space_info;
|
2009-04-03 21:47:43 +08:00
|
|
|
int loop = 0;
|
2012-12-27 17:01:24 +08:00
|
|
|
int index = __get_raid_index(data);
|
2011-07-27 05:00:46 +08:00
|
|
|
int alloc_type = (data & BTRFS_BLOCK_GROUP_DATA) ?
|
|
|
|
RESERVE_ALLOC_NO_ACCOUNT : RESERVE_ALLOC;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
bool found_uncached_bg = false;
|
2009-09-12 04:11:20 +08:00
|
|
|
bool failed_cluster_refill = false;
|
2009-10-06 22:04:28 +08:00
|
|
|
bool failed_alloc = false;
|
2010-09-17 04:19:09 +08:00
|
|
|
bool use_cluster = true;
|
Btrfs: fix race between multi-task space allocation and caching space
The task may fail to get free space though it is enough when multi-task
space allocation and caching space happen at the same time.
Task1 Caching Thread Task2
------------------------------------------------------------------------
find_free_extent
The space has not
be cached, and start
caching thread. And
wait for it.
cache space, if
the space is > 2MB
wake up Task1
find_free_extent
get all the space that
is cached.
try to allocate space,
but there is no space
now.
trigger BUG_ON()
The message is following:
btrfs allocation failed flags 1, wanted 4096
space_info has 1040187392 free, is not full
space_info total=1082130432, used=4096, pinned=41938944, reserved=0, may_use=40828928, readonly=0
block group 12582912 has 8388608 bytes, 0 used 8388608 pinned 0 reserved
block group has cluster?: no
0 blocks of free space at or bigger than bytes is
block group 1103101952 has 1073741824 bytes, 4096 used 33550336 pinned 0 reserved
block group has cluster?: no
0 blocks of free space at or bigger than bytes is
------------[ cut here ]------------
kernel BUG at fs/btrfs/inode.c:835!
[<ffffffffa031261b>] __extent_writepage+0x1bf/0x5ce [btrfs]
[<ffffffff810cbcb8>] ? __set_page_dirty_nobuffers+0xfe/0x108
[<ffffffffa02f8ada>] ? wait_current_trans+0x23/0xec [btrfs]
[<ffffffff810c3fbf>] ? find_get_pages_tag+0x73/0xe2
[<ffffffffa0312d12>] extent_write_cache_pages.clone.0+0x176/0x29a [btrfs]
[<ffffffffa0312e74>] extent_writepages+0x3e/0x53 [btrfs]
[<ffffffff8110ad2c>] ? do_sync_write+0xc6/0x103
[<ffffffffa0302d6e>] ? btrfs_submit_direct+0x414/0x414 [btrfs]
[<ffffffff811380fa>] ? fsnotify+0x236/0x266
[<ffffffffa02fc930>] btrfs_writepages+0x22/0x24 [btrfs]
[<ffffffff810cc215>] do_writepages+0x1c/0x25
[<ffffffff810c4958>] __filemap_fdatawrite_range+0x4e/0x50
[<ffffffff810c4982>] filemap_write_and_wait_range+0x28/0x51
[<ffffffffa0306b2e>] btrfs_sync_file+0x7d/0x198 [btrfs]
[<ffffffff8110aa26>] ? fsnotify_modify+0x5d/0x65
[<ffffffff8112d150>] vfs_fsync_range+0x18/0x21
[<ffffffff8112d170>] vfs_fsync+0x17/0x19
[<ffffffff8112d316>] do_fsync+0x29/0x3e
[<ffffffff8112d348>] sys_fsync+0xb/0xf
[<ffffffff81468352>] system_call_fastpath+0x16/0x1b
[SNIP]
RIP [<ffffffffa02fe08c>] cow_file_range+0x1c4/0x32b [btrfs]
We fix this bug by trying to allocate the space again if there are block groups
in caching.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2011-09-09 17:34:35 +08:00
|
|
|
bool have_caching_bg = false;
|
2007-02-26 23:40:21 +08:00
|
|
|
|
2007-10-16 04:15:53 +08:00
|
|
|
WARN_ON(num_bytes < root->sectorsize);
|
2007-04-05 03:27:52 +08:00
|
|
|
btrfs_set_key_type(ins, BTRFS_EXTENT_ITEM_KEY);
|
Btrfs: fix enospc when there is plenty of space
So there is an odd case where we can possibly return -ENOSPC when there is in
fact space to be had. It only happens with Metadata writes, and happens _very_
infrequently. What has to happen is we have to allocate have allocated out of
the first logical byte on the disk, which would set last_alloc to
first_logical_byte(root, 0), so search_start == orig_search_start. We then
need to allocate for normal metadata, so BTRFS_BLOCK_GROUP_METADATA |
BTRFS_BLOCK_GROUP_DUP. We will do a block lookup for the given search_start,
block_group_bits() won't match and we'll go to choose another block group.
However because search_start matches orig_search_start we go to see if we can
allocate a chunk.
If we are in the situation that we cannot allocate a chunk, we fail and ENOSPC.
This is kind of a big flaw of the way find_free_extent works, as it along with
find_free_space loop through _all_ of the block groups, not just the ones that
we want to allocate out of. This patch completely kills find_free_space and
rolls it into find_free_extent. I've introduced a sort of state machine into
this, which will make it easier to get cache miss information out of the
allocator, and will work well with my locking changes.
The basic flow is this: We have the variable loop which is 0, meaning we are
in the hint phase. We lookup the block group for the hint, and lookup the
space_info for what we want to allocate out of. If the block group we were
pointed at by the hint either isn't of the correct type, or just doesn't have
the space we need, we set head to space_info->block_groups, so we start at the
beginning of the block groups for this particular space info, and loop through.
This is also where we add the empty_cluster to total_needed. At this point
loop is set to 1 and we just loop through all of the block groups for this
particular space_info looking for the space we need, just as find_free_space
would have done, except we only hit the block groups we want and not _all_ of
the block groups. If we come full circle we see if we can allocate a chunk.
If we cannot of course we exit with -ENOSPC and we are good. If not we start
over at space_info->block_groups and loop through again, with loop == 2. If we
come full circle and haven't found what we need then we exit with -ENOSPC.
I've been running this for a couple of days now and it seems stable, and I
haven't yet hit a -ENOSPC when there was plenty of space left.
Also I've made a groups_sem to handle the group list for the space_info. This
is part of my locking changes, but is relatively safe and seems better than
holding the space_info spinlock over that entire search time. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
ins->objectid = 0;
|
|
|
|
ins->offset = 0;
|
2007-04-05 03:27:52 +08:00
|
|
|
|
2011-11-10 21:29:20 +08:00
|
|
|
trace_find_free_extent(orig_root, num_bytes, empty_size, data);
|
|
|
|
|
2009-04-03 22:14:19 +08:00
|
|
|
space_info = __find_space_info(root->fs_info, data);
|
2010-03-20 04:49:55 +08:00
|
|
|
if (!space_info) {
|
2011-06-19 04:26:38 +08:00
|
|
|
printk(KERN_ERR "No space info for %llu\n", data);
|
2010-03-20 04:49:55 +08:00
|
|
|
return -ENOSPC;
|
|
|
|
}
|
2009-04-03 22:14:19 +08:00
|
|
|
|
2010-09-17 04:19:09 +08:00
|
|
|
/*
|
|
|
|
* If the space info is for both data and metadata it means we have a
|
|
|
|
* small filesystem and we can't use the clustering stuff.
|
|
|
|
*/
|
|
|
|
if (btrfs_mixed_space_info(space_info))
|
|
|
|
use_cluster = false;
|
|
|
|
|
|
|
|
if (data & BTRFS_BLOCK_GROUP_METADATA && use_cluster) {
|
2009-04-03 21:47:43 +08:00
|
|
|
last_ptr = &root->fs_info->meta_alloc_cluster;
|
2009-02-12 22:41:38 +08:00
|
|
|
if (!btrfs_test_opt(root, SSD))
|
|
|
|
empty_cluster = 64 * 1024;
|
2008-03-25 03:02:07 +08:00
|
|
|
}
|
|
|
|
|
2010-09-17 04:19:09 +08:00
|
|
|
if ((data & BTRFS_BLOCK_GROUP_DATA) && use_cluster &&
|
|
|
|
btrfs_test_opt(root, SSD)) {
|
2009-04-03 21:47:43 +08:00
|
|
|
last_ptr = &root->fs_info->data_alloc_cluster;
|
|
|
|
}
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
|
2008-03-25 03:02:07 +08:00
|
|
|
if (last_ptr) {
|
2009-04-03 21:47:43 +08:00
|
|
|
spin_lock(&last_ptr->lock);
|
|
|
|
if (last_ptr->block_group)
|
|
|
|
hint_byte = last_ptr->window_start;
|
|
|
|
spin_unlock(&last_ptr->lock);
|
2008-03-25 03:02:07 +08:00
|
|
|
}
|
2009-04-03 21:47:43 +08:00
|
|
|
|
2008-05-07 23:43:44 +08:00
|
|
|
search_start = max(search_start, first_logical_byte(root, 0));
|
2008-03-25 03:02:07 +08:00
|
|
|
search_start = max(search_start, hint_byte);
|
2008-03-25 03:01:56 +08:00
|
|
|
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
if (!last_ptr)
|
2009-04-03 21:47:43 +08:00
|
|
|
empty_cluster = 0;
|
|
|
|
|
2009-04-03 22:14:19 +08:00
|
|
|
if (search_start == hint_byte) {
|
|
|
|
block_group = btrfs_lookup_block_group(root->fs_info,
|
|
|
|
search_start);
|
2011-12-08 09:08:40 +08:00
|
|
|
used_block_group = block_group;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
/*
|
|
|
|
* we don't want to use the block group if it doesn't match our
|
|
|
|
* allocation bits, or if its not cached.
|
Btrfs: find ideal block group for caching
This patch changes a few things. Hopefully the comments are helpfull, but
I'll try and be as verbose here.
Problem:
My fedora box was taking 1 minute and 21 seconds to boot with btrfs as root.
Part of this problem was we pick the first block group we can find and start
caching it, even if it may not have enough free space. The other problem is
we only search for cached block groups the first time around, which we won't
find any cached block groups because this is a newly mounted fs, so we end up
caching several block groups during bootup, which with alot of fragmentation
takes around 30-45 seconds to complete, which bogs down the system. So
Solution:
1) Don't cache block groups willy-nilly at first. Instead try and figure out
which block group has the most free, and therefore will take the least amount
of time to cache.
2) Don't be so picky about cached block groups. The other problem is once
we've filled up a cluster, if the block group isn't finished caching the next
time we try and do the allocation we'll completely ignore the cluster and
start searching from the beginning of the space, which makes us cache more
block groups, which slows us down even more. So instead of skipping block
groups that are not finished caching when we have a hint, only skip the block
group if it hasn't started caching yet.
There is one other tweak in here. Before if we allocated a chunk and still
couldn't find new space, we'd end up switching the space info to force another
chunk allocation. This could make us end up with way too many chunks, so keep
track of this particular case.
With this patch and my previous cluster fixes my fedora box now boots in 43
seconds, and according to the bootchart is not held up by our block group
caching at all.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-11-11 10:23:48 +08:00
|
|
|
*
|
|
|
|
* However if we are re-searching with an ideal block group
|
|
|
|
* picked out then we don't care that the block group is cached.
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
*/
|
|
|
|
if (block_group && block_group_bits(block_group, data) &&
|
2012-01-14 04:27:45 +08:00
|
|
|
block_group->cached != BTRFS_CACHE_NO) {
|
2009-04-03 22:14:19 +08:00
|
|
|
down_read(&space_info->groups_sem);
|
2009-06-05 03:34:51 +08:00
|
|
|
if (list_empty(&block_group->list) ||
|
|
|
|
block_group->ro) {
|
|
|
|
/*
|
|
|
|
* someone is removing this block group,
|
|
|
|
* we can't jump into the have_block_group
|
|
|
|
* target because our list pointers are not
|
|
|
|
* valid
|
|
|
|
*/
|
|
|
|
btrfs_put_block_group(block_group);
|
|
|
|
up_read(&space_info->groups_sem);
|
Btrfs: find ideal block group for caching
This patch changes a few things. Hopefully the comments are helpfull, but
I'll try and be as verbose here.
Problem:
My fedora box was taking 1 minute and 21 seconds to boot with btrfs as root.
Part of this problem was we pick the first block group we can find and start
caching it, even if it may not have enough free space. The other problem is
we only search for cached block groups the first time around, which we won't
find any cached block groups because this is a newly mounted fs, so we end up
caching several block groups during bootup, which with alot of fragmentation
takes around 30-45 seconds to complete, which bogs down the system. So
Solution:
1) Don't cache block groups willy-nilly at first. Instead try and figure out
which block group has the most free, and therefore will take the least amount
of time to cache.
2) Don't be so picky about cached block groups. The other problem is once
we've filled up a cluster, if the block group isn't finished caching the next
time we try and do the allocation we'll completely ignore the cluster and
start searching from the beginning of the space, which makes us cache more
block groups, which slows us down even more. So instead of skipping block
groups that are not finished caching when we have a hint, only skip the block
group if it hasn't started caching yet.
There is one other tweak in here. Before if we allocated a chunk and still
couldn't find new space, we'd end up switching the space info to force another
chunk allocation. This could make us end up with way too many chunks, so keep
track of this particular case.
With this patch and my previous cluster fixes my fedora box now boots in 43
seconds, and according to the bootchart is not held up by our block group
caching at all.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-11-11 10:23:48 +08:00
|
|
|
} else {
|
2010-05-16 22:46:24 +08:00
|
|
|
index = get_block_group_index(block_group);
|
2009-06-05 03:34:51 +08:00
|
|
|
goto have_block_group;
|
Btrfs: find ideal block group for caching
This patch changes a few things. Hopefully the comments are helpfull, but
I'll try and be as verbose here.
Problem:
My fedora box was taking 1 minute and 21 seconds to boot with btrfs as root.
Part of this problem was we pick the first block group we can find and start
caching it, even if it may not have enough free space. The other problem is
we only search for cached block groups the first time around, which we won't
find any cached block groups because this is a newly mounted fs, so we end up
caching several block groups during bootup, which with alot of fragmentation
takes around 30-45 seconds to complete, which bogs down the system. So
Solution:
1) Don't cache block groups willy-nilly at first. Instead try and figure out
which block group has the most free, and therefore will take the least amount
of time to cache.
2) Don't be so picky about cached block groups. The other problem is once
we've filled up a cluster, if the block group isn't finished caching the next
time we try and do the allocation we'll completely ignore the cluster and
start searching from the beginning of the space, which makes us cache more
block groups, which slows us down even more. So instead of skipping block
groups that are not finished caching when we have a hint, only skip the block
group if it hasn't started caching yet.
There is one other tweak in here. Before if we allocated a chunk and still
couldn't find new space, we'd end up switching the space info to force another
chunk allocation. This could make us end up with way too many chunks, so keep
track of this particular case.
With this patch and my previous cluster fixes my fedora box now boots in 43
seconds, and according to the bootchart is not held up by our block group
caching at all.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-11-11 10:23:48 +08:00
|
|
|
}
|
2009-04-03 22:14:19 +08:00
|
|
|
} else if (block_group) {
|
2009-04-03 21:47:43 +08:00
|
|
|
btrfs_put_block_group(block_group);
|
2009-04-03 22:14:19 +08:00
|
|
|
}
|
2008-11-08 07:17:11 +08:00
|
|
|
}
|
2009-04-03 22:14:19 +08:00
|
|
|
search:
|
Btrfs: fix race between multi-task space allocation and caching space
The task may fail to get free space though it is enough when multi-task
space allocation and caching space happen at the same time.
Task1 Caching Thread Task2
------------------------------------------------------------------------
find_free_extent
The space has not
be cached, and start
caching thread. And
wait for it.
cache space, if
the space is > 2MB
wake up Task1
find_free_extent
get all the space that
is cached.
try to allocate space,
but there is no space
now.
trigger BUG_ON()
The message is following:
btrfs allocation failed flags 1, wanted 4096
space_info has 1040187392 free, is not full
space_info total=1082130432, used=4096, pinned=41938944, reserved=0, may_use=40828928, readonly=0
block group 12582912 has 8388608 bytes, 0 used 8388608 pinned 0 reserved
block group has cluster?: no
0 blocks of free space at or bigger than bytes is
block group 1103101952 has 1073741824 bytes, 4096 used 33550336 pinned 0 reserved
block group has cluster?: no
0 blocks of free space at or bigger than bytes is
------------[ cut here ]------------
kernel BUG at fs/btrfs/inode.c:835!
[<ffffffffa031261b>] __extent_writepage+0x1bf/0x5ce [btrfs]
[<ffffffff810cbcb8>] ? __set_page_dirty_nobuffers+0xfe/0x108
[<ffffffffa02f8ada>] ? wait_current_trans+0x23/0xec [btrfs]
[<ffffffff810c3fbf>] ? find_get_pages_tag+0x73/0xe2
[<ffffffffa0312d12>] extent_write_cache_pages.clone.0+0x176/0x29a [btrfs]
[<ffffffffa0312e74>] extent_writepages+0x3e/0x53 [btrfs]
[<ffffffff8110ad2c>] ? do_sync_write+0xc6/0x103
[<ffffffffa0302d6e>] ? btrfs_submit_direct+0x414/0x414 [btrfs]
[<ffffffff811380fa>] ? fsnotify+0x236/0x266
[<ffffffffa02fc930>] btrfs_writepages+0x22/0x24 [btrfs]
[<ffffffff810cc215>] do_writepages+0x1c/0x25
[<ffffffff810c4958>] __filemap_fdatawrite_range+0x4e/0x50
[<ffffffff810c4982>] filemap_write_and_wait_range+0x28/0x51
[<ffffffffa0306b2e>] btrfs_sync_file+0x7d/0x198 [btrfs]
[<ffffffff8110aa26>] ? fsnotify_modify+0x5d/0x65
[<ffffffff8112d150>] vfs_fsync_range+0x18/0x21
[<ffffffff8112d170>] vfs_fsync+0x17/0x19
[<ffffffff8112d316>] do_fsync+0x29/0x3e
[<ffffffff8112d348>] sys_fsync+0xb/0xf
[<ffffffff81468352>] system_call_fastpath+0x16/0x1b
[SNIP]
RIP [<ffffffffa02fe08c>] cow_file_range+0x1c4/0x32b [btrfs]
We fix this bug by trying to allocate the space again if there are block groups
in caching.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2011-09-09 17:34:35 +08:00
|
|
|
have_caching_bg = false;
|
Btrfs: fix enospc when there is plenty of space
So there is an odd case where we can possibly return -ENOSPC when there is in
fact space to be had. It only happens with Metadata writes, and happens _very_
infrequently. What has to happen is we have to allocate have allocated out of
the first logical byte on the disk, which would set last_alloc to
first_logical_byte(root, 0), so search_start == orig_search_start. We then
need to allocate for normal metadata, so BTRFS_BLOCK_GROUP_METADATA |
BTRFS_BLOCK_GROUP_DUP. We will do a block lookup for the given search_start,
block_group_bits() won't match and we'll go to choose another block group.
However because search_start matches orig_search_start we go to see if we can
allocate a chunk.
If we are in the situation that we cannot allocate a chunk, we fail and ENOSPC.
This is kind of a big flaw of the way find_free_extent works, as it along with
find_free_space loop through _all_ of the block groups, not just the ones that
we want to allocate out of. This patch completely kills find_free_space and
rolls it into find_free_extent. I've introduced a sort of state machine into
this, which will make it easier to get cache miss information out of the
allocator, and will work well with my locking changes.
The basic flow is this: We have the variable loop which is 0, meaning we are
in the hint phase. We lookup the block group for the hint, and lookup the
space_info for what we want to allocate out of. If the block group we were
pointed at by the hint either isn't of the correct type, or just doesn't have
the space we need, we set head to space_info->block_groups, so we start at the
beginning of the block groups for this particular space info, and loop through.
This is also where we add the empty_cluster to total_needed. At this point
loop is set to 1 and we just loop through all of the block groups for this
particular space_info looking for the space we need, just as find_free_space
would have done, except we only hit the block groups we want and not _all_ of
the block groups. If we come full circle we see if we can allocate a chunk.
If we cannot of course we exit with -ENOSPC and we are good. If not we start
over at space_info->block_groups and loop through again, with loop == 2. If we
come full circle and haven't found what we need then we exit with -ENOSPC.
I've been running this for a couple of days now and it seems stable, and I
haven't yet hit a -ENOSPC when there was plenty of space left.
Also I've made a groups_sem to handle the group list for the space_info. This
is part of my locking changes, but is relatively safe and seems better than
holding the space_info spinlock over that entire search time. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
down_read(&space_info->groups_sem);
|
2010-05-16 22:46:24 +08:00
|
|
|
list_for_each_entry(block_group, &space_info->block_groups[index],
|
|
|
|
list) {
|
2009-04-03 22:14:18 +08:00
|
|
|
u64 offset;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
int cached;
|
2008-11-11 05:13:54 +08:00
|
|
|
|
2011-12-08 09:08:40 +08:00
|
|
|
used_block_group = block_group;
|
2009-11-14 04:12:59 +08:00
|
|
|
btrfs_get_block_group(block_group);
|
2009-04-03 22:14:19 +08:00
|
|
|
search_start = block_group->key.objectid;
|
2008-11-08 07:17:11 +08:00
|
|
|
|
2010-12-14 04:06:46 +08:00
|
|
|
/*
|
|
|
|
* this can happen if we end up cycling through all the
|
|
|
|
* raid types, but we want to make sure we only allocate
|
|
|
|
* for the proper type.
|
|
|
|
*/
|
|
|
|
if (!block_group_bits(block_group, data)) {
|
|
|
|
u64 extra = BTRFS_BLOCK_GROUP_DUP |
|
|
|
|
BTRFS_BLOCK_GROUP_RAID1 |
|
2013-01-30 07:40:14 +08:00
|
|
|
BTRFS_BLOCK_GROUP_RAID5 |
|
|
|
|
BTRFS_BLOCK_GROUP_RAID6 |
|
2010-12-14 04:06:46 +08:00
|
|
|
BTRFS_BLOCK_GROUP_RAID10;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* if they asked for extra copies and this block group
|
|
|
|
* doesn't provide them, bail. This does allow us to
|
|
|
|
* fill raid0 from raid1.
|
|
|
|
*/
|
|
|
|
if ((data & extra) && !(block_group->flags & extra))
|
|
|
|
goto loop;
|
|
|
|
}
|
|
|
|
|
2009-04-03 22:14:19 +08:00
|
|
|
have_block_group:
|
2011-11-15 02:52:14 +08:00
|
|
|
cached = block_group_cache_done(block_group);
|
|
|
|
if (unlikely(!cached)) {
|
|
|
|
found_uncached_bg = true;
|
2012-12-27 17:01:18 +08:00
|
|
|
ret = cache_block_group(block_group, 0);
|
2012-03-29 08:31:37 +08:00
|
|
|
BUG_ON(ret < 0);
|
|
|
|
ret = 0;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
}
|
|
|
|
|
2008-11-21 01:16:16 +08:00
|
|
|
if (unlikely(block_group->ro))
|
2009-04-03 22:14:19 +08:00
|
|
|
goto loop;
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
|
2009-09-12 04:11:20 +08:00
|
|
|
/*
|
2011-12-08 08:50:42 +08:00
|
|
|
* Ok we want to try and use the cluster allocator, so
|
|
|
|
* lets look there
|
2009-09-12 04:11:20 +08:00
|
|
|
*/
|
2011-12-08 08:50:42 +08:00
|
|
|
if (last_ptr) {
|
2013-01-05 04:39:43 +08:00
|
|
|
unsigned long aligned_cluster;
|
2009-04-03 21:47:43 +08:00
|
|
|
/*
|
|
|
|
* the refill lock keeps out other
|
|
|
|
* people trying to start a new cluster
|
|
|
|
*/
|
|
|
|
spin_lock(&last_ptr->refill_lock);
|
2011-12-08 09:08:40 +08:00
|
|
|
used_block_group = last_ptr->block_group;
|
|
|
|
if (used_block_group != block_group &&
|
|
|
|
(!used_block_group ||
|
|
|
|
used_block_group->ro ||
|
|
|
|
!block_group_bits(used_block_group, data))) {
|
|
|
|
used_block_group = block_group;
|
2009-06-05 03:34:51 +08:00
|
|
|
goto refill_cluster;
|
2011-12-08 09:08:40 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
if (used_block_group != block_group)
|
|
|
|
btrfs_get_block_group(used_block_group);
|
2009-06-05 03:34:51 +08:00
|
|
|
|
2011-12-08 09:08:40 +08:00
|
|
|
offset = btrfs_alloc_from_cluster(used_block_group,
|
|
|
|
last_ptr, num_bytes, used_block_group->key.objectid);
|
2009-04-03 21:47:43 +08:00
|
|
|
if (offset) {
|
|
|
|
/* we have a block, we're done */
|
|
|
|
spin_unlock(&last_ptr->refill_lock);
|
2011-11-10 21:29:20 +08:00
|
|
|
trace_btrfs_reserve_extent_cluster(root,
|
|
|
|
block_group, search_start, num_bytes);
|
2009-04-03 21:47:43 +08:00
|
|
|
goto checks;
|
|
|
|
}
|
|
|
|
|
2011-12-08 09:08:40 +08:00
|
|
|
WARN_ON(last_ptr->block_group != used_block_group);
|
|
|
|
if (used_block_group != block_group) {
|
|
|
|
btrfs_put_block_group(used_block_group);
|
|
|
|
used_block_group = block_group;
|
2009-04-03 21:47:43 +08:00
|
|
|
}
|
2009-06-05 03:34:51 +08:00
|
|
|
refill_cluster:
|
2011-12-08 09:08:40 +08:00
|
|
|
BUG_ON(used_block_group != block_group);
|
2011-12-08 08:50:42 +08:00
|
|
|
/* If we are on LOOP_NO_EMPTY_SIZE, we can't
|
|
|
|
* set up a new clusters, so lets just skip it
|
|
|
|
* and let the allocator find whatever block
|
|
|
|
* it can find. If we reach this point, we
|
|
|
|
* will have tried the cluster allocator
|
|
|
|
* plenty of times and not have found
|
|
|
|
* anything, so we are likely way too
|
|
|
|
* fragmented for the clustering stuff to find
|
Btrfs: test free space only for unclustered allocation
Since the clustered allocation may be taking extents from a different
block group, there's no point in spin-locking and testing the current
block group free space before attempting to allocate space from a
cluster, even more so when we might refrain from even trying the
cluster in the current block group because, after the cluster was set
up, not enough free space remained. Furthermore, cluster creation
attempts fail fast when the block group doesn't have enough free
space, so the test was completely superfluous.
I've move the free space test past the cluster allocation attempt,
where it is more useful, and arranged for a cluster in the current
block group to be released before trying an unclustered allocation,
when we reach the LOOP_NO_EMPTY_SIZE stage, so that the free space in
the cluster stands a chance of being combined with additional free
space in the block group so as to succeed in the allocation attempt.
Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-12-12 14:48:19 +08:00
|
|
|
* anything.
|
|
|
|
*
|
|
|
|
* However, if the cluster is taken from the
|
|
|
|
* current block group, release the cluster
|
|
|
|
* first, so that we stand a better chance of
|
|
|
|
* succeeding in the unclustered
|
|
|
|
* allocation. */
|
|
|
|
if (loop >= LOOP_NO_EMPTY_SIZE &&
|
|
|
|
last_ptr->block_group != block_group) {
|
2011-12-08 08:50:42 +08:00
|
|
|
spin_unlock(&last_ptr->refill_lock);
|
|
|
|
goto unclustered_alloc;
|
|
|
|
}
|
|
|
|
|
2009-04-03 21:47:43 +08:00
|
|
|
/*
|
|
|
|
* this cluster didn't work out, free it and
|
|
|
|
* start over
|
|
|
|
*/
|
|
|
|
btrfs_return_cluster_to_free_space(NULL, last_ptr);
|
|
|
|
|
Btrfs: test free space only for unclustered allocation
Since the clustered allocation may be taking extents from a different
block group, there's no point in spin-locking and testing the current
block group free space before attempting to allocate space from a
cluster, even more so when we might refrain from even trying the
cluster in the current block group because, after the cluster was set
up, not enough free space remained. Furthermore, cluster creation
attempts fail fast when the block group doesn't have enough free
space, so the test was completely superfluous.
I've move the free space test past the cluster allocation attempt,
where it is more useful, and arranged for a cluster in the current
block group to be released before trying an unclustered allocation,
when we reach the LOOP_NO_EMPTY_SIZE stage, so that the free space in
the cluster stands a chance of being combined with additional free
space in the block group so as to succeed in the allocation attempt.
Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-12-12 14:48:19 +08:00
|
|
|
if (loop >= LOOP_NO_EMPTY_SIZE) {
|
|
|
|
spin_unlock(&last_ptr->refill_lock);
|
|
|
|
goto unclustered_alloc;
|
|
|
|
}
|
|
|
|
|
2013-01-05 04:39:43 +08:00
|
|
|
aligned_cluster = max_t(unsigned long,
|
|
|
|
empty_cluster + empty_size,
|
|
|
|
block_group->full_stripe_len);
|
|
|
|
|
2009-04-03 21:47:43 +08:00
|
|
|
/* allocate a cluster in this block group */
|
2009-06-10 08:28:34 +08:00
|
|
|
ret = btrfs_find_space_cluster(trans, root,
|
2009-04-03 21:47:43 +08:00
|
|
|
block_group, last_ptr,
|
2011-12-01 02:43:00 +08:00
|
|
|
search_start, num_bytes,
|
2013-01-05 04:39:43 +08:00
|
|
|
aligned_cluster);
|
2009-04-03 21:47:43 +08:00
|
|
|
if (ret == 0) {
|
|
|
|
/*
|
|
|
|
* now pull our allocation out of this
|
|
|
|
* cluster
|
|
|
|
*/
|
|
|
|
offset = btrfs_alloc_from_cluster(block_group,
|
|
|
|
last_ptr, num_bytes,
|
|
|
|
search_start);
|
|
|
|
if (offset) {
|
|
|
|
/* we found one, proceed */
|
|
|
|
spin_unlock(&last_ptr->refill_lock);
|
2011-11-10 21:29:20 +08:00
|
|
|
trace_btrfs_reserve_extent_cluster(root,
|
|
|
|
block_group, search_start,
|
|
|
|
num_bytes);
|
2009-04-03 21:47:43 +08:00
|
|
|
goto checks;
|
|
|
|
}
|
2009-09-12 04:11:20 +08:00
|
|
|
} else if (!cached && loop > LOOP_CACHING_NOWAIT
|
|
|
|
&& !failed_cluster_refill) {
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
spin_unlock(&last_ptr->refill_lock);
|
|
|
|
|
2009-09-12 04:11:20 +08:00
|
|
|
failed_cluster_refill = true;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
wait_block_group_cache_progress(block_group,
|
|
|
|
num_bytes + empty_cluster + empty_size);
|
|
|
|
goto have_block_group;
|
2009-04-03 21:47:43 +08:00
|
|
|
}
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
|
2009-04-03 21:47:43 +08:00
|
|
|
/*
|
|
|
|
* at this point we either didn't find a cluster
|
|
|
|
* or we weren't able to allocate a block from our
|
|
|
|
* cluster. Free the cluster we've been trying
|
|
|
|
* to use, and go to the next block group
|
|
|
|
*/
|
2009-09-12 04:11:20 +08:00
|
|
|
btrfs_return_cluster_to_free_space(NULL, last_ptr);
|
2009-04-03 21:47:43 +08:00
|
|
|
spin_unlock(&last_ptr->refill_lock);
|
2009-09-12 04:11:20 +08:00
|
|
|
goto loop;
|
2009-04-03 21:47:43 +08:00
|
|
|
}
|
|
|
|
|
2011-12-08 08:50:42 +08:00
|
|
|
unclustered_alloc:
|
Btrfs: test free space only for unclustered allocation
Since the clustered allocation may be taking extents from a different
block group, there's no point in spin-locking and testing the current
block group free space before attempting to allocate space from a
cluster, even more so when we might refrain from even trying the
cluster in the current block group because, after the cluster was set
up, not enough free space remained. Furthermore, cluster creation
attempts fail fast when the block group doesn't have enough free
space, so the test was completely superfluous.
I've move the free space test past the cluster allocation attempt,
where it is more useful, and arranged for a cluster in the current
block group to be released before trying an unclustered allocation,
when we reach the LOOP_NO_EMPTY_SIZE stage, so that the free space in
the cluster stands a chance of being combined with additional free
space in the block group so as to succeed in the allocation attempt.
Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-12-12 14:48:19 +08:00
|
|
|
spin_lock(&block_group->free_space_ctl->tree_lock);
|
|
|
|
if (cached &&
|
|
|
|
block_group->free_space_ctl->free_space <
|
|
|
|
num_bytes + empty_cluster + empty_size) {
|
|
|
|
spin_unlock(&block_group->free_space_ctl->tree_lock);
|
|
|
|
goto loop;
|
|
|
|
}
|
|
|
|
spin_unlock(&block_group->free_space_ctl->tree_lock);
|
|
|
|
|
2009-04-03 22:14:18 +08:00
|
|
|
offset = btrfs_find_space_for_alloc(block_group, search_start,
|
|
|
|
num_bytes, empty_size);
|
2009-10-06 22:04:28 +08:00
|
|
|
/*
|
|
|
|
* If we didn't find a chunk, and we haven't failed on this
|
|
|
|
* block group before, and this block group is in the middle of
|
|
|
|
* caching and we are ok with waiting, then go ahead and wait
|
|
|
|
* for progress to be made, and set failed_alloc to true.
|
|
|
|
*
|
|
|
|
* If failed_alloc is true then we've already waited on this
|
|
|
|
* block group once and should move on to the next block group.
|
|
|
|
*/
|
|
|
|
if (!offset && !failed_alloc && !cached &&
|
|
|
|
loop > LOOP_CACHING_NOWAIT) {
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
wait_block_group_cache_progress(block_group,
|
2009-10-06 22:04:28 +08:00
|
|
|
num_bytes + empty_size);
|
|
|
|
failed_alloc = true;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
goto have_block_group;
|
2009-10-06 22:04:28 +08:00
|
|
|
} else if (!offset) {
|
Btrfs: fix race between multi-task space allocation and caching space
The task may fail to get free space though it is enough when multi-task
space allocation and caching space happen at the same time.
Task1 Caching Thread Task2
------------------------------------------------------------------------
find_free_extent
The space has not
be cached, and start
caching thread. And
wait for it.
cache space, if
the space is > 2MB
wake up Task1
find_free_extent
get all the space that
is cached.
try to allocate space,
but there is no space
now.
trigger BUG_ON()
The message is following:
btrfs allocation failed flags 1, wanted 4096
space_info has 1040187392 free, is not full
space_info total=1082130432, used=4096, pinned=41938944, reserved=0, may_use=40828928, readonly=0
block group 12582912 has 8388608 bytes, 0 used 8388608 pinned 0 reserved
block group has cluster?: no
0 blocks of free space at or bigger than bytes is
block group 1103101952 has 1073741824 bytes, 4096 used 33550336 pinned 0 reserved
block group has cluster?: no
0 blocks of free space at or bigger than bytes is
------------[ cut here ]------------
kernel BUG at fs/btrfs/inode.c:835!
[<ffffffffa031261b>] __extent_writepage+0x1bf/0x5ce [btrfs]
[<ffffffff810cbcb8>] ? __set_page_dirty_nobuffers+0xfe/0x108
[<ffffffffa02f8ada>] ? wait_current_trans+0x23/0xec [btrfs]
[<ffffffff810c3fbf>] ? find_get_pages_tag+0x73/0xe2
[<ffffffffa0312d12>] extent_write_cache_pages.clone.0+0x176/0x29a [btrfs]
[<ffffffffa0312e74>] extent_writepages+0x3e/0x53 [btrfs]
[<ffffffff8110ad2c>] ? do_sync_write+0xc6/0x103
[<ffffffffa0302d6e>] ? btrfs_submit_direct+0x414/0x414 [btrfs]
[<ffffffff811380fa>] ? fsnotify+0x236/0x266
[<ffffffffa02fc930>] btrfs_writepages+0x22/0x24 [btrfs]
[<ffffffff810cc215>] do_writepages+0x1c/0x25
[<ffffffff810c4958>] __filemap_fdatawrite_range+0x4e/0x50
[<ffffffff810c4982>] filemap_write_and_wait_range+0x28/0x51
[<ffffffffa0306b2e>] btrfs_sync_file+0x7d/0x198 [btrfs]
[<ffffffff8110aa26>] ? fsnotify_modify+0x5d/0x65
[<ffffffff8112d150>] vfs_fsync_range+0x18/0x21
[<ffffffff8112d170>] vfs_fsync+0x17/0x19
[<ffffffff8112d316>] do_fsync+0x29/0x3e
[<ffffffff8112d348>] sys_fsync+0xb/0xf
[<ffffffff81468352>] system_call_fastpath+0x16/0x1b
[SNIP]
RIP [<ffffffffa02fe08c>] cow_file_range+0x1c4/0x32b [btrfs]
We fix this bug by trying to allocate the space again if there are block groups
in caching.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2011-09-09 17:34:35 +08:00
|
|
|
if (!cached)
|
|
|
|
have_caching_bg = true;
|
2009-10-06 22:04:28 +08:00
|
|
|
goto loop;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
}
|
2009-04-03 21:47:43 +08:00
|
|
|
checks:
|
2013-01-30 07:40:14 +08:00
|
|
|
search_start = stripe_align(root, used_block_group,
|
|
|
|
offset, num_bytes);
|
Btrfs: nuke fs wide allocation mutex V2
This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch
of little locks.
There is now a pinned_mutex, which is used when messing with the pinned_extents
extent io tree, and the extent_ins_mutex which is used with the pending_del and
extent_ins extent io trees.
The locking for the extent tree stuff was inspired by a patch that Yan Zheng
wrote to fix a race condition, I cleaned it up some and changed the locking
around a little bit, but the idea remains the same. Basically instead of
holding the extent_ins_mutex throughout the processing of an extent on the
extent_ins or pending_del trees, we just hold it while we're searching and when
we clear the bits on those trees, and lock the extent for the duration of the
operations on the extent.
Also to keep from getting hung up waiting to lock an extent, I've added a
try_lock_extent so if we cannot lock the extent, move on to the next one in the
tree and we'll come back to that one. I have tested this heavily and it does
not appear to break anything. This has to be applied on top of my
find_free_extent redo patch.
I tested this patch on top of Yan's space reblancing code and it worked fine.
The only thing that has changed since the last version is I pulled out all my
debugging stuff, apparently I forgot to run guilt refresh before I sent the
last patch out. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
|
2009-04-03 22:14:19 +08:00
|
|
|
/* move on to the next group */
|
|
|
|
if (search_start + num_bytes >
|
2011-12-08 09:08:40 +08:00
|
|
|
used_block_group->key.objectid + used_block_group->key.offset) {
|
|
|
|
btrfs_add_free_space(used_block_group, offset, num_bytes);
|
2009-04-03 22:14:19 +08:00
|
|
|
goto loop;
|
2009-04-03 22:14:18 +08:00
|
|
|
}
|
2008-11-11 00:47:09 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
if (offset < search_start)
|
2011-12-08 09:08:40 +08:00
|
|
|
btrfs_add_free_space(used_block_group, offset,
|
2010-05-16 22:46:25 +08:00
|
|
|
search_start - offset);
|
|
|
|
BUG_ON(offset > search_start);
|
2009-04-03 22:14:19 +08:00
|
|
|
|
2011-12-08 09:08:40 +08:00
|
|
|
ret = btrfs_update_reserved_bytes(used_block_group, num_bytes,
|
2011-07-27 05:00:46 +08:00
|
|
|
alloc_type);
|
2010-05-16 22:46:25 +08:00
|
|
|
if (ret == -EAGAIN) {
|
2011-12-08 09:08:40 +08:00
|
|
|
btrfs_add_free_space(used_block_group, offset, num_bytes);
|
2009-04-03 22:14:19 +08:00
|
|
|
goto loop;
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
}
|
2008-03-25 03:01:56 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
/* we are all good, lets return */
|
2009-04-03 22:14:19 +08:00
|
|
|
ins->objectid = search_start;
|
|
|
|
ins->offset = num_bytes;
|
2008-12-12 05:30:39 +08:00
|
|
|
|
2011-11-10 21:29:20 +08:00
|
|
|
trace_btrfs_reserve_extent(orig_root, block_group,
|
|
|
|
search_start, num_bytes);
|
2011-12-08 09:08:40 +08:00
|
|
|
if (used_block_group != block_group)
|
|
|
|
btrfs_put_block_group(used_block_group);
|
2011-05-12 03:26:06 +08:00
|
|
|
btrfs_put_block_group(block_group);
|
2009-04-03 22:14:19 +08:00
|
|
|
break;
|
|
|
|
loop:
|
2009-09-12 04:11:20 +08:00
|
|
|
failed_cluster_refill = false;
|
2009-10-06 22:04:28 +08:00
|
|
|
failed_alloc = false;
|
2010-05-16 22:46:24 +08:00
|
|
|
BUG_ON(index != get_block_group_index(block_group));
|
2011-12-08 09:08:40 +08:00
|
|
|
if (used_block_group != block_group)
|
|
|
|
btrfs_put_block_group(used_block_group);
|
2009-04-03 21:47:43 +08:00
|
|
|
btrfs_put_block_group(block_group);
|
2009-04-03 22:14:19 +08:00
|
|
|
}
|
|
|
|
up_read(&space_info->groups_sem);
|
|
|
|
|
Btrfs: fix race between multi-task space allocation and caching space
The task may fail to get free space though it is enough when multi-task
space allocation and caching space happen at the same time.
Task1 Caching Thread Task2
------------------------------------------------------------------------
find_free_extent
The space has not
be cached, and start
caching thread. And
wait for it.
cache space, if
the space is > 2MB
wake up Task1
find_free_extent
get all the space that
is cached.
try to allocate space,
but there is no space
now.
trigger BUG_ON()
The message is following:
btrfs allocation failed flags 1, wanted 4096
space_info has 1040187392 free, is not full
space_info total=1082130432, used=4096, pinned=41938944, reserved=0, may_use=40828928, readonly=0
block group 12582912 has 8388608 bytes, 0 used 8388608 pinned 0 reserved
block group has cluster?: no
0 blocks of free space at or bigger than bytes is
block group 1103101952 has 1073741824 bytes, 4096 used 33550336 pinned 0 reserved
block group has cluster?: no
0 blocks of free space at or bigger than bytes is
------------[ cut here ]------------
kernel BUG at fs/btrfs/inode.c:835!
[<ffffffffa031261b>] __extent_writepage+0x1bf/0x5ce [btrfs]
[<ffffffff810cbcb8>] ? __set_page_dirty_nobuffers+0xfe/0x108
[<ffffffffa02f8ada>] ? wait_current_trans+0x23/0xec [btrfs]
[<ffffffff810c3fbf>] ? find_get_pages_tag+0x73/0xe2
[<ffffffffa0312d12>] extent_write_cache_pages.clone.0+0x176/0x29a [btrfs]
[<ffffffffa0312e74>] extent_writepages+0x3e/0x53 [btrfs]
[<ffffffff8110ad2c>] ? do_sync_write+0xc6/0x103
[<ffffffffa0302d6e>] ? btrfs_submit_direct+0x414/0x414 [btrfs]
[<ffffffff811380fa>] ? fsnotify+0x236/0x266
[<ffffffffa02fc930>] btrfs_writepages+0x22/0x24 [btrfs]
[<ffffffff810cc215>] do_writepages+0x1c/0x25
[<ffffffff810c4958>] __filemap_fdatawrite_range+0x4e/0x50
[<ffffffff810c4982>] filemap_write_and_wait_range+0x28/0x51
[<ffffffffa0306b2e>] btrfs_sync_file+0x7d/0x198 [btrfs]
[<ffffffff8110aa26>] ? fsnotify_modify+0x5d/0x65
[<ffffffff8112d150>] vfs_fsync_range+0x18/0x21
[<ffffffff8112d170>] vfs_fsync+0x17/0x19
[<ffffffff8112d316>] do_fsync+0x29/0x3e
[<ffffffff8112d348>] sys_fsync+0xb/0xf
[<ffffffff81468352>] system_call_fastpath+0x16/0x1b
[SNIP]
RIP [<ffffffffa02fe08c>] cow_file_range+0x1c4/0x32b [btrfs]
We fix this bug by trying to allocate the space again if there are block groups
in caching.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2011-09-09 17:34:35 +08:00
|
|
|
if (!ins->objectid && loop >= LOOP_CACHING_WAIT && have_caching_bg)
|
|
|
|
goto search;
|
|
|
|
|
2010-05-16 22:46:24 +08:00
|
|
|
if (!ins->objectid && ++index < BTRFS_NR_RAID_TYPES)
|
|
|
|
goto search;
|
|
|
|
|
2012-01-14 04:27:45 +08:00
|
|
|
/*
|
Btrfs: find ideal block group for caching
This patch changes a few things. Hopefully the comments are helpfull, but
I'll try and be as verbose here.
Problem:
My fedora box was taking 1 minute and 21 seconds to boot with btrfs as root.
Part of this problem was we pick the first block group we can find and start
caching it, even if it may not have enough free space. The other problem is
we only search for cached block groups the first time around, which we won't
find any cached block groups because this is a newly mounted fs, so we end up
caching several block groups during bootup, which with alot of fragmentation
takes around 30-45 seconds to complete, which bogs down the system. So
Solution:
1) Don't cache block groups willy-nilly at first. Instead try and figure out
which block group has the most free, and therefore will take the least amount
of time to cache.
2) Don't be so picky about cached block groups. The other problem is once
we've filled up a cluster, if the block group isn't finished caching the next
time we try and do the allocation we'll completely ignore the cluster and
start searching from the beginning of the space, which makes us cache more
block groups, which slows us down even more. So instead of skipping block
groups that are not finished caching when we have a hint, only skip the block
group if it hasn't started caching yet.
There is one other tweak in here. Before if we allocated a chunk and still
couldn't find new space, we'd end up switching the space info to force another
chunk allocation. This could make us end up with way too many chunks, so keep
track of this particular case.
With this patch and my previous cluster fixes my fedora box now boots in 43
seconds, and according to the bootchart is not held up by our block group
caching at all.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-11-11 10:23:48 +08:00
|
|
|
* LOOP_CACHING_NOWAIT, search partially cached block groups, kicking
|
|
|
|
* caching kthreads as we move along
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
* LOOP_CACHING_WAIT, search everything, and wait if our bg is caching
|
|
|
|
* LOOP_ALLOC_CHUNK, force a chunk allocation and try again
|
|
|
|
* LOOP_NO_EMPTY_SIZE, set empty_size and empty_cluster to 0 and try
|
|
|
|
* again
|
2009-04-03 21:47:43 +08:00
|
|
|
*/
|
2011-05-28 04:11:38 +08:00
|
|
|
if (!ins->objectid && loop < LOOP_NO_EMPTY_SIZE) {
|
2010-05-16 22:46:24 +08:00
|
|
|
index = 0;
|
2011-05-28 04:11:38 +08:00
|
|
|
loop++;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
if (loop == LOOP_ALLOC_CHUNK) {
|
2012-09-13 02:08:47 +08:00
|
|
|
ret = do_chunk_alloc(trans, root, data,
|
2012-09-12 04:57:25 +08:00
|
|
|
CHUNK_ALLOC_FORCE);
|
|
|
|
/*
|
|
|
|
* Do not bail out on ENOSPC since we
|
|
|
|
* can do more things.
|
|
|
|
*/
|
|
|
|
if (ret < 0 && ret != -ENOSPC) {
|
|
|
|
btrfs_abort_transaction(trans,
|
|
|
|
root, ret);
|
|
|
|
goto out;
|
2011-05-28 04:11:38 +08:00
|
|
|
}
|
2009-04-03 22:14:19 +08:00
|
|
|
}
|
|
|
|
|
2011-05-28 04:11:38 +08:00
|
|
|
if (loop == LOOP_NO_EMPTY_SIZE) {
|
|
|
|
empty_size = 0;
|
|
|
|
empty_cluster = 0;
|
2009-04-03 21:47:43 +08:00
|
|
|
}
|
2011-05-28 04:11:38 +08:00
|
|
|
|
|
|
|
goto search;
|
2009-04-03 22:14:19 +08:00
|
|
|
} else if (!ins->objectid) {
|
|
|
|
ret = -ENOSPC;
|
2011-05-12 03:26:06 +08:00
|
|
|
} else if (ins->objectid) {
|
Btrfs: fix enospc when there is plenty of space
So there is an odd case where we can possibly return -ENOSPC when there is in
fact space to be had. It only happens with Metadata writes, and happens _very_
infrequently. What has to happen is we have to allocate have allocated out of
the first logical byte on the disk, which would set last_alloc to
first_logical_byte(root, 0), so search_start == orig_search_start. We then
need to allocate for normal metadata, so BTRFS_BLOCK_GROUP_METADATA |
BTRFS_BLOCK_GROUP_DUP. We will do a block lookup for the given search_start,
block_group_bits() won't match and we'll go to choose another block group.
However because search_start matches orig_search_start we go to see if we can
allocate a chunk.
If we are in the situation that we cannot allocate a chunk, we fail and ENOSPC.
This is kind of a big flaw of the way find_free_extent works, as it along with
find_free_space loop through _all_ of the block groups, not just the ones that
we want to allocate out of. This patch completely kills find_free_space and
rolls it into find_free_extent. I've introduced a sort of state machine into
this, which will make it easier to get cache miss information out of the
allocator, and will work well with my locking changes.
The basic flow is this: We have the variable loop which is 0, meaning we are
in the hint phase. We lookup the block group for the hint, and lookup the
space_info for what we want to allocate out of. If the block group we were
pointed at by the hint either isn't of the correct type, or just doesn't have
the space we need, we set head to space_info->block_groups, so we start at the
beginning of the block groups for this particular space info, and loop through.
This is also where we add the empty_cluster to total_needed. At this point
loop is set to 1 and we just loop through all of the block groups for this
particular space_info looking for the space we need, just as find_free_space
would have done, except we only hit the block groups we want and not _all_ of
the block groups. If we come full circle we see if we can allocate a chunk.
If we cannot of course we exit with -ENOSPC and we are good. If not we start
over at space_info->block_groups and loop through again, with loop == 2. If we
come full circle and haven't found what we need then we exit with -ENOSPC.
I've been running this for a couple of days now and it seems stable, and I
haven't yet hit a -ENOSPC when there was plenty of space left.
Also I've made a groups_sem to handle the group list for the space_info. This
is part of my locking changes, but is relatively safe and seems better than
holding the space_info spinlock over that entire search time. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
ret = 0;
|
2007-05-06 22:15:01 +08:00
|
|
|
}
|
2012-03-12 23:03:00 +08:00
|
|
|
out:
|
2007-05-06 22:15:01 +08:00
|
|
|
|
2007-03-01 05:46:22 +08:00
|
|
|
return ret;
|
2007-02-26 23:40:21 +08:00
|
|
|
}
|
2008-04-29 03:29:52 +08:00
|
|
|
|
2009-09-12 04:12:44 +08:00
|
|
|
static void dump_space_info(struct btrfs_space_info *info, u64 bytes,
|
|
|
|
int dump_block_groups)
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
{
|
|
|
|
struct btrfs_block_group_cache *cache;
|
2010-05-16 22:46:24 +08:00
|
|
|
int index = 0;
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
|
2009-09-12 04:12:44 +08:00
|
|
|
spin_lock(&info->lock);
|
2011-07-27 05:00:46 +08:00
|
|
|
printk(KERN_INFO "space_info %llu has %llu free, is %sfull\n",
|
|
|
|
(unsigned long long)info->flags,
|
2009-01-06 10:25:51 +08:00
|
|
|
(unsigned long long)(info->total_bytes - info->bytes_used -
|
2009-09-12 04:12:44 +08:00
|
|
|
info->bytes_pinned - info->bytes_reserved -
|
2010-05-16 22:49:58 +08:00
|
|
|
info->bytes_readonly),
|
2009-01-06 10:25:51 +08:00
|
|
|
(info->full) ? "" : "not ");
|
2010-05-16 22:49:58 +08:00
|
|
|
printk(KERN_INFO "space_info total=%llu, used=%llu, pinned=%llu, "
|
|
|
|
"reserved=%llu, may_use=%llu, readonly=%llu\n",
|
2009-04-22 03:38:29 +08:00
|
|
|
(unsigned long long)info->total_bytes,
|
2010-05-16 22:49:58 +08:00
|
|
|
(unsigned long long)info->bytes_used,
|
2009-04-22 03:38:29 +08:00
|
|
|
(unsigned long long)info->bytes_pinned,
|
2010-05-16 22:49:58 +08:00
|
|
|
(unsigned long long)info->bytes_reserved,
|
2009-04-22 03:38:29 +08:00
|
|
|
(unsigned long long)info->bytes_may_use,
|
2010-05-16 22:49:58 +08:00
|
|
|
(unsigned long long)info->bytes_readonly);
|
2009-09-12 04:12:44 +08:00
|
|
|
spin_unlock(&info->lock);
|
|
|
|
|
|
|
|
if (!dump_block_groups)
|
|
|
|
return;
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
|
Btrfs: fix enospc when there is plenty of space
So there is an odd case where we can possibly return -ENOSPC when there is in
fact space to be had. It only happens with Metadata writes, and happens _very_
infrequently. What has to happen is we have to allocate have allocated out of
the first logical byte on the disk, which would set last_alloc to
first_logical_byte(root, 0), so search_start == orig_search_start. We then
need to allocate for normal metadata, so BTRFS_BLOCK_GROUP_METADATA |
BTRFS_BLOCK_GROUP_DUP. We will do a block lookup for the given search_start,
block_group_bits() won't match and we'll go to choose another block group.
However because search_start matches orig_search_start we go to see if we can
allocate a chunk.
If we are in the situation that we cannot allocate a chunk, we fail and ENOSPC.
This is kind of a big flaw of the way find_free_extent works, as it along with
find_free_space loop through _all_ of the block groups, not just the ones that
we want to allocate out of. This patch completely kills find_free_space and
rolls it into find_free_extent. I've introduced a sort of state machine into
this, which will make it easier to get cache miss information out of the
allocator, and will work well with my locking changes.
The basic flow is this: We have the variable loop which is 0, meaning we are
in the hint phase. We lookup the block group for the hint, and lookup the
space_info for what we want to allocate out of. If the block group we were
pointed at by the hint either isn't of the correct type, or just doesn't have
the space we need, we set head to space_info->block_groups, so we start at the
beginning of the block groups for this particular space info, and loop through.
This is also where we add the empty_cluster to total_needed. At this point
loop is set to 1 and we just loop through all of the block groups for this
particular space_info looking for the space we need, just as find_free_space
would have done, except we only hit the block groups we want and not _all_ of
the block groups. If we come full circle we see if we can allocate a chunk.
If we cannot of course we exit with -ENOSPC and we are good. If not we start
over at space_info->block_groups and loop through again, with loop == 2. If we
come full circle and haven't found what we need then we exit with -ENOSPC.
I've been running this for a couple of days now and it seems stable, and I
haven't yet hit a -ENOSPC when there was plenty of space left.
Also I've made a groups_sem to handle the group list for the space_info. This
is part of my locking changes, but is relatively safe and seems better than
holding the space_info spinlock over that entire search time. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
down_read(&info->groups_sem);
|
2010-05-16 22:46:24 +08:00
|
|
|
again:
|
|
|
|
list_for_each_entry(cache, &info->block_groups[index], list) {
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
spin_lock(&cache->lock);
|
2012-07-06 17:31:35 +08:00
|
|
|
printk(KERN_INFO "block group %llu has %llu bytes, %llu used %llu pinned %llu reserved %s\n",
|
2009-01-06 10:25:51 +08:00
|
|
|
(unsigned long long)cache->key.objectid,
|
|
|
|
(unsigned long long)cache->key.offset,
|
|
|
|
(unsigned long long)btrfs_block_group_used(&cache->item),
|
|
|
|
(unsigned long long)cache->pinned,
|
2012-07-06 17:31:35 +08:00
|
|
|
(unsigned long long)cache->reserved,
|
|
|
|
cache->ro ? "[readonly]" : "");
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
btrfs_dump_free_space(cache, bytes);
|
|
|
|
spin_unlock(&cache->lock);
|
|
|
|
}
|
2010-05-16 22:46:24 +08:00
|
|
|
if (++index < BTRFS_NR_RAID_TYPES)
|
|
|
|
goto again;
|
Btrfs: fix enospc when there is plenty of space
So there is an odd case where we can possibly return -ENOSPC when there is in
fact space to be had. It only happens with Metadata writes, and happens _very_
infrequently. What has to happen is we have to allocate have allocated out of
the first logical byte on the disk, which would set last_alloc to
first_logical_byte(root, 0), so search_start == orig_search_start. We then
need to allocate for normal metadata, so BTRFS_BLOCK_GROUP_METADATA |
BTRFS_BLOCK_GROUP_DUP. We will do a block lookup for the given search_start,
block_group_bits() won't match and we'll go to choose another block group.
However because search_start matches orig_search_start we go to see if we can
allocate a chunk.
If we are in the situation that we cannot allocate a chunk, we fail and ENOSPC.
This is kind of a big flaw of the way find_free_extent works, as it along with
find_free_space loop through _all_ of the block groups, not just the ones that
we want to allocate out of. This patch completely kills find_free_space and
rolls it into find_free_extent. I've introduced a sort of state machine into
this, which will make it easier to get cache miss information out of the
allocator, and will work well with my locking changes.
The basic flow is this: We have the variable loop which is 0, meaning we are
in the hint phase. We lookup the block group for the hint, and lookup the
space_info for what we want to allocate out of. If the block group we were
pointed at by the hint either isn't of the correct type, or just doesn't have
the space we need, we set head to space_info->block_groups, so we start at the
beginning of the block groups for this particular space info, and loop through.
This is also where we add the empty_cluster to total_needed. At this point
loop is set to 1 and we just loop through all of the block groups for this
particular space_info looking for the space we need, just as find_free_space
would have done, except we only hit the block groups we want and not _all_ of
the block groups. If we come full circle we see if we can allocate a chunk.
If we cannot of course we exit with -ENOSPC and we are good. If not we start
over at space_info->block_groups and loop through again, with loop == 2. If we
come full circle and haven't found what we need then we exit with -ENOSPC.
I've been running this for a couple of days now and it seems stable, and I
haven't yet hit a -ENOSPC when there was plenty of space left.
Also I've made a groups_sem to handle the group list for the space_info. This
is part of my locking changes, but is relatively safe and seems better than
holding the space_info spinlock over that entire search time. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
up_read(&info->groups_sem);
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
}
|
2008-09-26 22:05:48 +08:00
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
int btrfs_reserve_extent(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
u64 num_bytes, u64 min_alloc_size,
|
|
|
|
u64 empty_size, u64 hint_byte,
|
2012-01-18 23:56:06 +08:00
|
|
|
struct btrfs_key *ins, u64 data)
|
2007-02-26 23:40:21 +08:00
|
|
|
{
|
Btrfs: fix enospc error caused by wrong checks of the chunk
When we did sysbench test for inline files, enospc error happened easily though
there was lots of free disk space which could be allocated for new chunks.
Reproduce steps:
# mkfs.btrfs -b $((2 * 1024 * 1024 * 1024)) <test partition>
# mount <test partition> /mnt
# ulimit -n 102400
# cd /mnt
# sysbench --num-threads=1 --test=fileio --file-num=81920 \
> --file-total-size=80M --file-block-size=1K --file-io-mode=sync \
> --file-test-mode=seqwr prepare
# sysbench --num-threads=1 --test=fileio --file-num=81920 \
> --file-total-size=80M --file-block-size=1K --file-io-mode=sync \
> --file-test-mode=seqwr run
<soon later, BUG_ON() was triggered by enospc error>
The reason of this bug is:
Now, we can reserve space which is larger than the free space in the chunks if
we have enough free disk space which can be used for new chunks. By this way,
the space allocator should allocate a new chunk by force if there is no free
space in the free space cache. But there are two wrong checks which break this
operation.
One is
if (ret == -ENOSPC && num_bytes > min_alloc_size)
in btrfs_reserve_extent(), it is wrong, we should try to allocate a new chunk
even we fail to allocate free space by minimum allocable size.
The other is
if (space_info->force_alloc)
force = space_info->force_alloc;
in do_chunk_alloc(). It makes the allocator ignore CHUNK_ALLOC_FORCE If someone
sets ->force_alloc to CHUNK_ALLOC_LIMITED, and makes the enospc error happen.
Fix these two wrong checks. Especially the second one, we fix it by changing
the value of CHUNK_ALLOC_LIMITED and CHUNK_ALLOC_FORCE, and make
CHUNK_ALLOC_FORCE greater than CHUNK_ALLOC_LIMITED since CHUNK_ALLOC_FORCE has
higher priority. And if the value which is passed in by the caller is greater
than ->force_alloc, use the passed value.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-01-27 04:01:12 +08:00
|
|
|
bool final_tried = false;
|
2007-02-26 23:40:21 +08:00
|
|
|
int ret;
|
2008-06-26 04:01:30 +08:00
|
|
|
|
2009-02-21 00:00:09 +08:00
|
|
|
data = btrfs_get_alloc_profile(root, data);
|
2008-04-14 21:46:10 +08:00
|
|
|
again:
|
2007-10-16 04:15:53 +08:00
|
|
|
WARN_ON(num_bytes < root->sectorsize);
|
|
|
|
ret = find_free_extent(trans, root, num_bytes, empty_size,
|
2012-01-18 23:56:06 +08:00
|
|
|
hint_byte, ins, data);
|
2008-04-17 23:29:12 +08:00
|
|
|
|
Btrfs: fix enospc error caused by wrong checks of the chunk
When we did sysbench test for inline files, enospc error happened easily though
there was lots of free disk space which could be allocated for new chunks.
Reproduce steps:
# mkfs.btrfs -b $((2 * 1024 * 1024 * 1024)) <test partition>
# mount <test partition> /mnt
# ulimit -n 102400
# cd /mnt
# sysbench --num-threads=1 --test=fileio --file-num=81920 \
> --file-total-size=80M --file-block-size=1K --file-io-mode=sync \
> --file-test-mode=seqwr prepare
# sysbench --num-threads=1 --test=fileio --file-num=81920 \
> --file-total-size=80M --file-block-size=1K --file-io-mode=sync \
> --file-test-mode=seqwr run
<soon later, BUG_ON() was triggered by enospc error>
The reason of this bug is:
Now, we can reserve space which is larger than the free space in the chunks if
we have enough free disk space which can be used for new chunks. By this way,
the space allocator should allocate a new chunk by force if there is no free
space in the free space cache. But there are two wrong checks which break this
operation.
One is
if (ret == -ENOSPC && num_bytes > min_alloc_size)
in btrfs_reserve_extent(), it is wrong, we should try to allocate a new chunk
even we fail to allocate free space by minimum allocable size.
The other is
if (space_info->force_alloc)
force = space_info->force_alloc;
in do_chunk_alloc(). It makes the allocator ignore CHUNK_ALLOC_FORCE If someone
sets ->force_alloc to CHUNK_ALLOC_LIMITED, and makes the enospc error happen.
Fix these two wrong checks. Especially the second one, we fix it by changing
the value of CHUNK_ALLOC_LIMITED and CHUNK_ALLOC_FORCE, and make
CHUNK_ALLOC_FORCE greater than CHUNK_ALLOC_LIMITED since CHUNK_ALLOC_FORCE has
higher priority. And if the value which is passed in by the caller is greater
than ->force_alloc, use the passed value.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-01-27 04:01:12 +08:00
|
|
|
if (ret == -ENOSPC) {
|
|
|
|
if (!final_tried) {
|
|
|
|
num_bytes = num_bytes >> 1;
|
2012-11-16 08:04:43 +08:00
|
|
|
num_bytes = round_down(num_bytes, root->sectorsize);
|
Btrfs: fix enospc error caused by wrong checks of the chunk
When we did sysbench test for inline files, enospc error happened easily though
there was lots of free disk space which could be allocated for new chunks.
Reproduce steps:
# mkfs.btrfs -b $((2 * 1024 * 1024 * 1024)) <test partition>
# mount <test partition> /mnt
# ulimit -n 102400
# cd /mnt
# sysbench --num-threads=1 --test=fileio --file-num=81920 \
> --file-total-size=80M --file-block-size=1K --file-io-mode=sync \
> --file-test-mode=seqwr prepare
# sysbench --num-threads=1 --test=fileio --file-num=81920 \
> --file-total-size=80M --file-block-size=1K --file-io-mode=sync \
> --file-test-mode=seqwr run
<soon later, BUG_ON() was triggered by enospc error>
The reason of this bug is:
Now, we can reserve space which is larger than the free space in the chunks if
we have enough free disk space which can be used for new chunks. By this way,
the space allocator should allocate a new chunk by force if there is no free
space in the free space cache. But there are two wrong checks which break this
operation.
One is
if (ret == -ENOSPC && num_bytes > min_alloc_size)
in btrfs_reserve_extent(), it is wrong, we should try to allocate a new chunk
even we fail to allocate free space by minimum allocable size.
The other is
if (space_info->force_alloc)
force = space_info->force_alloc;
in do_chunk_alloc(). It makes the allocator ignore CHUNK_ALLOC_FORCE If someone
sets ->force_alloc to CHUNK_ALLOC_LIMITED, and makes the enospc error happen.
Fix these two wrong checks. Especially the second one, we fix it by changing
the value of CHUNK_ALLOC_LIMITED and CHUNK_ALLOC_FORCE, and make
CHUNK_ALLOC_FORCE greater than CHUNK_ALLOC_LIMITED since CHUNK_ALLOC_FORCE has
higher priority. And if the value which is passed in by the caller is greater
than ->force_alloc, use the passed value.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-01-27 04:01:12 +08:00
|
|
|
num_bytes = max(num_bytes, min_alloc_size);
|
|
|
|
if (num_bytes == min_alloc_size)
|
|
|
|
final_tried = true;
|
|
|
|
goto again;
|
|
|
|
} else if (btrfs_test_opt(root, ENOSPC_DEBUG)) {
|
|
|
|
struct btrfs_space_info *sinfo;
|
|
|
|
|
|
|
|
sinfo = __find_space_info(root->fs_info, data);
|
|
|
|
printk(KERN_ERR "btrfs allocation failed flags %llu, "
|
|
|
|
"wanted %llu\n", (unsigned long long)data,
|
|
|
|
(unsigned long long)num_bytes);
|
2012-03-01 21:56:28 +08:00
|
|
|
if (sinfo)
|
|
|
|
dump_space_info(sinfo, num_bytes, 1);
|
Btrfs: fix enospc error caused by wrong checks of the chunk
When we did sysbench test for inline files, enospc error happened easily though
there was lots of free disk space which could be allocated for new chunks.
Reproduce steps:
# mkfs.btrfs -b $((2 * 1024 * 1024 * 1024)) <test partition>
# mount <test partition> /mnt
# ulimit -n 102400
# cd /mnt
# sysbench --num-threads=1 --test=fileio --file-num=81920 \
> --file-total-size=80M --file-block-size=1K --file-io-mode=sync \
> --file-test-mode=seqwr prepare
# sysbench --num-threads=1 --test=fileio --file-num=81920 \
> --file-total-size=80M --file-block-size=1K --file-io-mode=sync \
> --file-test-mode=seqwr run
<soon later, BUG_ON() was triggered by enospc error>
The reason of this bug is:
Now, we can reserve space which is larger than the free space in the chunks if
we have enough free disk space which can be used for new chunks. By this way,
the space allocator should allocate a new chunk by force if there is no free
space in the free space cache. But there are two wrong checks which break this
operation.
One is
if (ret == -ENOSPC && num_bytes > min_alloc_size)
in btrfs_reserve_extent(), it is wrong, we should try to allocate a new chunk
even we fail to allocate free space by minimum allocable size.
The other is
if (space_info->force_alloc)
force = space_info->force_alloc;
in do_chunk_alloc(). It makes the allocator ignore CHUNK_ALLOC_FORCE If someone
sets ->force_alloc to CHUNK_ALLOC_LIMITED, and makes the enospc error happen.
Fix these two wrong checks. Especially the second one, we fix it by changing
the value of CHUNK_ALLOC_LIMITED and CHUNK_ALLOC_FORCE, and make
CHUNK_ALLOC_FORCE greater than CHUNK_ALLOC_LIMITED since CHUNK_ALLOC_FORCE has
higher priority. And if the value which is passed in by the caller is greater
than ->force_alloc, use the passed value.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-01-27 04:01:12 +08:00
|
|
|
}
|
2008-06-26 04:01:30 +08:00
|
|
|
}
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
|
Btrfs: add initial tracepoint support for btrfs
Tracepoints can provide insight into why btrfs hits bugs and be greatly
helpful for debugging, e.g
dd-7822 [000] 2121.641088: btrfs_inode_request: root = 5(FS_TREE), gen = 4, ino = 256, blocks = 8, disk_i_size = 0, last_trans = 8, logged_trans = 0
dd-7822 [000] 2121.641100: btrfs_inode_new: root = 5(FS_TREE), gen = 8, ino = 257, blocks = 0, disk_i_size = 0, last_trans = 0, logged_trans = 0
btrfs-transacti-7804 [001] 2146.935420: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29368320 (orig_level = 0), cow_buf = 29388800 (cow_level = 0)
btrfs-transacti-7804 [001] 2146.935473: btrfs_cow_block: root = 1(ROOT_TREE), refs = 2, orig_buf = 29364224 (orig_level = 0), cow_buf = 29392896 (cow_level = 0)
btrfs-transacti-7804 [001] 2146.972221: btrfs_transaction_commit: root = 1(ROOT_TREE), gen = 8
flush-btrfs-2-7821 [001] 2155.824210: btrfs_chunk_alloc: root = 3(CHUNK_TREE), offset = 1103101952, size = 1073741824, num_stripes = 1, sub_stripes = 0, type = DATA
flush-btrfs-2-7821 [001] 2155.824241: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29388800 (orig_level = 0), cow_buf = 29396992 (cow_level = 0)
flush-btrfs-2-7821 [001] 2155.824255: btrfs_cow_block: root = 4(DEV_TREE), refs = 2, orig_buf = 29372416 (orig_level = 0), cow_buf = 29401088 (cow_level = 0)
flush-btrfs-2-7821 [000] 2155.824329: btrfs_cow_block: root = 3(CHUNK_TREE), refs = 2, orig_buf = 20971520 (orig_level = 0), cow_buf = 20975616 (cow_level = 0)
btrfs-endio-wri-7800 [001] 2155.898019: btrfs_cow_block: root = 5(FS_TREE), refs = 2, orig_buf = 29384704 (orig_level = 0), cow_buf = 29405184 (cow_level = 0)
btrfs-endio-wri-7800 [001] 2155.898043: btrfs_cow_block: root = 7(CSUM_TREE), refs = 2, orig_buf = 29376512 (orig_level = 0), cow_buf = 29409280 (cow_level = 0)
Here is what I have added:
1) ordere_extent:
btrfs_ordered_extent_add
btrfs_ordered_extent_remove
btrfs_ordered_extent_start
btrfs_ordered_extent_put
These provide critical information to understand how ordered_extents are
updated.
2) extent_map:
btrfs_get_extent
extent_map is used in both read and write cases, and it is useful for tracking
how btrfs specific IO is running.
3) writepage:
__extent_writepage
btrfs_writepage_end_io_hook
Pages are cirtical resourses and produce a lot of corner cases during writeback,
so it is valuable to know how page is written to disk.
4) inode:
btrfs_inode_new
btrfs_inode_request
btrfs_inode_evict
These can show where and when a inode is created, when a inode is evicted.
5) sync:
btrfs_sync_file
btrfs_sync_fs
These show sync arguments.
6) transaction:
btrfs_transaction_commit
In transaction based filesystem, it will be useful to know the generation and
who does commit.
7) back reference and cow:
btrfs_delayed_tree_ref
btrfs_delayed_data_ref
btrfs_delayed_ref_head
btrfs_cow_block
Btrfs natively supports back references, these tracepoints are helpful on
understanding btrfs's COW mechanism.
8) chunk:
btrfs_chunk_alloc
btrfs_chunk_free
Chunk is a link between physical offset and logical offset, and stands for space
infomation in btrfs, and these are helpful on tracing space things.
9) reserved_extent:
btrfs_reserved_extent_alloc
btrfs_reserved_extent_free
These can show how btrfs uses its space.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-24 19:18:59 +08:00
|
|
|
trace_btrfs_reserved_extent_alloc(root, ins->objectid, ins->offset);
|
|
|
|
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
return ret;
|
2008-07-18 00:53:50 +08:00
|
|
|
}
|
|
|
|
|
2011-11-01 08:52:39 +08:00
|
|
|
static int __btrfs_free_reserved_extent(struct btrfs_root *root,
|
|
|
|
u64 start, u64 len, int pin)
|
2008-08-02 03:11:20 +08:00
|
|
|
{
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
struct btrfs_block_group_cache *cache;
|
Btrfs: Fix free block discard calls down to the block layer
This is a patch to fix discard semantic to make Btrfs work with FTL and SSD.
We can improve FTL's performance by telling it which sectors are freed by file
system. But if we don't tell FTL the information of free sectors in proper
time, the transaction mechanism of Btrfs will be destroyed and Btrfs could not
roll back the previous transaction under the power loss condition.
There are some problems in the old implementation:
1, In __free_extent(), the pinned down extents should not be discarded.
2, In free_extents(), the free extents are all pinned, so they need to
be discarded in transaction committing time instead of free_extents().
3, The reserved extent used by log tree should be discard too.
This patch change discard behavior as follows:
1, For the extents which need to be free at once,
we discard them in update_block_group().
2, Delay discarding the pinned extent in btrfs_finish_extent_commit()
when committing transaction.
3, Remove discarding from free_extents() and __free_extent()
4, Add discard interface into btrfs_free_reserved_extent()
5, Discard sectors before updating the free space cache, otherwise,
FTL will destroy file system data.
2009-01-06 04:57:51 +08:00
|
|
|
int ret = 0;
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
|
|
|
|
cache = btrfs_lookup_block_group(root->fs_info, start);
|
|
|
|
if (!cache) {
|
2009-01-06 10:25:51 +08:00
|
|
|
printk(KERN_ERR "Unable to find block group for %llu\n",
|
|
|
|
(unsigned long long)start);
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
return -ENOSPC;
|
|
|
|
}
|
Btrfs: Fix free block discard calls down to the block layer
This is a patch to fix discard semantic to make Btrfs work with FTL and SSD.
We can improve FTL's performance by telling it which sectors are freed by file
system. But if we don't tell FTL the information of free sectors in proper
time, the transaction mechanism of Btrfs will be destroyed and Btrfs could not
roll back the previous transaction under the power loss condition.
There are some problems in the old implementation:
1, In __free_extent(), the pinned down extents should not be discarded.
2, In free_extents(), the free extents are all pinned, so they need to
be discarded in transaction committing time instead of free_extents().
3, The reserved extent used by log tree should be discard too.
This patch change discard behavior as follows:
1, For the extents which need to be free at once,
we discard them in update_block_group().
2, Delay discarding the pinned extent in btrfs_finish_extent_commit()
when committing transaction.
3, Remove discarding from free_extents() and __free_extent()
4, Add discard interface into btrfs_free_reserved_extent()
5, Discard sectors before updating the free space cache, otherwise,
FTL will destroy file system data.
2009-01-06 04:57:51 +08:00
|
|
|
|
2011-03-24 18:24:27 +08:00
|
|
|
if (btrfs_test_opt(root, DISCARD))
|
|
|
|
ret = btrfs_discard_extent(root, start, len, NULL);
|
Btrfs: Fix free block discard calls down to the block layer
This is a patch to fix discard semantic to make Btrfs work with FTL and SSD.
We can improve FTL's performance by telling it which sectors are freed by file
system. But if we don't tell FTL the information of free sectors in proper
time, the transaction mechanism of Btrfs will be destroyed and Btrfs could not
roll back the previous transaction under the power loss condition.
There are some problems in the old implementation:
1, In __free_extent(), the pinned down extents should not be discarded.
2, In free_extents(), the free extents are all pinned, so they need to
be discarded in transaction committing time instead of free_extents().
3, The reserved extent used by log tree should be discard too.
This patch change discard behavior as follows:
1, For the extents which need to be free at once,
we discard them in update_block_group().
2, Delay discarding the pinned extent in btrfs_finish_extent_commit()
when committing transaction.
3, Remove discarding from free_extents() and __free_extent()
4, Add discard interface into btrfs_free_reserved_extent()
5, Discard sectors before updating the free space cache, otherwise,
FTL will destroy file system data.
2009-01-06 04:57:51 +08:00
|
|
|
|
2011-11-01 08:52:39 +08:00
|
|
|
if (pin)
|
|
|
|
pin_down_extent(root, cache, start, len, 1);
|
|
|
|
else {
|
|
|
|
btrfs_add_free_space(cache, start, len);
|
|
|
|
btrfs_update_reserved_bytes(cache, len, RESERVE_FREE);
|
|
|
|
}
|
2009-04-03 21:47:43 +08:00
|
|
|
btrfs_put_block_group(cache);
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
|
Btrfs: add initial tracepoint support for btrfs
Tracepoints can provide insight into why btrfs hits bugs and be greatly
helpful for debugging, e.g
dd-7822 [000] 2121.641088: btrfs_inode_request: root = 5(FS_TREE), gen = 4, ino = 256, blocks = 8, disk_i_size = 0, last_trans = 8, logged_trans = 0
dd-7822 [000] 2121.641100: btrfs_inode_new: root = 5(FS_TREE), gen = 8, ino = 257, blocks = 0, disk_i_size = 0, last_trans = 0, logged_trans = 0
btrfs-transacti-7804 [001] 2146.935420: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29368320 (orig_level = 0), cow_buf = 29388800 (cow_level = 0)
btrfs-transacti-7804 [001] 2146.935473: btrfs_cow_block: root = 1(ROOT_TREE), refs = 2, orig_buf = 29364224 (orig_level = 0), cow_buf = 29392896 (cow_level = 0)
btrfs-transacti-7804 [001] 2146.972221: btrfs_transaction_commit: root = 1(ROOT_TREE), gen = 8
flush-btrfs-2-7821 [001] 2155.824210: btrfs_chunk_alloc: root = 3(CHUNK_TREE), offset = 1103101952, size = 1073741824, num_stripes = 1, sub_stripes = 0, type = DATA
flush-btrfs-2-7821 [001] 2155.824241: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29388800 (orig_level = 0), cow_buf = 29396992 (cow_level = 0)
flush-btrfs-2-7821 [001] 2155.824255: btrfs_cow_block: root = 4(DEV_TREE), refs = 2, orig_buf = 29372416 (orig_level = 0), cow_buf = 29401088 (cow_level = 0)
flush-btrfs-2-7821 [000] 2155.824329: btrfs_cow_block: root = 3(CHUNK_TREE), refs = 2, orig_buf = 20971520 (orig_level = 0), cow_buf = 20975616 (cow_level = 0)
btrfs-endio-wri-7800 [001] 2155.898019: btrfs_cow_block: root = 5(FS_TREE), refs = 2, orig_buf = 29384704 (orig_level = 0), cow_buf = 29405184 (cow_level = 0)
btrfs-endio-wri-7800 [001] 2155.898043: btrfs_cow_block: root = 7(CSUM_TREE), refs = 2, orig_buf = 29376512 (orig_level = 0), cow_buf = 29409280 (cow_level = 0)
Here is what I have added:
1) ordere_extent:
btrfs_ordered_extent_add
btrfs_ordered_extent_remove
btrfs_ordered_extent_start
btrfs_ordered_extent_put
These provide critical information to understand how ordered_extents are
updated.
2) extent_map:
btrfs_get_extent
extent_map is used in both read and write cases, and it is useful for tracking
how btrfs specific IO is running.
3) writepage:
__extent_writepage
btrfs_writepage_end_io_hook
Pages are cirtical resourses and produce a lot of corner cases during writeback,
so it is valuable to know how page is written to disk.
4) inode:
btrfs_inode_new
btrfs_inode_request
btrfs_inode_evict
These can show where and when a inode is created, when a inode is evicted.
5) sync:
btrfs_sync_file
btrfs_sync_fs
These show sync arguments.
6) transaction:
btrfs_transaction_commit
In transaction based filesystem, it will be useful to know the generation and
who does commit.
7) back reference and cow:
btrfs_delayed_tree_ref
btrfs_delayed_data_ref
btrfs_delayed_ref_head
btrfs_cow_block
Btrfs natively supports back references, these tracepoints are helpful on
understanding btrfs's COW mechanism.
8) chunk:
btrfs_chunk_alloc
btrfs_chunk_free
Chunk is a link between physical offset and logical offset, and stands for space
infomation in btrfs, and these are helpful on tracing space things.
9) reserved_extent:
btrfs_reserved_extent_alloc
btrfs_reserved_extent_free
These can show how btrfs uses its space.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-24 19:18:59 +08:00
|
|
|
trace_btrfs_reserved_extent_free(root, start, len);
|
|
|
|
|
2008-07-18 00:53:50 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2011-11-01 08:52:39 +08:00
|
|
|
int btrfs_free_reserved_extent(struct btrfs_root *root,
|
|
|
|
u64 start, u64 len)
|
|
|
|
{
|
|
|
|
return __btrfs_free_reserved_extent(root, start, len, 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
int btrfs_free_and_pin_reserved_extent(struct btrfs_root *root,
|
|
|
|
u64 start, u64 len)
|
|
|
|
{
|
|
|
|
return __btrfs_free_reserved_extent(root, start, len, 1);
|
|
|
|
}
|
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
static int alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
u64 parent, u64 root_objectid,
|
|
|
|
u64 flags, u64 owner, u64 offset,
|
|
|
|
struct btrfs_key *ins, int ref_mod)
|
2008-07-18 00:53:50 +08:00
|
|
|
{
|
|
|
|
int ret;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
struct btrfs_fs_info *fs_info = root->fs_info;
|
2008-07-18 00:53:50 +08:00
|
|
|
struct btrfs_extent_item *extent_item;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
struct btrfs_extent_inline_ref *iref;
|
2008-07-18 00:53:50 +08:00
|
|
|
struct btrfs_path *path;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
struct extent_buffer *leaf;
|
|
|
|
int type;
|
|
|
|
u32 size;
|
2007-08-09 08:17:12 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (parent > 0)
|
|
|
|
type = BTRFS_SHARED_DATA_REF_KEY;
|
|
|
|
else
|
|
|
|
type = BTRFS_EXTENT_DATA_REF_KEY;
|
2007-08-30 03:47:34 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
size = sizeof(*extent_item) + btrfs_extent_inline_ref_size(type);
|
2007-12-11 22:25:06 +08:00
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
2011-03-23 16:14:16 +08:00
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
2008-02-02 03:51:59 +08:00
|
|
|
|
2009-03-13 23:00:37 +08:00
|
|
|
path->leave_spinning = 1;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
ret = btrfs_insert_empty_item(trans, fs_info->extent_root, path,
|
|
|
|
ins, size);
|
2012-03-12 23:03:00 +08:00
|
|
|
if (ret) {
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return ret;
|
|
|
|
}
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
leaf = path->nodes[0];
|
|
|
|
extent_item = btrfs_item_ptr(leaf, path->slots[0],
|
2008-02-02 03:51:59 +08:00
|
|
|
struct btrfs_extent_item);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
btrfs_set_extent_refs(leaf, extent_item, ref_mod);
|
|
|
|
btrfs_set_extent_generation(leaf, extent_item, trans->transid);
|
|
|
|
btrfs_set_extent_flags(leaf, extent_item,
|
|
|
|
flags | BTRFS_EXTENT_FLAG_DATA);
|
|
|
|
|
|
|
|
iref = (struct btrfs_extent_inline_ref *)(extent_item + 1);
|
|
|
|
btrfs_set_extent_inline_ref_type(leaf, iref, type);
|
|
|
|
if (parent > 0) {
|
|
|
|
struct btrfs_shared_data_ref *ref;
|
|
|
|
ref = (struct btrfs_shared_data_ref *)(iref + 1);
|
|
|
|
btrfs_set_extent_inline_ref_offset(leaf, iref, parent);
|
|
|
|
btrfs_set_shared_data_ref_count(leaf, ref, ref_mod);
|
|
|
|
} else {
|
|
|
|
struct btrfs_extent_data_ref *ref;
|
|
|
|
ref = (struct btrfs_extent_data_ref *)(&iref->offset);
|
|
|
|
btrfs_set_extent_data_ref_root(leaf, ref, root_objectid);
|
|
|
|
btrfs_set_extent_data_ref_objectid(leaf, ref, owner);
|
|
|
|
btrfs_set_extent_data_ref_offset(leaf, ref, offset);
|
|
|
|
btrfs_set_extent_data_ref_count(leaf, ref, ref_mod);
|
|
|
|
}
|
2008-02-02 03:51:59 +08:00
|
|
|
|
|
|
|
btrfs_mark_buffer_dirty(path->nodes[0]);
|
2007-12-11 22:25:06 +08:00
|
|
|
btrfs_free_path(path);
|
2007-10-16 04:14:48 +08:00
|
|
|
|
2012-12-27 17:01:19 +08:00
|
|
|
ret = update_block_group(root, ins->objectid, ins->offset, 1);
|
2012-03-12 23:03:00 +08:00
|
|
|
if (ret) { /* -ENOENT, logic error */
|
2009-01-06 10:25:51 +08:00
|
|
|
printk(KERN_ERR "btrfs update block group failed for %llu "
|
|
|
|
"%llu\n", (unsigned long long)ins->objectid,
|
|
|
|
(unsigned long long)ins->offset);
|
2008-02-04 23:10:13 +08:00
|
|
|
BUG();
|
|
|
|
}
|
2008-07-18 00:53:50 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
static int alloc_reserved_tree_block(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
u64 parent, u64 root_objectid,
|
|
|
|
u64 flags, struct btrfs_disk_key *key,
|
|
|
|
int level, struct btrfs_key *ins)
|
2008-07-18 00:53:50 +08:00
|
|
|
{
|
|
|
|
int ret;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
struct btrfs_fs_info *fs_info = root->fs_info;
|
|
|
|
struct btrfs_extent_item *extent_item;
|
|
|
|
struct btrfs_tree_block_info *block_info;
|
|
|
|
struct btrfs_extent_inline_ref *iref;
|
|
|
|
struct btrfs_path *path;
|
|
|
|
struct extent_buffer *leaf;
|
|
|
|
u32 size = sizeof(*extent_item) + sizeof(*block_info) + sizeof(*iref);
|
2008-09-24 01:14:13 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
path = btrfs_alloc_path();
|
btrfs: don't BUG_ON btrfs_alloc_path() errors
This patch fixes many callers of btrfs_alloc_path() which BUG_ON allocation
failure. All the sites that are fixed in this patch were checked by me to
be fairly trivial to fix because of at least one of two criteria:
- Callers of the function catch errors from it already so bubbling the
error up will be handled.
- Callers of the function might BUG_ON any nonzero return code in which
case there is no behavior changed (but we still got to remove a BUG_ON)
The following functions were updated:
btrfs_lookup_extent, alloc_reserved_tree_block, btrfs_remove_block_group,
btrfs_lookup_csums_range, btrfs_csum_file_blocks, btrfs_mark_extent_written,
btrfs_inode_by_name, btrfs_new_inode, btrfs_symlink,
insert_reserved_file_extent, and run_delalloc_nocow
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2011-07-14 01:38:47 +08:00
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
2009-03-13 22:10:06 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
path->leave_spinning = 1;
|
|
|
|
ret = btrfs_insert_empty_item(trans, fs_info->extent_root, path,
|
|
|
|
ins, size);
|
2012-03-12 23:03:00 +08:00
|
|
|
if (ret) {
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return ret;
|
|
|
|
}
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
extent_item = btrfs_item_ptr(leaf, path->slots[0],
|
|
|
|
struct btrfs_extent_item);
|
|
|
|
btrfs_set_extent_refs(leaf, extent_item, 1);
|
|
|
|
btrfs_set_extent_generation(leaf, extent_item, trans->transid);
|
|
|
|
btrfs_set_extent_flags(leaf, extent_item,
|
|
|
|
flags | BTRFS_EXTENT_FLAG_TREE_BLOCK);
|
|
|
|
block_info = (struct btrfs_tree_block_info *)(extent_item + 1);
|
|
|
|
|
|
|
|
btrfs_set_tree_block_key(leaf, block_info, key);
|
|
|
|
btrfs_set_tree_block_level(leaf, block_info, level);
|
|
|
|
|
|
|
|
iref = (struct btrfs_extent_inline_ref *)(block_info + 1);
|
|
|
|
if (parent > 0) {
|
|
|
|
BUG_ON(!(flags & BTRFS_BLOCK_FLAG_FULL_BACKREF));
|
|
|
|
btrfs_set_extent_inline_ref_type(leaf, iref,
|
|
|
|
BTRFS_SHARED_BLOCK_REF_KEY);
|
|
|
|
btrfs_set_extent_inline_ref_offset(leaf, iref, parent);
|
|
|
|
} else {
|
|
|
|
btrfs_set_extent_inline_ref_type(leaf, iref,
|
|
|
|
BTRFS_TREE_BLOCK_REF_KEY);
|
|
|
|
btrfs_set_extent_inline_ref_offset(leaf, iref, root_objectid);
|
|
|
|
}
|
|
|
|
|
|
|
|
btrfs_mark_buffer_dirty(leaf);
|
|
|
|
btrfs_free_path(path);
|
|
|
|
|
2012-12-27 17:01:19 +08:00
|
|
|
ret = update_block_group(root, ins->objectid, ins->offset, 1);
|
2012-03-12 23:03:00 +08:00
|
|
|
if (ret) { /* -ENOENT, logic error */
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
printk(KERN_ERR "btrfs update block group failed for %llu "
|
|
|
|
"%llu\n", (unsigned long long)ins->objectid,
|
|
|
|
(unsigned long long)ins->offset);
|
|
|
|
BUG();
|
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
int btrfs_alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
u64 root_objectid, u64 owner,
|
|
|
|
u64 offset, struct btrfs_key *ins)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
BUG_ON(root_objectid == BTRFS_TREE_LOG_OBJECTID);
|
|
|
|
|
2011-09-12 21:26:38 +08:00
|
|
|
ret = btrfs_add_delayed_data_ref(root->fs_info, trans, ins->objectid,
|
|
|
|
ins->offset, 0,
|
|
|
|
root_objectid, owner, offset,
|
|
|
|
BTRFS_ADD_DELAYED_EXTENT, NULL, 0);
|
2008-07-18 00:53:50 +08:00
|
|
|
return ret;
|
|
|
|
}
|
2008-09-06 04:13:11 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* this is used by the tree logging recovery code. It records that
|
|
|
|
* an extent has been allocated and makes sure to clear the free
|
|
|
|
* space cache bits as well
|
|
|
|
*/
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
int btrfs_alloc_logged_file_extent(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
u64 root_objectid, u64 owner, u64 offset,
|
|
|
|
struct btrfs_key *ins)
|
2008-09-06 04:13:11 +08:00
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
struct btrfs_block_group_cache *block_group;
|
2009-09-12 04:11:19 +08:00
|
|
|
struct btrfs_caching_control *caching_ctl;
|
|
|
|
u64 start = ins->objectid;
|
|
|
|
u64 num_bytes = ins->offset;
|
2008-09-06 04:13:11 +08:00
|
|
|
|
|
|
|
block_group = btrfs_lookup_block_group(root->fs_info, ins->objectid);
|
2012-12-27 17:01:18 +08:00
|
|
|
cache_block_group(block_group, 0);
|
2009-09-12 04:11:19 +08:00
|
|
|
caching_ctl = get_caching_control(block_group);
|
2008-09-06 04:13:11 +08:00
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
if (!caching_ctl) {
|
|
|
|
BUG_ON(!block_group_cache_done(block_group));
|
|
|
|
ret = btrfs_remove_free_space(block_group, start, num_bytes);
|
2012-03-12 23:03:00 +08:00
|
|
|
BUG_ON(ret); /* -ENOMEM */
|
2009-09-12 04:11:19 +08:00
|
|
|
} else {
|
|
|
|
mutex_lock(&caching_ctl->mutex);
|
|
|
|
|
|
|
|
if (start >= caching_ctl->progress) {
|
|
|
|
ret = add_excluded_extent(root, start, num_bytes);
|
2012-03-12 23:03:00 +08:00
|
|
|
BUG_ON(ret); /* -ENOMEM */
|
2009-09-12 04:11:19 +08:00
|
|
|
} else if (start + num_bytes <= caching_ctl->progress) {
|
|
|
|
ret = btrfs_remove_free_space(block_group,
|
|
|
|
start, num_bytes);
|
2012-03-12 23:03:00 +08:00
|
|
|
BUG_ON(ret); /* -ENOMEM */
|
2009-09-12 04:11:19 +08:00
|
|
|
} else {
|
|
|
|
num_bytes = caching_ctl->progress - start;
|
|
|
|
ret = btrfs_remove_free_space(block_group,
|
|
|
|
start, num_bytes);
|
2012-03-12 23:03:00 +08:00
|
|
|
BUG_ON(ret); /* -ENOMEM */
|
2009-09-12 04:11:19 +08:00
|
|
|
|
|
|
|
start = caching_ctl->progress;
|
|
|
|
num_bytes = ins->objectid + ins->offset -
|
|
|
|
caching_ctl->progress;
|
|
|
|
ret = add_excluded_extent(root, start, num_bytes);
|
2012-03-12 23:03:00 +08:00
|
|
|
BUG_ON(ret); /* -ENOMEM */
|
2009-09-12 04:11:19 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
mutex_unlock(&caching_ctl->mutex);
|
|
|
|
put_caching_control(caching_ctl);
|
|
|
|
}
|
|
|
|
|
2011-07-27 05:00:46 +08:00
|
|
|
ret = btrfs_update_reserved_bytes(block_group, ins->offset,
|
|
|
|
RESERVE_ALLOC_NO_ACCOUNT);
|
2012-03-12 23:03:00 +08:00
|
|
|
BUG_ON(ret); /* logic error */
|
2009-04-03 21:47:43 +08:00
|
|
|
btrfs_put_block_group(block_group);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
ret = alloc_reserved_file_extent(trans, root, 0, root_objectid,
|
|
|
|
0, owner, offset, ins, 1);
|
2008-09-06 04:13:11 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2008-08-02 03:11:20 +08:00
|
|
|
struct extent_buffer *btrfs_init_new_buffer(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
2009-02-13 03:09:45 +08:00
|
|
|
u64 bytenr, u32 blocksize,
|
|
|
|
int level)
|
2008-08-02 03:11:20 +08:00
|
|
|
{
|
|
|
|
struct extent_buffer *buf;
|
|
|
|
|
|
|
|
buf = btrfs_find_create_tree_block(root, bytenr, blocksize);
|
|
|
|
if (!buf)
|
|
|
|
return ERR_PTR(-ENOMEM);
|
|
|
|
btrfs_set_header_generation(buf, trans->transid);
|
2011-07-27 04:11:19 +08:00
|
|
|
btrfs_set_buffer_lockdep_class(root->root_key.objectid, buf, level);
|
2008-08-02 03:11:20 +08:00
|
|
|
btrfs_tree_lock(buf);
|
|
|
|
clean_tree_block(trans, root, buf);
|
2012-03-10 05:01:49 +08:00
|
|
|
clear_bit(EXTENT_BUFFER_STALE, &buf->bflags);
|
Btrfs: Change btree locking to use explicit blocking points
Most of the btrfs metadata operations can be protected by a spinlock,
but some operations still need to schedule.
So far, btrfs has been using a mutex along with a trylock loop,
most of the time it is able to avoid going for the full mutex, so
the trylock loop is a big performance gain.
This commit is step one for getting rid of the blocking locks entirely.
btrfs_tree_lock takes a spinlock, and the code explicitly switches
to a blocking lock when it starts an operation that can schedule.
We'll be able get rid of the blocking locks in smaller pieces over time.
Tracing allows us to find the most common cause of blocking, so we
can start with the hot spots first.
The basic idea is:
btrfs_tree_lock() returns with the spin lock held
btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
the extent buffer flags, and then drops the spin lock. The buffer is
still considered locked by all of the btrfs code.
If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
the spin lock and waits on a wait queue for the blocking bit to go away.
Much of the code that needs to set the blocking bit finishes without actually
blocking a good percentage of the time. So, an adaptive spin is still
used against the blocking bit to avoid very high context switch rates.
btrfs_clear_lock_blocking() clears the blocking bit and returns
with the spinlock held again.
btrfs_tree_unlock() can be called on either blocking or spinning locks,
it does the right thing based on the blocking bit.
ctree.c has a helper function to set/clear all the locked buffers in a
path as blocking.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 22:25:08 +08:00
|
|
|
|
|
|
|
btrfs_set_lock_blocking(buf);
|
2008-08-02 03:11:20 +08:00
|
|
|
btrfs_set_buffer_uptodate(buf);
|
Btrfs: Change btree locking to use explicit blocking points
Most of the btrfs metadata operations can be protected by a spinlock,
but some operations still need to schedule.
So far, btrfs has been using a mutex along with a trylock loop,
most of the time it is able to avoid going for the full mutex, so
the trylock loop is a big performance gain.
This commit is step one for getting rid of the blocking locks entirely.
btrfs_tree_lock takes a spinlock, and the code explicitly switches
to a blocking lock when it starts an operation that can schedule.
We'll be able get rid of the blocking locks in smaller pieces over time.
Tracing allows us to find the most common cause of blocking, so we
can start with the hot spots first.
The basic idea is:
btrfs_tree_lock() returns with the spin lock held
btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
the extent buffer flags, and then drops the spin lock. The buffer is
still considered locked by all of the btrfs code.
If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
the spin lock and waits on a wait queue for the blocking bit to go away.
Much of the code that needs to set the blocking bit finishes without actually
blocking a good percentage of the time. So, an adaptive spin is still
used against the blocking bit to avoid very high context switch rates.
btrfs_clear_lock_blocking() clears the blocking bit and returns
with the spinlock held again.
btrfs_tree_unlock() can be called on either blocking or spinning locks,
it does the right thing based on the blocking bit.
ctree.c has a helper function to set/clear all the locked buffers in a
path as blocking.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 22:25:08 +08:00
|
|
|
|
2008-09-12 04:17:57 +08:00
|
|
|
if (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID) {
|
2009-11-12 17:33:26 +08:00
|
|
|
/*
|
|
|
|
* we allow two log transactions at a time, use different
|
|
|
|
* EXENT bit to differentiate dirty pages.
|
|
|
|
*/
|
|
|
|
if (root->log_transid % 2 == 0)
|
|
|
|
set_extent_dirty(&root->dirty_log_pages, buf->start,
|
|
|
|
buf->start + buf->len - 1, GFP_NOFS);
|
|
|
|
else
|
|
|
|
set_extent_new(&root->dirty_log_pages, buf->start,
|
|
|
|
buf->start + buf->len - 1, GFP_NOFS);
|
2008-09-12 04:17:57 +08:00
|
|
|
} else {
|
|
|
|
set_extent_dirty(&trans->transaction->dirty_pages, buf->start,
|
2008-08-02 03:11:20 +08:00
|
|
|
buf->start + buf->len - 1, GFP_NOFS);
|
2008-09-12 04:17:57 +08:00
|
|
|
}
|
2008-08-02 03:11:20 +08:00
|
|
|
trans->blocks_used++;
|
Btrfs: Change btree locking to use explicit blocking points
Most of the btrfs metadata operations can be protected by a spinlock,
but some operations still need to schedule.
So far, btrfs has been using a mutex along with a trylock loop,
most of the time it is able to avoid going for the full mutex, so
the trylock loop is a big performance gain.
This commit is step one for getting rid of the blocking locks entirely.
btrfs_tree_lock takes a spinlock, and the code explicitly switches
to a blocking lock when it starts an operation that can schedule.
We'll be able get rid of the blocking locks in smaller pieces over time.
Tracing allows us to find the most common cause of blocking, so we
can start with the hot spots first.
The basic idea is:
btrfs_tree_lock() returns with the spin lock held
btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
the extent buffer flags, and then drops the spin lock. The buffer is
still considered locked by all of the btrfs code.
If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
the spin lock and waits on a wait queue for the blocking bit to go away.
Much of the code that needs to set the blocking bit finishes without actually
blocking a good percentage of the time. So, an adaptive spin is still
used against the blocking bit to avoid very high context switch rates.
btrfs_clear_lock_blocking() clears the blocking bit and returns
with the spinlock held again.
btrfs_tree_unlock() can be called on either blocking or spinning locks,
it does the right thing based on the blocking bit.
ctree.c has a helper function to set/clear all the locked buffers in a
path as blocking.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 22:25:08 +08:00
|
|
|
/* this returns a buffer locked for blocking */
|
2008-08-02 03:11:20 +08:00
|
|
|
return buf;
|
|
|
|
}
|
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
static struct btrfs_block_rsv *
|
|
|
|
use_block_rsv(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root, u32 blocksize)
|
|
|
|
{
|
|
|
|
struct btrfs_block_rsv *block_rsv;
|
2011-01-25 05:43:20 +08:00
|
|
|
struct btrfs_block_rsv *global_rsv = &root->fs_info->global_block_rsv;
|
2010-05-16 22:46:25 +08:00
|
|
|
int ret;
|
|
|
|
|
|
|
|
block_rsv = get_block_rsv(trans, root);
|
|
|
|
|
|
|
|
if (block_rsv->size == 0) {
|
Btrfs: improve the noflush reservation
In some places(such as: evicting inode), we just can not flush the reserved
space of delalloc, flushing the delayed directory index and delayed inode
is OK, but we don't try to flush those things and just go back when there is
no enough space to be reserved. This patch fixes this problem.
We defined 3 types of the flush operations: NO_FLUSH, FLUSH_LIMIT and FLUSH_ALL.
If we can in the transaction, we should not flush anything, or the deadlock
would happen, so use NO_FLUSH. If we flushing the reserved space of delalloc
would cause deadlock, use FLUSH_LIMIT. In the other cases, FLUSH_ALL is used,
and we will flush all things.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-10-16 19:33:38 +08:00
|
|
|
ret = reserve_metadata_bytes(root, block_rsv, blocksize,
|
|
|
|
BTRFS_RESERVE_NO_FLUSH);
|
2011-01-25 05:43:20 +08:00
|
|
|
/*
|
|
|
|
* If we couldn't reserve metadata bytes try and use some from
|
|
|
|
* the global reserve.
|
|
|
|
*/
|
|
|
|
if (ret && block_rsv != global_rsv) {
|
|
|
|
ret = block_rsv_use_bytes(global_rsv, blocksize);
|
|
|
|
if (!ret)
|
|
|
|
return global_rsv;
|
2010-05-16 22:46:25 +08:00
|
|
|
return ERR_PTR(ret);
|
2011-01-25 05:43:20 +08:00
|
|
|
} else if (ret) {
|
2010-05-16 22:46:25 +08:00
|
|
|
return ERR_PTR(ret);
|
2011-01-25 05:43:20 +08:00
|
|
|
}
|
2010-05-16 22:46:25 +08:00
|
|
|
return block_rsv;
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = block_rsv_use_bytes(block_rsv, blocksize);
|
|
|
|
if (!ret)
|
|
|
|
return block_rsv;
|
2012-08-28 05:48:15 +08:00
|
|
|
if (ret && !block_rsv->failfast) {
|
2013-02-09 05:28:17 +08:00
|
|
|
if (btrfs_test_opt(root, ENOSPC_DEBUG)) {
|
|
|
|
static DEFINE_RATELIMIT_STATE(_rs,
|
|
|
|
DEFAULT_RATELIMIT_INTERVAL * 10,
|
|
|
|
/*DEFAULT_RATELIMIT_BURST*/ 1);
|
|
|
|
if (__ratelimit(&_rs))
|
|
|
|
WARN(1, KERN_DEBUG
|
|
|
|
"btrfs: block rsv returned %d\n", ret);
|
|
|
|
}
|
Btrfs: improve the noflush reservation
In some places(such as: evicting inode), we just can not flush the reserved
space of delalloc, flushing the delayed directory index and delayed inode
is OK, but we don't try to flush those things and just go back when there is
no enough space to be reserved. This patch fixes this problem.
We defined 3 types of the flush operations: NO_FLUSH, FLUSH_LIMIT and FLUSH_ALL.
If we can in the transaction, we should not flush anything, or the deadlock
would happen, so use NO_FLUSH. If we flushing the reserved space of delalloc
would cause deadlock, use FLUSH_LIMIT. In the other cases, FLUSH_ALL is used,
and we will flush all things.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-10-16 19:33:38 +08:00
|
|
|
ret = reserve_metadata_bytes(root, block_rsv, blocksize,
|
|
|
|
BTRFS_RESERVE_NO_FLUSH);
|
2011-01-25 05:43:20 +08:00
|
|
|
if (!ret) {
|
|
|
|
return block_rsv;
|
|
|
|
} else if (ret && block_rsv != global_rsv) {
|
|
|
|
ret = block_rsv_use_bytes(global_rsv, blocksize);
|
|
|
|
if (!ret)
|
|
|
|
return global_rsv;
|
|
|
|
}
|
|
|
|
}
|
2010-05-16 22:46:25 +08:00
|
|
|
|
|
|
|
return ERR_PTR(-ENOSPC);
|
|
|
|
}
|
|
|
|
|
2012-01-10 23:31:31 +08:00
|
|
|
static void unuse_block_rsv(struct btrfs_fs_info *fs_info,
|
|
|
|
struct btrfs_block_rsv *block_rsv, u32 blocksize)
|
2010-05-16 22:46:25 +08:00
|
|
|
{
|
|
|
|
block_rsv_add_bytes(block_rsv, blocksize, 0);
|
2012-01-10 23:31:31 +08:00
|
|
|
block_rsv_release_bytes(fs_info, block_rsv, NULL, 0);
|
2010-05-16 22:46:25 +08:00
|
|
|
}
|
|
|
|
|
2007-02-26 23:40:21 +08:00
|
|
|
/*
|
2010-05-16 22:46:25 +08:00
|
|
|
* finds a free extent and does all the dirty work required for allocation
|
|
|
|
* returns the key for the extent through ins, and a tree buffer for
|
|
|
|
* the first block of the extent through buf.
|
|
|
|
*
|
2007-02-26 23:40:21 +08:00
|
|
|
* returns the tree buffer or NULL.
|
|
|
|
*/
|
2007-10-16 04:14:19 +08:00
|
|
|
struct extent_buffer *btrfs_alloc_free_block(struct btrfs_trans_handle *trans,
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
struct btrfs_root *root, u32 blocksize,
|
|
|
|
u64 parent, u64 root_objectid,
|
|
|
|
struct btrfs_disk_key *key, int level,
|
2012-05-16 23:04:52 +08:00
|
|
|
u64 hint, u64 empty_size)
|
2007-02-26 23:40:21 +08:00
|
|
|
{
|
2007-03-13 04:22:34 +08:00
|
|
|
struct btrfs_key ins;
|
2010-05-16 22:46:25 +08:00
|
|
|
struct btrfs_block_rsv *block_rsv;
|
2007-10-16 04:14:19 +08:00
|
|
|
struct extent_buffer *buf;
|
2010-05-16 22:46:25 +08:00
|
|
|
u64 flags = 0;
|
|
|
|
int ret;
|
|
|
|
|
2007-02-26 23:40:21 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
block_rsv = use_block_rsv(trans, root, blocksize);
|
|
|
|
if (IS_ERR(block_rsv))
|
|
|
|
return ERR_CAST(block_rsv);
|
|
|
|
|
|
|
|
ret = btrfs_reserve_extent(trans, root, blocksize, blocksize,
|
2012-01-18 23:56:06 +08:00
|
|
|
empty_size, hint, &ins, 0);
|
2007-02-26 23:40:21 +08:00
|
|
|
if (ret) {
|
2012-01-10 23:31:31 +08:00
|
|
|
unuse_block_rsv(root->fs_info, block_rsv, blocksize);
|
2007-06-23 02:16:25 +08:00
|
|
|
return ERR_PTR(ret);
|
2007-02-26 23:40:21 +08:00
|
|
|
}
|
2008-01-10 04:55:33 +08:00
|
|
|
|
2009-02-13 03:09:45 +08:00
|
|
|
buf = btrfs_init_new_buffer(trans, root, ins.objectid,
|
|
|
|
blocksize, level);
|
2012-03-12 23:03:00 +08:00
|
|
|
BUG_ON(IS_ERR(buf)); /* -ENOMEM */
|
2010-05-16 22:46:25 +08:00
|
|
|
|
|
|
|
if (root_objectid == BTRFS_TREE_RELOC_OBJECTID) {
|
|
|
|
if (parent == 0)
|
|
|
|
parent = ins.objectid;
|
|
|
|
flags |= BTRFS_BLOCK_FLAG_FULL_BACKREF;
|
|
|
|
} else
|
|
|
|
BUG_ON(parent > 0);
|
|
|
|
|
|
|
|
if (root_objectid != BTRFS_TREE_LOG_OBJECTID) {
|
|
|
|
struct btrfs_delayed_extent_op *extent_op;
|
2012-11-21 10:21:28 +08:00
|
|
|
extent_op = btrfs_alloc_delayed_extent_op();
|
2012-03-12 23:03:00 +08:00
|
|
|
BUG_ON(!extent_op); /* -ENOMEM */
|
2010-05-16 22:46:25 +08:00
|
|
|
if (key)
|
|
|
|
memcpy(&extent_op->key, key, sizeof(extent_op->key));
|
|
|
|
else
|
|
|
|
memset(&extent_op->key, 0, sizeof(extent_op->key));
|
|
|
|
extent_op->flags_to_set = flags;
|
|
|
|
extent_op->update_key = 1;
|
|
|
|
extent_op->update_flags = 1;
|
|
|
|
extent_op->is_data = 0;
|
|
|
|
|
2011-09-12 21:26:38 +08:00
|
|
|
ret = btrfs_add_delayed_tree_ref(root->fs_info, trans,
|
|
|
|
ins.objectid,
|
2010-05-16 22:46:25 +08:00
|
|
|
ins.offset, parent, root_objectid,
|
|
|
|
level, BTRFS_ADD_DELAYED_EXTENT,
|
2012-05-16 23:04:52 +08:00
|
|
|
extent_op, 0);
|
2012-03-12 23:03:00 +08:00
|
|
|
BUG_ON(ret); /* -ENOMEM */
|
2010-05-16 22:46:25 +08:00
|
|
|
}
|
2007-02-26 23:40:21 +08:00
|
|
|
return buf;
|
|
|
|
}
|
2007-03-07 09:08:01 +08:00
|
|
|
|
2009-06-28 09:07:35 +08:00
|
|
|
struct walk_control {
|
|
|
|
u64 refs[BTRFS_MAX_LEVEL];
|
|
|
|
u64 flags[BTRFS_MAX_LEVEL];
|
|
|
|
struct btrfs_key update_progress;
|
|
|
|
int stage;
|
|
|
|
int level;
|
|
|
|
int shared_level;
|
|
|
|
int update_ref;
|
|
|
|
int keep_locks;
|
2009-09-22 03:55:59 +08:00
|
|
|
int reada_slot;
|
|
|
|
int reada_count;
|
2011-09-12 21:26:38 +08:00
|
|
|
int for_reloc;
|
2009-06-28 09:07:35 +08:00
|
|
|
};
|
|
|
|
|
|
|
|
#define DROP_REFERENCE 1
|
|
|
|
#define UPDATE_BACKREF 2
|
|
|
|
|
2009-09-22 03:55:59 +08:00
|
|
|
static noinline void reada_walk_down(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct walk_control *wc,
|
|
|
|
struct btrfs_path *path)
|
2007-03-27 18:33:00 +08:00
|
|
|
{
|
2009-09-22 03:55:59 +08:00
|
|
|
u64 bytenr;
|
|
|
|
u64 generation;
|
|
|
|
u64 refs;
|
2009-10-09 21:25:16 +08:00
|
|
|
u64 flags;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
u32 nritems;
|
2009-09-22 03:55:59 +08:00
|
|
|
u32 blocksize;
|
|
|
|
struct btrfs_key key;
|
|
|
|
struct extent_buffer *eb;
|
2007-03-27 18:33:00 +08:00
|
|
|
int ret;
|
2009-09-22 03:55:59 +08:00
|
|
|
int slot;
|
|
|
|
int nread = 0;
|
2007-03-27 18:33:00 +08:00
|
|
|
|
2009-09-22 03:55:59 +08:00
|
|
|
if (path->slots[wc->level] < wc->reada_slot) {
|
|
|
|
wc->reada_count = wc->reada_count * 2 / 3;
|
|
|
|
wc->reada_count = max(wc->reada_count, 2);
|
|
|
|
} else {
|
|
|
|
wc->reada_count = wc->reada_count * 3 / 2;
|
|
|
|
wc->reada_count = min_t(int, wc->reada_count,
|
|
|
|
BTRFS_NODEPTRS_PER_BLOCK(root));
|
|
|
|
}
|
2007-12-11 22:25:06 +08:00
|
|
|
|
2009-09-22 03:55:59 +08:00
|
|
|
eb = path->nodes[wc->level];
|
|
|
|
nritems = btrfs_header_nritems(eb);
|
|
|
|
blocksize = btrfs_level_size(root, wc->level - 1);
|
2009-02-04 22:27:02 +08:00
|
|
|
|
2009-09-22 03:55:59 +08:00
|
|
|
for (slot = path->slots[wc->level]; slot < nritems; slot++) {
|
|
|
|
if (nread >= wc->reada_count)
|
|
|
|
break;
|
2009-02-04 22:27:02 +08:00
|
|
|
|
2008-08-04 20:20:15 +08:00
|
|
|
cond_resched();
|
2009-09-22 03:55:59 +08:00
|
|
|
bytenr = btrfs_node_blockptr(eb, slot);
|
|
|
|
generation = btrfs_node_ptr_generation(eb, slot);
|
2008-08-04 20:20:15 +08:00
|
|
|
|
2009-09-22 03:55:59 +08:00
|
|
|
if (slot == path->slots[wc->level])
|
|
|
|
goto reada;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
|
2009-09-22 03:55:59 +08:00
|
|
|
if (wc->stage == UPDATE_BACKREF &&
|
|
|
|
generation <= root->root_key.offset)
|
2009-02-04 22:27:02 +08:00
|
|
|
continue;
|
|
|
|
|
2009-10-09 21:25:16 +08:00
|
|
|
/* We don't lock the tree block, it's OK to be racy here */
|
|
|
|
ret = btrfs_lookup_extent_info(trans, root, bytenr, blocksize,
|
|
|
|
&refs, &flags);
|
2012-03-12 23:03:00 +08:00
|
|
|
/* We don't care about errors in readahead. */
|
|
|
|
if (ret < 0)
|
|
|
|
continue;
|
2009-10-09 21:25:16 +08:00
|
|
|
BUG_ON(refs == 0);
|
|
|
|
|
2009-09-22 03:55:59 +08:00
|
|
|
if (wc->stage == DROP_REFERENCE) {
|
|
|
|
if (refs == 1)
|
|
|
|
goto reada;
|
2009-02-04 22:27:02 +08:00
|
|
|
|
2009-10-09 21:25:16 +08:00
|
|
|
if (wc->level == 1 &&
|
|
|
|
(flags & BTRFS_BLOCK_FLAG_FULL_BACKREF))
|
|
|
|
continue;
|
2009-09-22 03:55:59 +08:00
|
|
|
if (!wc->update_ref ||
|
|
|
|
generation <= root->root_key.offset)
|
|
|
|
continue;
|
|
|
|
btrfs_node_key_to_cpu(eb, &key, slot);
|
|
|
|
ret = btrfs_comp_cpu_keys(&key,
|
|
|
|
&wc->update_progress);
|
|
|
|
if (ret < 0)
|
|
|
|
continue;
|
2009-10-09 21:25:16 +08:00
|
|
|
} else {
|
|
|
|
if (wc->level == 1 &&
|
|
|
|
(flags & BTRFS_BLOCK_FLAG_FULL_BACKREF))
|
|
|
|
continue;
|
2007-03-27 18:33:00 +08:00
|
|
|
}
|
2009-09-22 03:55:59 +08:00
|
|
|
reada:
|
|
|
|
ret = readahead_tree_block(root, bytenr, blocksize,
|
|
|
|
generation);
|
|
|
|
if (ret)
|
2009-02-04 22:27:02 +08:00
|
|
|
break;
|
2009-09-22 03:55:59 +08:00
|
|
|
nread++;
|
2007-03-10 19:35:47 +08:00
|
|
|
}
|
2009-09-22 03:55:59 +08:00
|
|
|
wc->reada_slot = slot;
|
2007-03-10 19:35:47 +08:00
|
|
|
}
|
2009-06-28 09:07:35 +08:00
|
|
|
|
2008-10-30 02:49:05 +08:00
|
|
|
/*
|
2009-06-28 09:07:35 +08:00
|
|
|
* hepler to process tree block while walking down the tree.
|
|
|
|
*
|
|
|
|
* when wc->stage == UPDATE_BACKREF, this function updates
|
|
|
|
* back refs for pointers in the block.
|
|
|
|
*
|
|
|
|
* NOTE: return value 1 means we should stop walking down.
|
2008-10-30 02:49:05 +08:00
|
|
|
*/
|
2009-06-28 09:07:35 +08:00
|
|
|
static noinline int walk_down_proc(struct btrfs_trans_handle *trans,
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
struct btrfs_root *root,
|
2009-06-28 09:07:35 +08:00
|
|
|
struct btrfs_path *path,
|
2009-10-09 21:25:16 +08:00
|
|
|
struct walk_control *wc, int lookup_info)
|
2008-10-30 02:49:05 +08:00
|
|
|
{
|
2009-06-28 09:07:35 +08:00
|
|
|
int level = wc->level;
|
|
|
|
struct extent_buffer *eb = path->nodes[level];
|
|
|
|
u64 flag = BTRFS_BLOCK_FLAG_FULL_BACKREF;
|
2008-10-30 02:49:05 +08:00
|
|
|
int ret;
|
|
|
|
|
2009-06-28 09:07:35 +08:00
|
|
|
if (wc->stage == UPDATE_BACKREF &&
|
|
|
|
btrfs_header_owner(eb) != root->root_key.objectid)
|
|
|
|
return 1;
|
2008-10-30 02:49:05 +08:00
|
|
|
|
2009-06-28 09:07:35 +08:00
|
|
|
/*
|
|
|
|
* when reference count of tree block is 1, it won't increase
|
|
|
|
* again. once full backref flag is set, we never clear it.
|
|
|
|
*/
|
2009-10-09 21:25:16 +08:00
|
|
|
if (lookup_info &&
|
|
|
|
((wc->stage == DROP_REFERENCE && wc->refs[level] != 1) ||
|
|
|
|
(wc->stage == UPDATE_BACKREF && !(wc->flags[level] & flag)))) {
|
2009-06-28 09:07:35 +08:00
|
|
|
BUG_ON(!path->locks[level]);
|
|
|
|
ret = btrfs_lookup_extent_info(trans, root,
|
|
|
|
eb->start, eb->len,
|
|
|
|
&wc->refs[level],
|
|
|
|
&wc->flags[level]);
|
2012-03-12 23:03:00 +08:00
|
|
|
BUG_ON(ret == -ENOMEM);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
2009-06-28 09:07:35 +08:00
|
|
|
BUG_ON(wc->refs[level] == 0);
|
|
|
|
}
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
|
2009-06-28 09:07:35 +08:00
|
|
|
if (wc->stage == DROP_REFERENCE) {
|
|
|
|
if (wc->refs[level] > 1)
|
|
|
|
return 1;
|
2008-10-30 02:49:05 +08:00
|
|
|
|
2009-06-28 09:07:35 +08:00
|
|
|
if (path->locks[level] && !wc->keep_locks) {
|
2011-07-17 03:23:14 +08:00
|
|
|
btrfs_tree_unlock_rw(eb, path->locks[level]);
|
2009-06-28 09:07:35 +08:00
|
|
|
path->locks[level] = 0;
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
2008-10-30 02:49:05 +08:00
|
|
|
|
2009-06-28 09:07:35 +08:00
|
|
|
/* wc->stage == UPDATE_BACKREF */
|
|
|
|
if (!(wc->flags[level] & flag)) {
|
|
|
|
BUG_ON(!path->locks[level]);
|
2011-09-12 21:26:38 +08:00
|
|
|
ret = btrfs_inc_ref(trans, root, eb, 1, wc->for_reloc);
|
2012-03-12 23:03:00 +08:00
|
|
|
BUG_ON(ret); /* -ENOMEM */
|
2011-09-12 21:26:38 +08:00
|
|
|
ret = btrfs_dec_ref(trans, root, eb, 0, wc->for_reloc);
|
2012-03-12 23:03:00 +08:00
|
|
|
BUG_ON(ret); /* -ENOMEM */
|
2009-06-28 09:07:35 +08:00
|
|
|
ret = btrfs_set_disk_extent_flags(trans, root, eb->start,
|
|
|
|
eb->len, flag, 0);
|
2012-03-12 23:03:00 +08:00
|
|
|
BUG_ON(ret); /* -ENOMEM */
|
2009-06-28 09:07:35 +08:00
|
|
|
wc->flags[level] |= flag;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* the block is shared by multiple trees, so it's not good to
|
|
|
|
* keep the tree lock
|
|
|
|
*/
|
|
|
|
if (path->locks[level] && level > 0) {
|
2011-07-17 03:23:14 +08:00
|
|
|
btrfs_tree_unlock_rw(eb, path->locks[level]);
|
2009-06-28 09:07:35 +08:00
|
|
|
path->locks[level] = 0;
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2009-09-22 03:55:59 +08:00
|
|
|
/*
|
|
|
|
* hepler to process tree block pointer.
|
|
|
|
*
|
|
|
|
* when wc->stage == DROP_REFERENCE, this function checks
|
|
|
|
* reference count of the block pointed to. if the block
|
|
|
|
* is shared and we need update back refs for the subtree
|
|
|
|
* rooted at the block, this function changes wc->stage to
|
|
|
|
* UPDATE_BACKREF. if the block is shared and there is no
|
|
|
|
* need to update back, this function drops the reference
|
|
|
|
* to the block.
|
|
|
|
*
|
|
|
|
* NOTE: return value 1 means we should stop walking down.
|
|
|
|
*/
|
|
|
|
static noinline int do_walk_down(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
2009-10-09 21:25:16 +08:00
|
|
|
struct walk_control *wc, int *lookup_info)
|
2009-09-22 03:55:59 +08:00
|
|
|
{
|
|
|
|
u64 bytenr;
|
|
|
|
u64 generation;
|
|
|
|
u64 parent;
|
|
|
|
u32 blocksize;
|
|
|
|
struct btrfs_key key;
|
|
|
|
struct extent_buffer *next;
|
|
|
|
int level = wc->level;
|
|
|
|
int reada = 0;
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
generation = btrfs_node_ptr_generation(path->nodes[level],
|
|
|
|
path->slots[level]);
|
|
|
|
/*
|
|
|
|
* if the lower level block was created before the snapshot
|
|
|
|
* was created, we know there is no need to update back refs
|
|
|
|
* for the subtree
|
|
|
|
*/
|
|
|
|
if (wc->stage == UPDATE_BACKREF &&
|
2009-10-09 21:25:16 +08:00
|
|
|
generation <= root->root_key.offset) {
|
|
|
|
*lookup_info = 1;
|
2009-09-22 03:55:59 +08:00
|
|
|
return 1;
|
2009-10-09 21:25:16 +08:00
|
|
|
}
|
2009-09-22 03:55:59 +08:00
|
|
|
|
|
|
|
bytenr = btrfs_node_blockptr(path->nodes[level], path->slots[level]);
|
|
|
|
blocksize = btrfs_level_size(root, level - 1);
|
|
|
|
|
|
|
|
next = btrfs_find_tree_block(root, bytenr, blocksize);
|
|
|
|
if (!next) {
|
|
|
|
next = btrfs_find_create_tree_block(root, bytenr, blocksize);
|
2010-03-25 20:37:12 +08:00
|
|
|
if (!next)
|
|
|
|
return -ENOMEM;
|
2009-09-22 03:55:59 +08:00
|
|
|
reada = 1;
|
|
|
|
}
|
|
|
|
btrfs_tree_lock(next);
|
|
|
|
btrfs_set_lock_blocking(next);
|
|
|
|
|
2009-10-09 21:25:16 +08:00
|
|
|
ret = btrfs_lookup_extent_info(trans, root, bytenr, blocksize,
|
|
|
|
&wc->refs[level - 1],
|
|
|
|
&wc->flags[level - 1]);
|
2012-03-12 23:03:00 +08:00
|
|
|
if (ret < 0) {
|
|
|
|
btrfs_tree_unlock(next);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2009-10-09 21:25:16 +08:00
|
|
|
BUG_ON(wc->refs[level - 1] == 0);
|
|
|
|
*lookup_info = 0;
|
2009-09-22 03:55:59 +08:00
|
|
|
|
2009-10-09 21:25:16 +08:00
|
|
|
if (wc->stage == DROP_REFERENCE) {
|
2009-09-22 03:55:59 +08:00
|
|
|
if (wc->refs[level - 1] > 1) {
|
2009-10-09 21:25:16 +08:00
|
|
|
if (level == 1 &&
|
|
|
|
(wc->flags[0] & BTRFS_BLOCK_FLAG_FULL_BACKREF))
|
|
|
|
goto skip;
|
|
|
|
|
2009-09-22 03:55:59 +08:00
|
|
|
if (!wc->update_ref ||
|
|
|
|
generation <= root->root_key.offset)
|
|
|
|
goto skip;
|
|
|
|
|
|
|
|
btrfs_node_key_to_cpu(path->nodes[level], &key,
|
|
|
|
path->slots[level]);
|
|
|
|
ret = btrfs_comp_cpu_keys(&key, &wc->update_progress);
|
|
|
|
if (ret < 0)
|
|
|
|
goto skip;
|
|
|
|
|
|
|
|
wc->stage = UPDATE_BACKREF;
|
|
|
|
wc->shared_level = level - 1;
|
|
|
|
}
|
2009-10-09 21:25:16 +08:00
|
|
|
} else {
|
|
|
|
if (level == 1 &&
|
|
|
|
(wc->flags[0] & BTRFS_BLOCK_FLAG_FULL_BACKREF))
|
|
|
|
goto skip;
|
2009-09-22 03:55:59 +08:00
|
|
|
}
|
|
|
|
|
2012-05-06 19:23:47 +08:00
|
|
|
if (!btrfs_buffer_uptodate(next, generation, 0)) {
|
2009-09-22 03:55:59 +08:00
|
|
|
btrfs_tree_unlock(next);
|
|
|
|
free_extent_buffer(next);
|
|
|
|
next = NULL;
|
2009-10-09 21:25:16 +08:00
|
|
|
*lookup_info = 1;
|
2009-09-22 03:55:59 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
if (!next) {
|
|
|
|
if (reada && level == 1)
|
|
|
|
reada_walk_down(trans, root, wc, path);
|
|
|
|
next = read_tree_block(root, bytenr, blocksize, generation);
|
2011-03-24 14:33:21 +08:00
|
|
|
if (!next)
|
|
|
|
return -EIO;
|
2009-09-22 03:55:59 +08:00
|
|
|
btrfs_tree_lock(next);
|
|
|
|
btrfs_set_lock_blocking(next);
|
|
|
|
}
|
|
|
|
|
|
|
|
level--;
|
|
|
|
BUG_ON(level != btrfs_header_level(next));
|
|
|
|
path->nodes[level] = next;
|
|
|
|
path->slots[level] = 0;
|
2011-07-17 03:23:14 +08:00
|
|
|
path->locks[level] = BTRFS_WRITE_LOCK_BLOCKING;
|
2009-09-22 03:55:59 +08:00
|
|
|
wc->level = level;
|
|
|
|
if (wc->level == 1)
|
|
|
|
wc->reada_slot = 0;
|
|
|
|
return 0;
|
|
|
|
skip:
|
|
|
|
wc->refs[level - 1] = 0;
|
|
|
|
wc->flags[level - 1] = 0;
|
2009-10-09 21:25:16 +08:00
|
|
|
if (wc->stage == DROP_REFERENCE) {
|
|
|
|
if (wc->flags[level] & BTRFS_BLOCK_FLAG_FULL_BACKREF) {
|
|
|
|
parent = path->nodes[level]->start;
|
|
|
|
} else {
|
|
|
|
BUG_ON(root->root_key.objectid !=
|
|
|
|
btrfs_header_owner(path->nodes[level]));
|
|
|
|
parent = 0;
|
|
|
|
}
|
2009-09-22 03:55:59 +08:00
|
|
|
|
2009-10-09 21:25:16 +08:00
|
|
|
ret = btrfs_free_extent(trans, root, bytenr, blocksize, parent,
|
2011-09-12 21:26:38 +08:00
|
|
|
root->root_key.objectid, level - 1, 0, 0);
|
2012-03-12 23:03:00 +08:00
|
|
|
BUG_ON(ret); /* -ENOMEM */
|
2009-09-22 03:55:59 +08:00
|
|
|
}
|
|
|
|
btrfs_tree_unlock(next);
|
|
|
|
free_extent_buffer(next);
|
2009-10-09 21:25:16 +08:00
|
|
|
*lookup_info = 1;
|
2009-09-22 03:55:59 +08:00
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2009-06-28 09:07:35 +08:00
|
|
|
/*
|
|
|
|
* hepler to process tree block while walking up the tree.
|
|
|
|
*
|
|
|
|
* when wc->stage == DROP_REFERENCE, this function drops
|
|
|
|
* reference count on the block.
|
|
|
|
*
|
|
|
|
* when wc->stage == UPDATE_BACKREF, this function changes
|
|
|
|
* wc->stage back to DROP_REFERENCE if we changed wc->stage
|
|
|
|
* to UPDATE_BACKREF previously while processing the block.
|
|
|
|
*
|
|
|
|
* NOTE: return value 1 means we should stop walking up.
|
|
|
|
*/
|
|
|
|
static noinline int walk_up_proc(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
struct walk_control *wc)
|
|
|
|
{
|
2010-05-16 22:46:25 +08:00
|
|
|
int ret;
|
2009-06-28 09:07:35 +08:00
|
|
|
int level = wc->level;
|
|
|
|
struct extent_buffer *eb = path->nodes[level];
|
|
|
|
u64 parent = 0;
|
|
|
|
|
|
|
|
if (wc->stage == UPDATE_BACKREF) {
|
|
|
|
BUG_ON(wc->shared_level < level);
|
|
|
|
if (level < wc->shared_level)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
ret = find_next_key(path, level + 1, &wc->update_progress);
|
|
|
|
if (ret > 0)
|
|
|
|
wc->update_ref = 0;
|
|
|
|
|
|
|
|
wc->stage = DROP_REFERENCE;
|
|
|
|
wc->shared_level = -1;
|
|
|
|
path->slots[level] = 0;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* check reference count again if the block isn't locked.
|
|
|
|
* we should start walking down the tree again if reference
|
|
|
|
* count is one.
|
|
|
|
*/
|
|
|
|
if (!path->locks[level]) {
|
|
|
|
BUG_ON(level == 0);
|
|
|
|
btrfs_tree_lock(eb);
|
|
|
|
btrfs_set_lock_blocking(eb);
|
2011-07-17 03:23:14 +08:00
|
|
|
path->locks[level] = BTRFS_WRITE_LOCK_BLOCKING;
|
2009-06-28 09:07:35 +08:00
|
|
|
|
|
|
|
ret = btrfs_lookup_extent_info(trans, root,
|
|
|
|
eb->start, eb->len,
|
|
|
|
&wc->refs[level],
|
|
|
|
&wc->flags[level]);
|
2012-03-12 23:03:00 +08:00
|
|
|
if (ret < 0) {
|
|
|
|
btrfs_tree_unlock_rw(eb, path->locks[level]);
|
2012-12-28 17:33:19 +08:00
|
|
|
path->locks[level] = 0;
|
2012-03-12 23:03:00 +08:00
|
|
|
return ret;
|
|
|
|
}
|
2009-06-28 09:07:35 +08:00
|
|
|
BUG_ON(wc->refs[level] == 0);
|
|
|
|
if (wc->refs[level] == 1) {
|
2011-07-17 03:23:14 +08:00
|
|
|
btrfs_tree_unlock_rw(eb, path->locks[level]);
|
2012-12-28 17:33:19 +08:00
|
|
|
path->locks[level] = 0;
|
2009-06-28 09:07:35 +08:00
|
|
|
return 1;
|
|
|
|
}
|
2008-10-30 02:49:05 +08:00
|
|
|
}
|
2009-06-28 09:07:35 +08:00
|
|
|
}
|
2008-10-30 02:49:05 +08:00
|
|
|
|
2009-06-28 09:07:35 +08:00
|
|
|
/* wc->stage == DROP_REFERENCE */
|
|
|
|
BUG_ON(wc->refs[level] > 1 && !path->locks[level]);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
|
2009-06-28 09:07:35 +08:00
|
|
|
if (wc->refs[level] == 1) {
|
|
|
|
if (level == 0) {
|
|
|
|
if (wc->flags[level] & BTRFS_BLOCK_FLAG_FULL_BACKREF)
|
2011-09-12 21:26:38 +08:00
|
|
|
ret = btrfs_dec_ref(trans, root, eb, 1,
|
|
|
|
wc->for_reloc);
|
2009-06-28 09:07:35 +08:00
|
|
|
else
|
2011-09-12 21:26:38 +08:00
|
|
|
ret = btrfs_dec_ref(trans, root, eb, 0,
|
|
|
|
wc->for_reloc);
|
2012-03-12 23:03:00 +08:00
|
|
|
BUG_ON(ret); /* -ENOMEM */
|
2009-06-28 09:07:35 +08:00
|
|
|
}
|
|
|
|
/* make block locked assertion in clean_tree_block happy */
|
|
|
|
if (!path->locks[level] &&
|
|
|
|
btrfs_header_generation(eb) == trans->transid) {
|
|
|
|
btrfs_tree_lock(eb);
|
|
|
|
btrfs_set_lock_blocking(eb);
|
2011-07-17 03:23:14 +08:00
|
|
|
path->locks[level] = BTRFS_WRITE_LOCK_BLOCKING;
|
2009-06-28 09:07:35 +08:00
|
|
|
}
|
|
|
|
clean_tree_block(trans, root, eb);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (eb == root->node) {
|
|
|
|
if (wc->flags[level] & BTRFS_BLOCK_FLAG_FULL_BACKREF)
|
|
|
|
parent = eb->start;
|
|
|
|
else
|
|
|
|
BUG_ON(root->root_key.objectid !=
|
|
|
|
btrfs_header_owner(eb));
|
|
|
|
} else {
|
|
|
|
if (wc->flags[level + 1] & BTRFS_BLOCK_FLAG_FULL_BACKREF)
|
|
|
|
parent = path->nodes[level + 1]->start;
|
|
|
|
else
|
|
|
|
BUG_ON(root->root_key.objectid !=
|
|
|
|
btrfs_header_owner(path->nodes[level + 1]));
|
2008-10-30 02:49:05 +08:00
|
|
|
}
|
|
|
|
|
2012-05-16 23:04:52 +08:00
|
|
|
btrfs_free_tree_block(trans, root, eb, parent, wc->refs[level] == 1);
|
2009-06-28 09:07:35 +08:00
|
|
|
out:
|
|
|
|
wc->refs[level] = 0;
|
|
|
|
wc->flags[level] = 0;
|
2010-05-16 22:46:25 +08:00
|
|
|
return 0;
|
2009-06-28 09:07:35 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static noinline int walk_down_tree(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
struct walk_control *wc)
|
|
|
|
{
|
|
|
|
int level = wc->level;
|
2009-10-09 21:25:16 +08:00
|
|
|
int lookup_info = 1;
|
2009-06-28 09:07:35 +08:00
|
|
|
int ret;
|
|
|
|
|
|
|
|
while (level >= 0) {
|
2009-10-09 21:25:16 +08:00
|
|
|
ret = walk_down_proc(trans, root, path, wc, lookup_info);
|
2009-06-28 09:07:35 +08:00
|
|
|
if (ret > 0)
|
|
|
|
break;
|
|
|
|
|
|
|
|
if (level == 0)
|
|
|
|
break;
|
|
|
|
|
2010-02-01 10:41:17 +08:00
|
|
|
if (path->slots[level] >=
|
|
|
|
btrfs_header_nritems(path->nodes[level]))
|
|
|
|
break;
|
|
|
|
|
2009-10-09 21:25:16 +08:00
|
|
|
ret = do_walk_down(trans, root, path, wc, &lookup_info);
|
2009-09-22 03:55:59 +08:00
|
|
|
if (ret > 0) {
|
|
|
|
path->slots[level]++;
|
|
|
|
continue;
|
2010-03-25 20:37:12 +08:00
|
|
|
} else if (ret < 0)
|
|
|
|
return ret;
|
2009-09-22 03:55:59 +08:00
|
|
|
level = wc->level;
|
2008-10-30 02:49:05 +08:00
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2009-01-06 10:25:51 +08:00
|
|
|
static noinline int walk_up_tree(struct btrfs_trans_handle *trans,
|
2008-01-03 23:01:48 +08:00
|
|
|
struct btrfs_root *root,
|
2008-10-30 02:49:05 +08:00
|
|
|
struct btrfs_path *path,
|
2009-06-28 09:07:35 +08:00
|
|
|
struct walk_control *wc, int max_level)
|
2007-03-10 19:35:47 +08:00
|
|
|
{
|
2009-06-28 09:07:35 +08:00
|
|
|
int level = wc->level;
|
2007-03-10 19:35:47 +08:00
|
|
|
int ret;
|
2007-08-08 03:52:19 +08:00
|
|
|
|
2009-06-28 09:07:35 +08:00
|
|
|
path->slots[level] = btrfs_header_nritems(path->nodes[level]);
|
|
|
|
while (level < max_level && path->nodes[level]) {
|
|
|
|
wc->level = level;
|
|
|
|
if (path->slots[level] + 1 <
|
|
|
|
btrfs_header_nritems(path->nodes[level])) {
|
|
|
|
path->slots[level]++;
|
2007-03-10 19:35:47 +08:00
|
|
|
return 0;
|
|
|
|
} else {
|
2009-06-28 09:07:35 +08:00
|
|
|
ret = walk_up_proc(trans, root, path, wc);
|
|
|
|
if (ret > 0)
|
|
|
|
return 0;
|
2009-02-04 22:27:02 +08:00
|
|
|
|
2009-06-28 09:07:35 +08:00
|
|
|
if (path->locks[level]) {
|
2011-07-17 03:23:14 +08:00
|
|
|
btrfs_tree_unlock_rw(path->nodes[level],
|
|
|
|
path->locks[level]);
|
2009-06-28 09:07:35 +08:00
|
|
|
path->locks[level] = 0;
|
2008-10-30 02:49:05 +08:00
|
|
|
}
|
2009-06-28 09:07:35 +08:00
|
|
|
free_extent_buffer(path->nodes[level]);
|
|
|
|
path->nodes[level] = NULL;
|
|
|
|
level++;
|
2007-03-10 19:35:47 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2007-03-13 23:09:37 +08:00
|
|
|
/*
|
2009-06-28 09:07:35 +08:00
|
|
|
* drop a subvolume tree.
|
|
|
|
*
|
|
|
|
* this function traverses the tree freeing any blocks that only
|
|
|
|
* referenced by the tree.
|
|
|
|
*
|
|
|
|
* when a shared tree block is found. this function decreases its
|
|
|
|
* reference count by one. if update_ref is true, this function
|
|
|
|
* also make sure backrefs for the shared block and all lower level
|
|
|
|
* blocks are properly updated.
|
2007-03-13 23:09:37 +08:00
|
|
|
*/
|
2011-10-04 11:22:41 +08:00
|
|
|
int btrfs_drop_snapshot(struct btrfs_root *root,
|
2011-09-12 21:26:38 +08:00
|
|
|
struct btrfs_block_rsv *block_rsv, int update_ref,
|
|
|
|
int for_reloc)
|
2007-03-10 19:35:47 +08:00
|
|
|
{
|
2007-04-02 23:20:42 +08:00
|
|
|
struct btrfs_path *path;
|
2009-06-28 09:07:35 +08:00
|
|
|
struct btrfs_trans_handle *trans;
|
|
|
|
struct btrfs_root *tree_root = root->fs_info->tree_root;
|
2007-08-08 03:52:19 +08:00
|
|
|
struct btrfs_root_item *root_item = &root->root_item;
|
2009-06-28 09:07:35 +08:00
|
|
|
struct walk_control *wc;
|
|
|
|
struct btrfs_key key;
|
|
|
|
int err = 0;
|
|
|
|
int ret;
|
|
|
|
int level;
|
2007-03-10 19:35:47 +08:00
|
|
|
|
2007-04-02 23:20:42 +08:00
|
|
|
path = btrfs_alloc_path();
|
2011-08-09 15:11:13 +08:00
|
|
|
if (!path) {
|
|
|
|
err = -ENOMEM;
|
|
|
|
goto out;
|
|
|
|
}
|
2007-03-10 19:35:47 +08:00
|
|
|
|
2009-06-28 09:07:35 +08:00
|
|
|
wc = kzalloc(sizeof(*wc), GFP_NOFS);
|
2011-07-14 01:59:59 +08:00
|
|
|
if (!wc) {
|
|
|
|
btrfs_free_path(path);
|
2011-08-09 15:11:13 +08:00
|
|
|
err = -ENOMEM;
|
|
|
|
goto out;
|
2011-07-14 01:59:59 +08:00
|
|
|
}
|
2009-06-28 09:07:35 +08:00
|
|
|
|
2010-05-16 22:48:46 +08:00
|
|
|
trans = btrfs_start_transaction(tree_root, 0);
|
2012-03-12 23:03:00 +08:00
|
|
|
if (IS_ERR(trans)) {
|
|
|
|
err = PTR_ERR(trans);
|
|
|
|
goto out_free;
|
|
|
|
}
|
2011-01-20 14:19:37 +08:00
|
|
|
|
2010-05-16 22:49:59 +08:00
|
|
|
if (block_rsv)
|
|
|
|
trans->block_rsv = block_rsv;
|
2009-06-28 09:07:35 +08:00
|
|
|
|
2007-08-08 03:52:19 +08:00
|
|
|
if (btrfs_disk_key_objectid(&root_item->drop_progress) == 0) {
|
2009-06-28 09:07:35 +08:00
|
|
|
level = btrfs_header_level(root->node);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
path->nodes[level] = btrfs_lock_root_node(root);
|
|
|
|
btrfs_set_lock_blocking(path->nodes[level]);
|
2007-08-08 03:52:19 +08:00
|
|
|
path->slots[level] = 0;
|
2011-07-17 03:23:14 +08:00
|
|
|
path->locks[level] = BTRFS_WRITE_LOCK_BLOCKING;
|
2009-06-28 09:07:35 +08:00
|
|
|
memset(&wc->update_progress, 0,
|
|
|
|
sizeof(wc->update_progress));
|
2007-08-08 03:52:19 +08:00
|
|
|
} else {
|
|
|
|
btrfs_disk_key_to_cpu(&key, &root_item->drop_progress);
|
2009-06-28 09:07:35 +08:00
|
|
|
memcpy(&wc->update_progress, &key,
|
|
|
|
sizeof(wc->update_progress));
|
|
|
|
|
2007-08-08 04:15:09 +08:00
|
|
|
level = root_item->drop_level;
|
2009-06-28 09:07:35 +08:00
|
|
|
BUG_ON(level == 0);
|
2007-08-08 04:15:09 +08:00
|
|
|
path->lowest_level = level;
|
2009-06-28 09:07:35 +08:00
|
|
|
ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
|
|
|
|
path->lowest_level = 0;
|
|
|
|
if (ret < 0) {
|
|
|
|
err = ret;
|
2012-03-12 23:03:00 +08:00
|
|
|
goto out_end_trans;
|
2007-08-08 03:52:19 +08:00
|
|
|
}
|
2009-09-22 03:55:59 +08:00
|
|
|
WARN_ON(ret > 0);
|
2009-06-28 09:07:35 +08:00
|
|
|
|
2008-07-09 02:19:17 +08:00
|
|
|
/*
|
|
|
|
* unlock our path, this is safe because only this
|
|
|
|
* function is allowed to delete this snapshot
|
|
|
|
*/
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
btrfs_unlock_up_safe(path, 0);
|
2009-06-28 09:07:35 +08:00
|
|
|
|
|
|
|
level = btrfs_header_level(root->node);
|
|
|
|
while (1) {
|
|
|
|
btrfs_tree_lock(path->nodes[level]);
|
|
|
|
btrfs_set_lock_blocking(path->nodes[level]);
|
|
|
|
|
|
|
|
ret = btrfs_lookup_extent_info(trans, root,
|
|
|
|
path->nodes[level]->start,
|
|
|
|
path->nodes[level]->len,
|
|
|
|
&wc->refs[level],
|
|
|
|
&wc->flags[level]);
|
2012-03-12 23:03:00 +08:00
|
|
|
if (ret < 0) {
|
|
|
|
err = ret;
|
|
|
|
goto out_end_trans;
|
|
|
|
}
|
2009-06-28 09:07:35 +08:00
|
|
|
BUG_ON(wc->refs[level] == 0);
|
|
|
|
|
|
|
|
if (level == root_item->drop_level)
|
|
|
|
break;
|
|
|
|
|
|
|
|
btrfs_tree_unlock(path->nodes[level]);
|
|
|
|
WARN_ON(wc->refs[level] != 1);
|
|
|
|
level--;
|
|
|
|
}
|
2007-08-08 03:52:19 +08:00
|
|
|
}
|
2009-06-28 09:07:35 +08:00
|
|
|
|
|
|
|
wc->level = level;
|
|
|
|
wc->shared_level = -1;
|
|
|
|
wc->stage = DROP_REFERENCE;
|
|
|
|
wc->update_ref = update_ref;
|
|
|
|
wc->keep_locks = 0;
|
2011-09-12 21:26:38 +08:00
|
|
|
wc->for_reloc = for_reloc;
|
2009-09-22 03:55:59 +08:00
|
|
|
wc->reada_count = BTRFS_NODEPTRS_PER_BLOCK(root);
|
2009-06-28 09:07:35 +08:00
|
|
|
|
2009-01-06 10:25:51 +08:00
|
|
|
while (1) {
|
2009-06-28 09:07:35 +08:00
|
|
|
ret = walk_down_tree(trans, root, path, wc);
|
|
|
|
if (ret < 0) {
|
|
|
|
err = ret;
|
2007-03-10 19:35:47 +08:00
|
|
|
break;
|
2009-06-28 09:07:35 +08:00
|
|
|
}
|
2007-03-13 23:09:37 +08:00
|
|
|
|
2009-06-28 09:07:35 +08:00
|
|
|
ret = walk_up_tree(trans, root, path, wc, BTRFS_MAX_LEVEL);
|
|
|
|
if (ret < 0) {
|
|
|
|
err = ret;
|
2007-03-10 19:35:47 +08:00
|
|
|
break;
|
2009-06-28 09:07:35 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
if (ret > 0) {
|
|
|
|
BUG_ON(wc->stage != DROP_REFERENCE);
|
2008-06-26 04:01:31 +08:00
|
|
|
break;
|
|
|
|
}
|
2009-06-28 09:07:35 +08:00
|
|
|
|
|
|
|
if (wc->stage == DROP_REFERENCE) {
|
|
|
|
level = wc->level;
|
|
|
|
btrfs_node_key(path->nodes[level],
|
|
|
|
&root_item->drop_progress,
|
|
|
|
path->slots[level]);
|
|
|
|
root_item->drop_level = level;
|
|
|
|
}
|
|
|
|
|
|
|
|
BUG_ON(wc->level == 0);
|
2010-05-16 22:49:59 +08:00
|
|
|
if (btrfs_should_end_transaction(trans, tree_root)) {
|
2009-06-28 09:07:35 +08:00
|
|
|
ret = btrfs_update_root(trans, tree_root,
|
|
|
|
&root->root_key,
|
|
|
|
root_item);
|
2012-03-12 23:03:00 +08:00
|
|
|
if (ret) {
|
|
|
|
btrfs_abort_transaction(trans, tree_root, ret);
|
|
|
|
err = ret;
|
|
|
|
goto out_end_trans;
|
|
|
|
}
|
2009-06-28 09:07:35 +08:00
|
|
|
|
2010-05-16 22:49:59 +08:00
|
|
|
btrfs_end_transaction_throttle(trans, tree_root);
|
2010-05-16 22:48:46 +08:00
|
|
|
trans = btrfs_start_transaction(tree_root, 0);
|
2012-03-12 23:03:00 +08:00
|
|
|
if (IS_ERR(trans)) {
|
|
|
|
err = PTR_ERR(trans);
|
|
|
|
goto out_free;
|
|
|
|
}
|
2010-05-16 22:49:59 +08:00
|
|
|
if (block_rsv)
|
|
|
|
trans->block_rsv = block_rsv;
|
2009-03-13 22:17:05 +08:00
|
|
|
}
|
2007-03-10 19:35:47 +08:00
|
|
|
}
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
2012-03-12 23:03:00 +08:00
|
|
|
if (err)
|
|
|
|
goto out_end_trans;
|
2009-06-28 09:07:35 +08:00
|
|
|
|
|
|
|
ret = btrfs_del_root(trans, tree_root, &root->root_key);
|
2012-03-12 23:03:00 +08:00
|
|
|
if (ret) {
|
|
|
|
btrfs_abort_transaction(trans, tree_root, ret);
|
|
|
|
goto out_end_trans;
|
|
|
|
}
|
2009-06-28 09:07:35 +08:00
|
|
|
|
2009-09-22 04:00:26 +08:00
|
|
|
if (root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID) {
|
|
|
|
ret = btrfs_find_last_root(tree_root, root->root_key.objectid,
|
|
|
|
NULL, NULL);
|
2012-03-12 23:03:00 +08:00
|
|
|
if (ret < 0) {
|
|
|
|
btrfs_abort_transaction(trans, tree_root, ret);
|
|
|
|
err = ret;
|
|
|
|
goto out_end_trans;
|
|
|
|
} else if (ret > 0) {
|
2010-12-09 01:24:01 +08:00
|
|
|
/* if we fail to delete the orphan item this time
|
|
|
|
* around, it'll get picked up the next time.
|
|
|
|
*
|
|
|
|
* The most common failure here is just -ENOENT.
|
|
|
|
*/
|
|
|
|
btrfs_del_orphan_item(trans, tree_root,
|
|
|
|
root->root_key.objectid);
|
2009-09-22 04:00:26 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (root->in_radix) {
|
|
|
|
btrfs_free_fs_root(tree_root->fs_info, root);
|
|
|
|
} else {
|
|
|
|
free_extent_buffer(root->node);
|
|
|
|
free_extent_buffer(root->commit_root);
|
|
|
|
kfree(root);
|
|
|
|
}
|
2012-03-12 23:03:00 +08:00
|
|
|
out_end_trans:
|
2010-05-16 22:49:59 +08:00
|
|
|
btrfs_end_transaction_throttle(trans, tree_root);
|
2012-03-12 23:03:00 +08:00
|
|
|
out_free:
|
2009-06-28 09:07:35 +08:00
|
|
|
kfree(wc);
|
2007-04-02 23:20:42 +08:00
|
|
|
btrfs_free_path(path);
|
2011-08-09 15:11:13 +08:00
|
|
|
out:
|
|
|
|
if (err)
|
|
|
|
btrfs_std_error(root->fs_info, err);
|
2011-10-04 11:22:41 +08:00
|
|
|
return err;
|
2007-03-10 19:35:47 +08:00
|
|
|
}
|
2007-04-27 04:46:15 +08:00
|
|
|
|
2009-06-28 09:07:35 +08:00
|
|
|
/*
|
|
|
|
* drop subtree rooted at tree block 'node'.
|
|
|
|
*
|
|
|
|
* NOTE: this function will unlock and release tree block 'node'
|
2011-09-12 21:26:38 +08:00
|
|
|
* only used by relocation code
|
2009-06-28 09:07:35 +08:00
|
|
|
*/
|
2008-10-30 02:49:05 +08:00
|
|
|
int btrfs_drop_subtree(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct extent_buffer *node,
|
|
|
|
struct extent_buffer *parent)
|
|
|
|
{
|
|
|
|
struct btrfs_path *path;
|
2009-06-28 09:07:35 +08:00
|
|
|
struct walk_control *wc;
|
2008-10-30 02:49:05 +08:00
|
|
|
int level;
|
|
|
|
int parent_level;
|
|
|
|
int ret = 0;
|
|
|
|
int wret;
|
|
|
|
|
2009-06-28 09:07:35 +08:00
|
|
|
BUG_ON(root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID);
|
|
|
|
|
2008-10-30 02:49:05 +08:00
|
|
|
path = btrfs_alloc_path();
|
2011-03-23 16:14:16 +08:00
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
2008-10-30 02:49:05 +08:00
|
|
|
|
2009-06-28 09:07:35 +08:00
|
|
|
wc = kzalloc(sizeof(*wc), GFP_NOFS);
|
2011-03-23 16:14:16 +08:00
|
|
|
if (!wc) {
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return -ENOMEM;
|
|
|
|
}
|
2009-06-28 09:07:35 +08:00
|
|
|
|
2009-03-09 23:45:38 +08:00
|
|
|
btrfs_assert_tree_locked(parent);
|
2008-10-30 02:49:05 +08:00
|
|
|
parent_level = btrfs_header_level(parent);
|
|
|
|
extent_buffer_get(parent);
|
|
|
|
path->nodes[parent_level] = parent;
|
|
|
|
path->slots[parent_level] = btrfs_header_nritems(parent);
|
|
|
|
|
2009-03-09 23:45:38 +08:00
|
|
|
btrfs_assert_tree_locked(node);
|
2008-10-30 02:49:05 +08:00
|
|
|
level = btrfs_header_level(node);
|
|
|
|
path->nodes[level] = node;
|
|
|
|
path->slots[level] = 0;
|
2011-07-17 03:23:14 +08:00
|
|
|
path->locks[level] = BTRFS_WRITE_LOCK_BLOCKING;
|
2009-06-28 09:07:35 +08:00
|
|
|
|
|
|
|
wc->refs[parent_level] = 1;
|
|
|
|
wc->flags[parent_level] = BTRFS_BLOCK_FLAG_FULL_BACKREF;
|
|
|
|
wc->level = level;
|
|
|
|
wc->shared_level = -1;
|
|
|
|
wc->stage = DROP_REFERENCE;
|
|
|
|
wc->update_ref = 0;
|
|
|
|
wc->keep_locks = 1;
|
2011-09-12 21:26:38 +08:00
|
|
|
wc->for_reloc = 1;
|
2009-09-22 03:55:59 +08:00
|
|
|
wc->reada_count = BTRFS_NODEPTRS_PER_BLOCK(root);
|
2008-10-30 02:49:05 +08:00
|
|
|
|
|
|
|
while (1) {
|
2009-06-28 09:07:35 +08:00
|
|
|
wret = walk_down_tree(trans, root, path, wc);
|
|
|
|
if (wret < 0) {
|
2008-10-30 02:49:05 +08:00
|
|
|
ret = wret;
|
|
|
|
break;
|
2009-06-28 09:07:35 +08:00
|
|
|
}
|
2008-10-30 02:49:05 +08:00
|
|
|
|
2009-06-28 09:07:35 +08:00
|
|
|
wret = walk_up_tree(trans, root, path, wc, parent_level);
|
2008-10-30 02:49:05 +08:00
|
|
|
if (wret < 0)
|
|
|
|
ret = wret;
|
|
|
|
if (wret != 0)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2009-06-28 09:07:35 +08:00
|
|
|
kfree(wc);
|
2008-10-30 02:49:05 +08:00
|
|
|
btrfs_free_path(path);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2008-04-29 03:29:52 +08:00
|
|
|
static u64 update_block_group_flags(struct btrfs_root *root, u64 flags)
|
|
|
|
{
|
|
|
|
u64 num_devices;
|
2012-03-27 22:09:17 +08:00
|
|
|
u64 stripped;
|
2012-01-17 04:04:48 +08:00
|
|
|
|
2012-03-27 22:09:17 +08:00
|
|
|
/*
|
|
|
|
* if restripe for this chunk_type is on pick target profile and
|
|
|
|
* return, otherwise do the usual balance
|
|
|
|
*/
|
|
|
|
stripped = get_restripe_target(root->fs_info, flags);
|
|
|
|
if (stripped)
|
|
|
|
return extended_to_chunk(stripped);
|
2012-01-17 04:04:48 +08:00
|
|
|
|
2010-12-14 03:56:23 +08:00
|
|
|
/*
|
|
|
|
* we add in the count of missing devices because we want
|
|
|
|
* to make sure that any RAID levels on a degraded FS
|
|
|
|
* continue to be honored.
|
|
|
|
*/
|
|
|
|
num_devices = root->fs_info->fs_devices->rw_devices +
|
|
|
|
root->fs_info->fs_devices->missing_devices;
|
|
|
|
|
2012-03-27 22:09:17 +08:00
|
|
|
stripped = BTRFS_BLOCK_GROUP_RAID0 |
|
2013-01-30 07:40:14 +08:00
|
|
|
BTRFS_BLOCK_GROUP_RAID5 | BTRFS_BLOCK_GROUP_RAID6 |
|
2012-03-27 22:09:17 +08:00
|
|
|
BTRFS_BLOCK_GROUP_RAID1 | BTRFS_BLOCK_GROUP_RAID10;
|
|
|
|
|
2008-04-29 03:29:52 +08:00
|
|
|
if (num_devices == 1) {
|
|
|
|
stripped |= BTRFS_BLOCK_GROUP_DUP;
|
|
|
|
stripped = flags & ~stripped;
|
|
|
|
|
|
|
|
/* turn raid0 into single device chunks */
|
|
|
|
if (flags & BTRFS_BLOCK_GROUP_RAID0)
|
|
|
|
return stripped;
|
|
|
|
|
|
|
|
/* turn mirroring into duplication */
|
|
|
|
if (flags & (BTRFS_BLOCK_GROUP_RAID1 |
|
|
|
|
BTRFS_BLOCK_GROUP_RAID10))
|
|
|
|
return stripped | BTRFS_BLOCK_GROUP_DUP;
|
|
|
|
} else {
|
|
|
|
/* they already had raid on here, just return */
|
|
|
|
if (flags & stripped)
|
|
|
|
return flags;
|
|
|
|
|
|
|
|
stripped |= BTRFS_BLOCK_GROUP_DUP;
|
|
|
|
stripped = flags & ~stripped;
|
|
|
|
|
|
|
|
/* switch duplicated blocks with raid1 */
|
|
|
|
if (flags & BTRFS_BLOCK_GROUP_DUP)
|
|
|
|
return stripped | BTRFS_BLOCK_GROUP_RAID1;
|
|
|
|
|
2012-03-27 22:09:16 +08:00
|
|
|
/* this is drive concat, leave it alone */
|
2008-04-29 03:29:52 +08:00
|
|
|
}
|
2012-03-27 22:09:16 +08:00
|
|
|
|
2008-04-29 03:29:52 +08:00
|
|
|
return flags;
|
|
|
|
}
|
|
|
|
|
2011-07-15 18:34:36 +08:00
|
|
|
static int set_block_group_ro(struct btrfs_block_group_cache *cache, int force)
|
2008-05-25 02:04:53 +08:00
|
|
|
{
|
2010-05-16 22:46:25 +08:00
|
|
|
struct btrfs_space_info *sinfo = cache->space_info;
|
|
|
|
u64 num_bytes;
|
2011-07-15 18:34:36 +08:00
|
|
|
u64 min_allocable_bytes;
|
2010-05-16 22:46:25 +08:00
|
|
|
int ret = -ENOSPC;
|
2008-05-25 02:04:53 +08:00
|
|
|
|
2008-07-23 11:06:41 +08:00
|
|
|
|
2011-07-15 18:34:36 +08:00
|
|
|
/*
|
|
|
|
* We need some metadata space and system metadata space for
|
|
|
|
* allocating chunks in some corner cases until we force to set
|
|
|
|
* it to be readonly.
|
|
|
|
*/
|
|
|
|
if ((sinfo->flags &
|
|
|
|
(BTRFS_BLOCK_GROUP_SYSTEM | BTRFS_BLOCK_GROUP_METADATA)) &&
|
|
|
|
!force)
|
|
|
|
min_allocable_bytes = 1 * 1024 * 1024;
|
|
|
|
else
|
|
|
|
min_allocable_bytes = 0;
|
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
spin_lock(&sinfo->lock);
|
|
|
|
spin_lock(&cache->lock);
|
2011-07-26 11:30:11 +08:00
|
|
|
|
|
|
|
if (cache->ro) {
|
|
|
|
ret = 0;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
num_bytes = cache->key.offset - cache->reserved - cache->pinned -
|
|
|
|
cache->bytes_super - btrfs_block_group_used(&cache->item);
|
|
|
|
|
|
|
|
if (sinfo->bytes_used + sinfo->bytes_reserved + sinfo->bytes_pinned +
|
2011-08-05 22:25:38 +08:00
|
|
|
sinfo->bytes_may_use + sinfo->bytes_readonly + num_bytes +
|
|
|
|
min_allocable_bytes <= sinfo->total_bytes) {
|
2010-05-16 22:46:25 +08:00
|
|
|
sinfo->bytes_readonly += num_bytes;
|
|
|
|
cache->ro = 1;
|
|
|
|
ret = 0;
|
|
|
|
}
|
2011-07-26 11:30:11 +08:00
|
|
|
out:
|
2010-05-16 22:46:25 +08:00
|
|
|
spin_unlock(&cache->lock);
|
|
|
|
spin_unlock(&sinfo->lock);
|
|
|
|
return ret;
|
|
|
|
}
|
2008-07-09 02:19:17 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
int btrfs_set_block_group_ro(struct btrfs_root *root,
|
|
|
|
struct btrfs_block_group_cache *cache)
|
2008-07-23 11:06:41 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
{
|
|
|
|
struct btrfs_trans_handle *trans;
|
|
|
|
u64 alloc_flags;
|
|
|
|
int ret;
|
2008-07-09 02:19:17 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
BUG_ON(cache->ro);
|
2008-05-25 02:04:53 +08:00
|
|
|
|
2011-05-28 19:00:39 +08:00
|
|
|
trans = btrfs_join_transaction(root);
|
2012-03-12 23:03:00 +08:00
|
|
|
if (IS_ERR(trans))
|
|
|
|
return PTR_ERR(trans);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
alloc_flags = update_block_group_flags(root, cache->flags);
|
2012-03-12 23:03:00 +08:00
|
|
|
if (alloc_flags != cache->flags) {
|
2012-09-13 02:08:47 +08:00
|
|
|
ret = do_chunk_alloc(trans, root, alloc_flags,
|
2012-03-12 23:03:00 +08:00
|
|
|
CHUNK_ALLOC_FORCE);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
}
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
|
2011-07-15 18:34:36 +08:00
|
|
|
ret = set_block_group_ro(cache, 0);
|
2010-05-16 22:46:25 +08:00
|
|
|
if (!ret)
|
|
|
|
goto out;
|
|
|
|
alloc_flags = get_alloc_profile(root, cache->space_info->flags);
|
2012-09-13 02:08:47 +08:00
|
|
|
ret = do_chunk_alloc(trans, root, alloc_flags,
|
2011-04-16 04:05:44 +08:00
|
|
|
CHUNK_ALLOC_FORCE);
|
2010-05-16 22:46:25 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
2011-07-15 18:34:36 +08:00
|
|
|
ret = set_block_group_ro(cache, 0);
|
2010-05-16 22:46:25 +08:00
|
|
|
out:
|
|
|
|
btrfs_end_transaction(trans, root);
|
|
|
|
return ret;
|
|
|
|
}
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
|
2011-02-17 02:57:04 +08:00
|
|
|
int btrfs_force_chunk_alloc(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root, u64 type)
|
|
|
|
{
|
|
|
|
u64 alloc_flags = get_alloc_profile(root, type);
|
2012-09-13 02:08:47 +08:00
|
|
|
return do_chunk_alloc(trans, root, alloc_flags,
|
2011-04-16 04:05:44 +08:00
|
|
|
CHUNK_ALLOC_FORCE);
|
2011-02-17 02:57:04 +08:00
|
|
|
}
|
|
|
|
|
btrfs: fix wrong free space information of btrfs
When we store data by raid profile in btrfs with two or more different size
disks, df command shows there is some free space in the filesystem, but the
user can not write any data in fact, df command shows the wrong free space
information of btrfs.
# mkfs.btrfs -d raid1 /dev/sda9 /dev/sda10
# btrfs-show
Label: none uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64
Total devices 2 FS bytes used 28.00KB
devid 1 size 5.01GB used 2.03GB path /dev/sda9
devid 2 size 10.00GB used 2.01GB path /dev/sda10
# btrfs device scan /dev/sda9 /dev/sda10
# mount /dev/sda9 /mnt
# dd if=/dev/zero of=tmpfile0 bs=4K count=9999999999
(fill the filesystem)
# sync
# df -TH
Filesystem Type Size Used Avail Use% Mounted on
/dev/sda9 btrfs 17G 8.6G 5.4G 62% /mnt
# btrfs-show
Label: none uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64
Total devices 2 FS bytes used 3.99GB
devid 1 size 5.01GB used 5.01GB path /dev/sda9
devid 2 size 10.00GB used 4.99GB path /dev/sda10
It is because btrfs cannot allocate chunks when one of the pairing disks has
no space, the free space on the other disks can not be used for ever, and should
be subtracted from the total space, but btrfs doesn't subtract this space from
the total. It is strange to the user.
This patch fixes it by calcing the free space that can be used to allocate
chunks.
Implementation:
1. get all the devices free space, and align them by stripe length.
2. sort the devices by the free space.
3. check the free space of the devices,
3.1. if it is not zero, and then check the number of the devices that has
more free space than this device,
if the number of the devices is beyond the min stripe number, the free
space can be used, and add into total free space.
if the number of the devices is below the min stripe number, we can not
use the free space, the check ends.
3.2. if the free space is zero, check the next devices, goto 3.1
This implementation is just likely fake chunk allocation.
After appling this patch, df can show correct space information:
# df -TH
Filesystem Type Size Used Avail Use% Mounted on
/dev/sda9 btrfs 17G 8.6G 0 100% /mnt
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-05 18:07:31 +08:00
|
|
|
/*
|
|
|
|
* helper to account the unused space of all the readonly block group in the
|
|
|
|
* list. takes mirrors into account.
|
|
|
|
*/
|
|
|
|
static u64 __btrfs_get_ro_block_group_free_space(struct list_head *groups_list)
|
|
|
|
{
|
|
|
|
struct btrfs_block_group_cache *block_group;
|
|
|
|
u64 free_bytes = 0;
|
|
|
|
int factor;
|
|
|
|
|
|
|
|
list_for_each_entry(block_group, groups_list, list) {
|
|
|
|
spin_lock(&block_group->lock);
|
|
|
|
|
|
|
|
if (!block_group->ro) {
|
|
|
|
spin_unlock(&block_group->lock);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (block_group->flags & (BTRFS_BLOCK_GROUP_RAID1 |
|
|
|
|
BTRFS_BLOCK_GROUP_RAID10 |
|
|
|
|
BTRFS_BLOCK_GROUP_DUP))
|
|
|
|
factor = 2;
|
|
|
|
else
|
|
|
|
factor = 1;
|
|
|
|
|
|
|
|
free_bytes += (block_group->key.offset -
|
|
|
|
btrfs_block_group_used(&block_group->item)) *
|
|
|
|
factor;
|
|
|
|
|
|
|
|
spin_unlock(&block_group->lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
return free_bytes;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* helper to account the unused space of all the readonly block group in the
|
|
|
|
* space_info. takes mirrors into account.
|
|
|
|
*/
|
|
|
|
u64 btrfs_account_ro_block_groups_free_space(struct btrfs_space_info *sinfo)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
u64 free_bytes = 0;
|
|
|
|
|
|
|
|
spin_lock(&sinfo->lock);
|
|
|
|
|
|
|
|
for(i = 0; i < BTRFS_NR_RAID_TYPES; i++)
|
|
|
|
if (!list_empty(&sinfo->block_groups[i]))
|
|
|
|
free_bytes += __btrfs_get_ro_block_group_free_space(
|
|
|
|
&sinfo->block_groups[i]);
|
|
|
|
|
|
|
|
spin_unlock(&sinfo->lock);
|
|
|
|
|
|
|
|
return free_bytes;
|
|
|
|
}
|
|
|
|
|
2012-03-01 21:56:26 +08:00
|
|
|
void btrfs_set_block_group_rw(struct btrfs_root *root,
|
2010-05-16 22:46:25 +08:00
|
|
|
struct btrfs_block_group_cache *cache)
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
{
|
2010-05-16 22:46:25 +08:00
|
|
|
struct btrfs_space_info *sinfo = cache->space_info;
|
|
|
|
u64 num_bytes;
|
|
|
|
|
|
|
|
BUG_ON(!cache->ro);
|
|
|
|
|
|
|
|
spin_lock(&sinfo->lock);
|
|
|
|
spin_lock(&cache->lock);
|
|
|
|
num_bytes = cache->key.offset - cache->reserved - cache->pinned -
|
|
|
|
cache->bytes_super - btrfs_block_group_used(&cache->item);
|
|
|
|
sinfo->bytes_readonly -= num_bytes;
|
|
|
|
cache->ro = 0;
|
|
|
|
spin_unlock(&cache->lock);
|
|
|
|
spin_unlock(&sinfo->lock);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
}
|
|
|
|
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
/*
|
|
|
|
* checks to see if its even possible to relocate this block group.
|
|
|
|
*
|
|
|
|
* @return - -1 if it's not a good idea to relocate this block group, 0 if its
|
|
|
|
* ok to go ahead and try.
|
|
|
|
*/
|
|
|
|
int btrfs_can_relocate(struct btrfs_root *root, u64 bytenr)
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
{
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
struct btrfs_block_group_cache *block_group;
|
|
|
|
struct btrfs_space_info *space_info;
|
|
|
|
struct btrfs_fs_devices *fs_devices = root->fs_info->fs_devices;
|
|
|
|
struct btrfs_device *device;
|
2011-08-03 18:15:25 +08:00
|
|
|
u64 min_free;
|
2011-08-20 20:29:51 +08:00
|
|
|
u64 dev_min = 1;
|
|
|
|
u64 dev_nr = 0;
|
2012-03-27 22:09:17 +08:00
|
|
|
u64 target;
|
2011-08-03 18:15:25 +08:00
|
|
|
int index;
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
int full = 0;
|
|
|
|
int ret = 0;
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
block_group = btrfs_lookup_block_group(root->fs_info, bytenr);
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
/* odd, couldn't find the block group, leave it alone */
|
|
|
|
if (!block_group)
|
|
|
|
return -1;
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
|
2011-08-03 18:15:25 +08:00
|
|
|
min_free = btrfs_block_group_used(&block_group->item);
|
|
|
|
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
/* no bytes used, we're good */
|
2011-08-03 18:15:25 +08:00
|
|
|
if (!min_free)
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
goto out;
|
|
|
|
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
space_info = block_group->space_info;
|
|
|
|
spin_lock(&space_info->lock);
|
2008-12-12 23:03:38 +08:00
|
|
|
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
full = space_info->full;
|
2008-12-12 23:03:38 +08:00
|
|
|
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
/*
|
|
|
|
* if this is the last block group we have in this space, we can't
|
2009-09-23 02:48:44 +08:00
|
|
|
* relocate it unless we're able to allocate a new chunk below.
|
|
|
|
*
|
|
|
|
* Otherwise, we need to make sure we have room in the space to handle
|
|
|
|
* all of the extents from this block group. If we can, we're good
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
*/
|
2009-09-23 02:48:44 +08:00
|
|
|
if ((space_info->total_bytes != block_group->key.offset) &&
|
2011-08-03 18:15:25 +08:00
|
|
|
(space_info->bytes_used + space_info->bytes_reserved +
|
|
|
|
space_info->bytes_pinned + space_info->bytes_readonly +
|
|
|
|
min_free < space_info->total_bytes)) {
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
goto out;
|
2008-12-12 23:03:38 +08:00
|
|
|
}
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
spin_unlock(&space_info->lock);
|
2008-08-05 11:17:27 +08:00
|
|
|
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
/*
|
|
|
|
* ok we don't have enough space, but maybe we have free space on our
|
|
|
|
* devices to allocate new chunks for relocation, so loop through our
|
2012-03-27 22:09:17 +08:00
|
|
|
* alloc devices and guess if we have enough space. if this block
|
|
|
|
* group is going to be restriped, run checks against the target
|
|
|
|
* profile instead of the current one.
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
*/
|
|
|
|
ret = -1;
|
2008-08-05 11:17:27 +08:00
|
|
|
|
2011-08-03 18:15:25 +08:00
|
|
|
/*
|
|
|
|
* index:
|
|
|
|
* 0: raid10
|
|
|
|
* 1: raid1
|
|
|
|
* 2: dup
|
|
|
|
* 3: raid0
|
|
|
|
* 4: single
|
|
|
|
*/
|
2012-03-27 22:09:17 +08:00
|
|
|
target = get_restripe_target(root->fs_info, block_group->flags);
|
|
|
|
if (target) {
|
2012-11-21 22:18:10 +08:00
|
|
|
index = __get_raid_index(extended_to_chunk(target));
|
2012-03-27 22:09:17 +08:00
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* this is just a balance, so if we were marked as full
|
|
|
|
* we know there is no space for a new chunk
|
|
|
|
*/
|
|
|
|
if (full)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
index = get_block_group_index(block_group);
|
|
|
|
}
|
|
|
|
|
2013-01-17 13:38:51 +08:00
|
|
|
if (index == BTRFS_RAID_RAID10) {
|
2011-08-03 18:15:25 +08:00
|
|
|
dev_min = 4;
|
2011-08-20 20:29:51 +08:00
|
|
|
/* Divide by 2 */
|
|
|
|
min_free >>= 1;
|
2013-01-17 13:38:51 +08:00
|
|
|
} else if (index == BTRFS_RAID_RAID1) {
|
2011-08-03 18:15:25 +08:00
|
|
|
dev_min = 2;
|
2013-01-17 13:38:51 +08:00
|
|
|
} else if (index == BTRFS_RAID_DUP) {
|
2011-08-20 20:29:51 +08:00
|
|
|
/* Multiply by 2 */
|
|
|
|
min_free <<= 1;
|
2013-01-17 13:38:51 +08:00
|
|
|
} else if (index == BTRFS_RAID_RAID0) {
|
2011-08-03 18:15:25 +08:00
|
|
|
dev_min = fs_devices->rw_devices;
|
2011-08-20 20:29:51 +08:00
|
|
|
do_div(min_free, dev_min);
|
2011-08-03 18:15:25 +08:00
|
|
|
}
|
|
|
|
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
mutex_lock(&root->fs_info->chunk_mutex);
|
|
|
|
list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list) {
|
2011-01-05 18:07:26 +08:00
|
|
|
u64 dev_offset;
|
2009-03-13 22:10:06 +08:00
|
|
|
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
/*
|
|
|
|
* check to make sure we can actually find a chunk with enough
|
|
|
|
* space to fit our block group in.
|
|
|
|
*/
|
2012-11-06 01:29:28 +08:00
|
|
|
if (device->total_bytes > device->bytes_used + min_free &&
|
|
|
|
!device->is_tgtdev_for_dev_replace) {
|
2011-12-08 15:07:24 +08:00
|
|
|
ret = find_free_dev_extent(device, min_free,
|
2011-01-05 18:07:26 +08:00
|
|
|
&dev_offset, NULL);
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
if (!ret)
|
2011-08-03 18:15:25 +08:00
|
|
|
dev_nr++;
|
|
|
|
|
|
|
|
if (dev_nr >= dev_min)
|
2008-01-04 03:14:39 +08:00
|
|
|
break;
|
2011-08-03 18:15:25 +08:00
|
|
|
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
ret = -1;
|
2008-01-05 05:47:16 +08:00
|
|
|
}
|
2007-12-22 05:27:24 +08:00
|
|
|
}
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
mutex_unlock(&root->fs_info->chunk_mutex);
|
2007-12-22 05:27:24 +08:00
|
|
|
out:
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
btrfs_put_block_group(block_group);
|
2007-12-22 05:27:24 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2008-12-02 22:54:17 +08:00
|
|
|
static int find_first_block_group(struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path, struct btrfs_key *key)
|
2008-03-25 03:01:56 +08:00
|
|
|
{
|
2008-06-26 04:01:30 +08:00
|
|
|
int ret = 0;
|
2008-03-25 03:01:56 +08:00
|
|
|
struct btrfs_key found_key;
|
|
|
|
struct extent_buffer *leaf;
|
|
|
|
int slot;
|
2007-12-22 05:27:24 +08:00
|
|
|
|
2008-03-25 03:01:56 +08:00
|
|
|
ret = btrfs_search_slot(NULL, root, key, path, 0, 0);
|
|
|
|
if (ret < 0)
|
2008-06-26 04:01:30 +08:00
|
|
|
goto out;
|
|
|
|
|
2009-01-06 10:25:51 +08:00
|
|
|
while (1) {
|
2008-03-25 03:01:56 +08:00
|
|
|
slot = path->slots[0];
|
2007-12-22 05:27:24 +08:00
|
|
|
leaf = path->nodes[0];
|
2008-03-25 03:01:56 +08:00
|
|
|
if (slot >= btrfs_header_nritems(leaf)) {
|
|
|
|
ret = btrfs_next_leaf(root, path);
|
|
|
|
if (ret == 0)
|
|
|
|
continue;
|
|
|
|
if (ret < 0)
|
2008-06-26 04:01:30 +08:00
|
|
|
goto out;
|
2008-03-25 03:01:56 +08:00
|
|
|
break;
|
2007-12-22 05:27:24 +08:00
|
|
|
}
|
2008-03-25 03:01:56 +08:00
|
|
|
btrfs_item_key_to_cpu(leaf, &found_key, slot);
|
2007-12-22 05:27:24 +08:00
|
|
|
|
2008-03-25 03:01:56 +08:00
|
|
|
if (found_key.objectid >= key->objectid &&
|
2008-06-26 04:01:30 +08:00
|
|
|
found_key.type == BTRFS_BLOCK_GROUP_ITEM_KEY) {
|
|
|
|
ret = 0;
|
|
|
|
goto out;
|
|
|
|
}
|
2008-03-25 03:01:56 +08:00
|
|
|
path->slots[0]++;
|
2007-12-22 05:27:24 +08:00
|
|
|
}
|
2008-06-26 04:01:30 +08:00
|
|
|
out:
|
2008-03-25 03:01:56 +08:00
|
|
|
return ret;
|
2007-12-22 05:27:24 +08:00
|
|
|
}
|
|
|
|
|
2010-06-22 02:48:16 +08:00
|
|
|
void btrfs_put_block_group_cache(struct btrfs_fs_info *info)
|
|
|
|
{
|
|
|
|
struct btrfs_block_group_cache *block_group;
|
|
|
|
u64 last = 0;
|
|
|
|
|
|
|
|
while (1) {
|
|
|
|
struct inode *inode;
|
|
|
|
|
|
|
|
block_group = btrfs_lookup_first_block_group(info, last);
|
|
|
|
while (block_group) {
|
|
|
|
spin_lock(&block_group->lock);
|
|
|
|
if (block_group->iref)
|
|
|
|
break;
|
|
|
|
spin_unlock(&block_group->lock);
|
|
|
|
block_group = next_block_group(info->tree_root,
|
|
|
|
block_group);
|
|
|
|
}
|
|
|
|
if (!block_group) {
|
|
|
|
if (last == 0)
|
|
|
|
break;
|
|
|
|
last = 0;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
inode = block_group->inode;
|
|
|
|
block_group->iref = 0;
|
|
|
|
block_group->inode = NULL;
|
|
|
|
spin_unlock(&block_group->lock);
|
|
|
|
iput(inode);
|
|
|
|
last = block_group->key.objectid + block_group->key.offset;
|
|
|
|
btrfs_put_block_group(block_group);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
int btrfs_free_block_groups(struct btrfs_fs_info *info)
|
|
|
|
{
|
|
|
|
struct btrfs_block_group_cache *block_group;
|
2009-03-11 00:39:20 +08:00
|
|
|
struct btrfs_space_info *space_info;
|
2009-09-12 04:11:19 +08:00
|
|
|
struct btrfs_caching_control *caching_ctl;
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
struct rb_node *n;
|
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
down_write(&info->extent_commit_sem);
|
|
|
|
while (!list_empty(&info->caching_block_groups)) {
|
|
|
|
caching_ctl = list_entry(info->caching_block_groups.next,
|
|
|
|
struct btrfs_caching_control, list);
|
|
|
|
list_del(&caching_ctl->list);
|
|
|
|
put_caching_control(caching_ctl);
|
|
|
|
}
|
|
|
|
up_write(&info->extent_commit_sem);
|
|
|
|
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
spin_lock(&info->block_group_cache_lock);
|
|
|
|
while ((n = rb_last(&info->block_group_cache_tree)) != NULL) {
|
|
|
|
block_group = rb_entry(n, struct btrfs_block_group_cache,
|
|
|
|
cache_node);
|
|
|
|
rb_erase(&block_group->cache_node,
|
|
|
|
&info->block_group_cache_tree);
|
2008-10-31 02:25:28 +08:00
|
|
|
spin_unlock(&info->block_group_cache_lock);
|
|
|
|
|
Btrfs: fix enospc when there is plenty of space
So there is an odd case where we can possibly return -ENOSPC when there is in
fact space to be had. It only happens with Metadata writes, and happens _very_
infrequently. What has to happen is we have to allocate have allocated out of
the first logical byte on the disk, which would set last_alloc to
first_logical_byte(root, 0), so search_start == orig_search_start. We then
need to allocate for normal metadata, so BTRFS_BLOCK_GROUP_METADATA |
BTRFS_BLOCK_GROUP_DUP. We will do a block lookup for the given search_start,
block_group_bits() won't match and we'll go to choose another block group.
However because search_start matches orig_search_start we go to see if we can
allocate a chunk.
If we are in the situation that we cannot allocate a chunk, we fail and ENOSPC.
This is kind of a big flaw of the way find_free_extent works, as it along with
find_free_space loop through _all_ of the block groups, not just the ones that
we want to allocate out of. This patch completely kills find_free_space and
rolls it into find_free_extent. I've introduced a sort of state machine into
this, which will make it easier to get cache miss information out of the
allocator, and will work well with my locking changes.
The basic flow is this: We have the variable loop which is 0, meaning we are
in the hint phase. We lookup the block group for the hint, and lookup the
space_info for what we want to allocate out of. If the block group we were
pointed at by the hint either isn't of the correct type, or just doesn't have
the space we need, we set head to space_info->block_groups, so we start at the
beginning of the block groups for this particular space info, and loop through.
This is also where we add the empty_cluster to total_needed. At this point
loop is set to 1 and we just loop through all of the block groups for this
particular space_info looking for the space we need, just as find_free_space
would have done, except we only hit the block groups we want and not _all_ of
the block groups. If we come full circle we see if we can allocate a chunk.
If we cannot of course we exit with -ENOSPC and we are good. If not we start
over at space_info->block_groups and loop through again, with loop == 2. If we
come full circle and haven't found what we need then we exit with -ENOSPC.
I've been running this for a couple of days now and it seems stable, and I
haven't yet hit a -ENOSPC when there was plenty of space left.
Also I've made a groups_sem to handle the group list for the space_info. This
is part of my locking changes, but is relatively safe and seems better than
holding the space_info spinlock over that entire search time. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
down_write(&block_group->space_info->groups_sem);
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
list_del(&block_group->list);
|
Btrfs: fix enospc when there is plenty of space
So there is an odd case where we can possibly return -ENOSPC when there is in
fact space to be had. It only happens with Metadata writes, and happens _very_
infrequently. What has to happen is we have to allocate have allocated out of
the first logical byte on the disk, which would set last_alloc to
first_logical_byte(root, 0), so search_start == orig_search_start. We then
need to allocate for normal metadata, so BTRFS_BLOCK_GROUP_METADATA |
BTRFS_BLOCK_GROUP_DUP. We will do a block lookup for the given search_start,
block_group_bits() won't match and we'll go to choose another block group.
However because search_start matches orig_search_start we go to see if we can
allocate a chunk.
If we are in the situation that we cannot allocate a chunk, we fail and ENOSPC.
This is kind of a big flaw of the way find_free_extent works, as it along with
find_free_space loop through _all_ of the block groups, not just the ones that
we want to allocate out of. This patch completely kills find_free_space and
rolls it into find_free_extent. I've introduced a sort of state machine into
this, which will make it easier to get cache miss information out of the
allocator, and will work well with my locking changes.
The basic flow is this: We have the variable loop which is 0, meaning we are
in the hint phase. We lookup the block group for the hint, and lookup the
space_info for what we want to allocate out of. If the block group we were
pointed at by the hint either isn't of the correct type, or just doesn't have
the space we need, we set head to space_info->block_groups, so we start at the
beginning of the block groups for this particular space info, and loop through.
This is also where we add the empty_cluster to total_needed. At this point
loop is set to 1 and we just loop through all of the block groups for this
particular space_info looking for the space we need, just as find_free_space
would have done, except we only hit the block groups we want and not _all_ of
the block groups. If we come full circle we see if we can allocate a chunk.
If we cannot of course we exit with -ENOSPC and we are good. If not we start
over at space_info->block_groups and loop through again, with loop == 2. If we
come full circle and haven't found what we need then we exit with -ENOSPC.
I've been running this for a couple of days now and it seems stable, and I
haven't yet hit a -ENOSPC when there was plenty of space left.
Also I've made a groups_sem to handle the group list for the space_info. This
is part of my locking changes, but is relatively safe and seems better than
holding the space_info spinlock over that entire search time. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
up_write(&block_group->space_info->groups_sem);
|
2008-12-12 05:30:39 +08:00
|
|
|
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
if (block_group->cached == BTRFS_CACHE_STARTED)
|
2009-09-12 04:11:19 +08:00
|
|
|
wait_block_group_cache_done(block_group);
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
|
2011-02-02 23:53:47 +08:00
|
|
|
/*
|
|
|
|
* We haven't cached this block group, which means we could
|
|
|
|
* possibly have excluded extents on this block group.
|
|
|
|
*/
|
|
|
|
if (block_group->cached == BTRFS_CACHE_NO)
|
|
|
|
free_excluded_extents(info->extent_root, block_group);
|
|
|
|
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
btrfs_remove_free_space_cache(block_group);
|
2009-11-14 04:12:59 +08:00
|
|
|
btrfs_put_block_group(block_group);
|
2008-10-31 02:25:28 +08:00
|
|
|
|
|
|
|
spin_lock(&info->block_group_cache_lock);
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
}
|
|
|
|
spin_unlock(&info->block_group_cache_lock);
|
2009-03-11 00:39:20 +08:00
|
|
|
|
|
|
|
/* now that all the block groups are freed, go through and
|
|
|
|
* free all the space_info structs. This is only called during
|
|
|
|
* the final stages of unmount, and so we know nobody is
|
|
|
|
* using them. We call synchronize_rcu() once before we start,
|
|
|
|
* just to be on the safe side.
|
|
|
|
*/
|
|
|
|
synchronize_rcu();
|
|
|
|
|
2010-05-16 22:49:58 +08:00
|
|
|
release_global_block_rsv(info);
|
|
|
|
|
2009-03-11 00:39:20 +08:00
|
|
|
while(!list_empty(&info->space_info)) {
|
|
|
|
space_info = list_entry(info->space_info.next,
|
|
|
|
struct btrfs_space_info,
|
|
|
|
list);
|
2013-02-09 05:28:17 +08:00
|
|
|
if (btrfs_test_opt(info->tree_root, ENOSPC_DEBUG)) {
|
|
|
|
if (space_info->bytes_pinned > 0 ||
|
|
|
|
space_info->bytes_reserved > 0 ||
|
|
|
|
space_info->bytes_may_use > 0) {
|
|
|
|
WARN_ON(1);
|
|
|
|
dump_space_info(space_info, 0, 0);
|
|
|
|
}
|
2010-05-16 22:46:25 +08:00
|
|
|
}
|
2009-03-11 00:39:20 +08:00
|
|
|
list_del(&space_info->list);
|
|
|
|
kfree(space_info);
|
|
|
|
}
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2010-05-16 22:46:24 +08:00
|
|
|
static void __link_block_group(struct btrfs_space_info *space_info,
|
|
|
|
struct btrfs_block_group_cache *cache)
|
|
|
|
{
|
|
|
|
int index = get_block_group_index(cache);
|
|
|
|
|
|
|
|
down_write(&space_info->groups_sem);
|
|
|
|
list_add_tail(&cache->list, &space_info->block_groups[index]);
|
|
|
|
up_write(&space_info->groups_sem);
|
|
|
|
}
|
|
|
|
|
2007-04-27 04:46:15 +08:00
|
|
|
int btrfs_read_block_groups(struct btrfs_root *root)
|
|
|
|
{
|
|
|
|
struct btrfs_path *path;
|
|
|
|
int ret;
|
|
|
|
struct btrfs_block_group_cache *cache;
|
2007-05-06 22:15:01 +08:00
|
|
|
struct btrfs_fs_info *info = root->fs_info;
|
2008-03-25 03:01:59 +08:00
|
|
|
struct btrfs_space_info *space_info;
|
2007-04-27 04:46:15 +08:00
|
|
|
struct btrfs_key key;
|
|
|
|
struct btrfs_key found_key;
|
2007-10-16 04:14:19 +08:00
|
|
|
struct extent_buffer *leaf;
|
2010-06-22 02:48:16 +08:00
|
|
|
int need_clear = 0;
|
|
|
|
u64 cache_gen;
|
2007-10-16 04:15:19 +08:00
|
|
|
|
2007-05-06 22:15:01 +08:00
|
|
|
root = info->extent_root;
|
2007-04-27 04:46:15 +08:00
|
|
|
key.objectid = 0;
|
2008-03-25 03:01:56 +08:00
|
|
|
key.offset = 0;
|
2007-04-27 04:46:15 +08:00
|
|
|
btrfs_set_key_type(&key, BTRFS_BLOCK_GROUP_ITEM_KEY);
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
2011-05-13 22:32:11 +08:00
|
|
|
path->reada = 1;
|
2007-04-27 04:46:15 +08:00
|
|
|
|
2011-04-13 21:41:04 +08:00
|
|
|
cache_gen = btrfs_super_cache_generation(root->fs_info->super_copy);
|
2011-10-04 02:07:49 +08:00
|
|
|
if (btrfs_test_opt(root, SPACE_CACHE) &&
|
2011-04-13 21:41:04 +08:00
|
|
|
btrfs_super_generation(root->fs_info->super_copy) != cache_gen)
|
2010-06-22 02:48:16 +08:00
|
|
|
need_clear = 1;
|
2010-09-22 02:21:34 +08:00
|
|
|
if (btrfs_test_opt(root, CLEAR_CACHE))
|
|
|
|
need_clear = 1;
|
2010-06-22 02:48:16 +08:00
|
|
|
|
2009-01-06 10:25:51 +08:00
|
|
|
while (1) {
|
2008-03-25 03:01:56 +08:00
|
|
|
ret = find_first_block_group(root, path, &key);
|
2010-05-16 22:46:24 +08:00
|
|
|
if (ret > 0)
|
|
|
|
break;
|
2008-03-25 03:01:56 +08:00
|
|
|
if (ret != 0)
|
|
|
|
goto error;
|
2007-10-16 04:14:19 +08:00
|
|
|
leaf = path->nodes[0];
|
|
|
|
btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
|
2008-04-26 04:53:30 +08:00
|
|
|
cache = kzalloc(sizeof(*cache), GFP_NOFS);
|
2007-04-27 04:46:15 +08:00
|
|
|
if (!cache) {
|
2008-03-25 03:01:56 +08:00
|
|
|
ret = -ENOMEM;
|
2010-05-16 22:46:25 +08:00
|
|
|
goto error;
|
2007-04-27 04:46:15 +08:00
|
|
|
}
|
2011-03-29 13:46:06 +08:00
|
|
|
cache->free_space_ctl = kzalloc(sizeof(*cache->free_space_ctl),
|
|
|
|
GFP_NOFS);
|
|
|
|
if (!cache->free_space_ctl) {
|
|
|
|
kfree(cache);
|
|
|
|
ret = -ENOMEM;
|
|
|
|
goto error;
|
|
|
|
}
|
2007-05-08 08:03:49 +08:00
|
|
|
|
2008-12-12 05:30:39 +08:00
|
|
|
atomic_set(&cache->count, 1);
|
2008-07-23 11:06:41 +08:00
|
|
|
spin_lock_init(&cache->lock);
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
cache->fs_info = info;
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
INIT_LIST_HEAD(&cache->list);
|
2009-04-03 21:47:43 +08:00
|
|
|
INIT_LIST_HEAD(&cache->cluster_list);
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
|
Btrfs: fix a bug of writting free space cache during balance
Here is the whole story:
1)
A free space cache consists of two parts:
o free space cache inode, which is special becase it's stored in root tree.
o free space info, which is stored as the above inode's file data.
But we only build up another new inode and does not flush its free space info
onto disk when we _clear and setup_ free space cache, and this ends up with
that the block group cache's cache_state remains DC_SETUP instead of DC_WRITTEN.
And holding DC_SETUP means that we will not truncate this free space cache inode,
which means the disk offset of its file extent will remain _unchanged_ at least
until next transaction finishes committing itself.
2)
We can set a block group readonly when we relocate the block group.
However,
if the readonly block group covers the disk offset where our free space cache
inode is going to write, it will force the free space cache inode into
cow_file_range() and it'll end up hitting a BUG_ON.
3)
Due to the above analysis, we fix this bug by adding the missing dirty flag.
4)
However, it's not over, there is still another case, nospace_cache.
With nospace_cache, we do not want to set dirty flag, instead we just truncate
free space cache inode and bail out with setting cache state DC_WRITTEN.
We can benifit from it since it saves us another 'pre-allocation' part which
usually costs a lot.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-06 17:31:34 +08:00
|
|
|
if (need_clear) {
|
|
|
|
/*
|
|
|
|
* When we mount with old space cache, we need to
|
|
|
|
* set BTRFS_DC_CLEAR and set dirty flag.
|
|
|
|
*
|
|
|
|
* a) Setting 'BTRFS_DC_CLEAR' makes sure that we
|
|
|
|
* truncate the old free space cache inode and
|
|
|
|
* setup a new one.
|
|
|
|
* b) Setting 'dirty flag' makes sure that we flush
|
|
|
|
* the new space cache info onto disk.
|
|
|
|
*/
|
2010-06-22 02:48:16 +08:00
|
|
|
cache->disk_cache_state = BTRFS_DC_CLEAR;
|
Btrfs: fix a bug of writting free space cache during balance
Here is the whole story:
1)
A free space cache consists of two parts:
o free space cache inode, which is special becase it's stored in root tree.
o free space info, which is stored as the above inode's file data.
But we only build up another new inode and does not flush its free space info
onto disk when we _clear and setup_ free space cache, and this ends up with
that the block group cache's cache_state remains DC_SETUP instead of DC_WRITTEN.
And holding DC_SETUP means that we will not truncate this free space cache inode,
which means the disk offset of its file extent will remain _unchanged_ at least
until next transaction finishes committing itself.
2)
We can set a block group readonly when we relocate the block group.
However,
if the readonly block group covers the disk offset where our free space cache
inode is going to write, it will force the free space cache inode into
cow_file_range() and it'll end up hitting a BUG_ON.
3)
Due to the above analysis, we fix this bug by adding the missing dirty flag.
4)
However, it's not over, there is still another case, nospace_cache.
With nospace_cache, we do not want to set dirty flag, instead we just truncate
free space cache inode and bail out with setting cache state DC_WRITTEN.
We can benifit from it since it saves us another 'pre-allocation' part which
usually costs a lot.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-06 17:31:34 +08:00
|
|
|
if (btrfs_test_opt(root, SPACE_CACHE))
|
|
|
|
cache->dirty = 1;
|
|
|
|
}
|
2010-06-22 02:48:16 +08:00
|
|
|
|
2007-10-16 04:14:19 +08:00
|
|
|
read_extent_buffer(leaf, &cache->item,
|
|
|
|
btrfs_item_ptr_offset(leaf, path->slots[0]),
|
|
|
|
sizeof(cache->item));
|
2007-04-27 04:46:15 +08:00
|
|
|
memcpy(&cache->key, &found_key, sizeof(found_key));
|
2008-03-25 03:01:56 +08:00
|
|
|
|
2007-04-27 04:46:15 +08:00
|
|
|
key.objectid = found_key.objectid + found_key.offset;
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
2008-03-25 03:01:56 +08:00
|
|
|
cache->flags = btrfs_block_group_flags(&cache->item);
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
cache->sectorsize = root->sectorsize;
|
2013-01-30 07:40:14 +08:00
|
|
|
cache->full_stripe_len = btrfs_full_stripe_len(root,
|
|
|
|
&root->fs_info->mapping_tree,
|
|
|
|
found_key.objectid);
|
2011-03-29 13:46:06 +08:00
|
|
|
btrfs_init_free_space_ctl(cache);
|
|
|
|
|
2011-02-02 23:53:47 +08:00
|
|
|
/*
|
|
|
|
* We need to exclude the super stripes now so that the space
|
|
|
|
* info has super bytes accounted for, otherwise we'll think
|
|
|
|
* we have more space than we actually do.
|
|
|
|
*/
|
2013-03-20 00:13:25 +08:00
|
|
|
ret = exclude_super_stripes(root, cache);
|
|
|
|
if (ret) {
|
|
|
|
/*
|
|
|
|
* We may have excluded something, so call this just in
|
|
|
|
* case.
|
|
|
|
*/
|
|
|
|
free_excluded_extents(root, cache);
|
|
|
|
kfree(cache->free_space_ctl);
|
|
|
|
kfree(cache);
|
|
|
|
goto error;
|
|
|
|
}
|
2011-02-02 23:53:47 +08:00
|
|
|
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
/*
|
|
|
|
* check for two cases, either we are full, and therefore
|
|
|
|
* don't need to bother with the caching work since we won't
|
|
|
|
* find any space, or we are empty, and we can just add all
|
|
|
|
* the space in and be done with it. This saves us _alot_ of
|
|
|
|
* time, particularly in the full case.
|
|
|
|
*/
|
|
|
|
if (found_key.offset == btrfs_block_group_used(&cache->item)) {
|
2009-09-12 04:11:19 +08:00
|
|
|
cache->last_byte_to_unpin = (u64)-1;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
cache->cached = BTRFS_CACHE_FINISHED;
|
2009-09-12 04:11:20 +08:00
|
|
|
free_excluded_extents(root, cache);
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
} else if (btrfs_block_group_used(&cache->item) == 0) {
|
2009-09-12 04:11:19 +08:00
|
|
|
cache->last_byte_to_unpin = (u64)-1;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
cache->cached = BTRFS_CACHE_FINISHED;
|
|
|
|
add_new_free_space(cache, root->fs_info,
|
|
|
|
found_key.objectid,
|
|
|
|
found_key.objectid +
|
|
|
|
found_key.offset);
|
2009-09-12 04:11:19 +08:00
|
|
|
free_excluded_extents(root, cache);
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
}
|
2007-10-16 04:15:19 +08:00
|
|
|
|
2008-03-25 03:01:59 +08:00
|
|
|
ret = update_space_info(info, cache->flags, found_key.offset,
|
|
|
|
btrfs_block_group_used(&cache->item),
|
|
|
|
&space_info);
|
2012-03-12 23:03:00 +08:00
|
|
|
BUG_ON(ret); /* -ENOMEM */
|
2008-03-25 03:01:59 +08:00
|
|
|
cache->space_info = space_info;
|
2009-09-12 04:11:20 +08:00
|
|
|
spin_lock(&cache->space_info->lock);
|
2010-05-16 22:46:25 +08:00
|
|
|
cache->space_info->bytes_readonly += cache->bytes_super;
|
2009-09-12 04:11:20 +08:00
|
|
|
spin_unlock(&cache->space_info->lock);
|
|
|
|
|
2010-05-16 22:46:24 +08:00
|
|
|
__link_block_group(space_info, cache);
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
|
|
|
|
ret = btrfs_add_block_group_cache(root->fs_info, cache);
|
2012-03-12 23:03:00 +08:00
|
|
|
BUG_ON(ret); /* Logic error */
|
2008-10-01 07:24:06 +08:00
|
|
|
|
|
|
|
set_avail_alloc_bits(root->fs_info, cache->flags);
|
2008-11-18 10:11:30 +08:00
|
|
|
if (btrfs_chunk_readonly(root, cache->key.objectid))
|
2011-07-15 18:34:36 +08:00
|
|
|
set_block_group_ro(cache, 1);
|
2007-04-27 04:46:15 +08:00
|
|
|
}
|
2010-05-16 22:46:24 +08:00
|
|
|
|
|
|
|
list_for_each_entry_rcu(space_info, &root->fs_info->space_info, list) {
|
|
|
|
if (!(get_alloc_profile(root, space_info->flags) &
|
|
|
|
(BTRFS_BLOCK_GROUP_RAID10 |
|
|
|
|
BTRFS_BLOCK_GROUP_RAID1 |
|
2013-01-30 07:40:14 +08:00
|
|
|
BTRFS_BLOCK_GROUP_RAID5 |
|
|
|
|
BTRFS_BLOCK_GROUP_RAID6 |
|
2010-05-16 22:46:24 +08:00
|
|
|
BTRFS_BLOCK_GROUP_DUP)))
|
|
|
|
continue;
|
|
|
|
/*
|
|
|
|
* avoid allocating from un-mirrored block group if there are
|
|
|
|
* mirrored block groups.
|
|
|
|
*/
|
|
|
|
list_for_each_entry(cache, &space_info->block_groups[3], list)
|
2011-07-15 18:34:36 +08:00
|
|
|
set_block_group_ro(cache, 1);
|
2010-05-16 22:46:24 +08:00
|
|
|
list_for_each_entry(cache, &space_info->block_groups[4], list)
|
2011-07-15 18:34:36 +08:00
|
|
|
set_block_group_ro(cache, 1);
|
2007-04-27 04:46:15 +08:00
|
|
|
}
|
2010-05-16 22:46:25 +08:00
|
|
|
|
|
|
|
init_global_block_rsv(info);
|
2008-03-25 03:01:56 +08:00
|
|
|
ret = 0;
|
|
|
|
error:
|
2007-04-27 04:46:15 +08:00
|
|
|
btrfs_free_path(path);
|
2008-03-25 03:01:56 +08:00
|
|
|
return ret;
|
2007-04-27 04:46:15 +08:00
|
|
|
}
|
2008-03-25 03:01:59 +08:00
|
|
|
|
2012-09-12 04:57:25 +08:00
|
|
|
void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root)
|
|
|
|
{
|
|
|
|
struct btrfs_block_group_cache *block_group, *tmp;
|
|
|
|
struct btrfs_root *extent_root = root->fs_info->extent_root;
|
|
|
|
struct btrfs_block_group_item item;
|
|
|
|
struct btrfs_key key;
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
list_for_each_entry_safe(block_group, tmp, &trans->new_bgs,
|
|
|
|
new_bg_list) {
|
|
|
|
list_del_init(&block_group->new_bg_list);
|
|
|
|
|
|
|
|
if (ret)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
spin_lock(&block_group->lock);
|
|
|
|
memcpy(&item, &block_group->item, sizeof(item));
|
|
|
|
memcpy(&key, &block_group->key, sizeof(key));
|
|
|
|
spin_unlock(&block_group->lock);
|
|
|
|
|
|
|
|
ret = btrfs_insert_item(trans, extent_root, &key, &item,
|
|
|
|
sizeof(item));
|
|
|
|
if (ret)
|
|
|
|
btrfs_abort_transaction(trans, extent_root, ret);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2008-03-25 03:01:59 +08:00
|
|
|
int btrfs_make_block_group(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root, u64 bytes_used,
|
2008-04-16 03:41:47 +08:00
|
|
|
u64 type, u64 chunk_objectid, u64 chunk_offset,
|
2008-03-25 03:01:59 +08:00
|
|
|
u64 size)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
struct btrfs_root *extent_root;
|
|
|
|
struct btrfs_block_group_cache *cache;
|
|
|
|
|
|
|
|
extent_root = root->fs_info->extent_root;
|
|
|
|
|
2009-03-24 22:24:20 +08:00
|
|
|
root->fs_info->last_trans_log_full_commit = trans->transid;
|
2008-09-06 04:13:11 +08:00
|
|
|
|
2008-04-26 04:53:30 +08:00
|
|
|
cache = kzalloc(sizeof(*cache), GFP_NOFS);
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
if (!cache)
|
|
|
|
return -ENOMEM;
|
2011-03-29 13:46:06 +08:00
|
|
|
cache->free_space_ctl = kzalloc(sizeof(*cache->free_space_ctl),
|
|
|
|
GFP_NOFS);
|
|
|
|
if (!cache->free_space_ctl) {
|
|
|
|
kfree(cache);
|
|
|
|
return -ENOMEM;
|
|
|
|
}
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
|
2008-04-16 03:41:47 +08:00
|
|
|
cache->key.objectid = chunk_offset;
|
2008-03-25 03:01:59 +08:00
|
|
|
cache->key.offset = size;
|
2008-12-12 05:30:39 +08:00
|
|
|
cache->key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
cache->sectorsize = root->sectorsize;
|
2010-06-22 02:48:16 +08:00
|
|
|
cache->fs_info = root->fs_info;
|
2013-01-30 07:40:14 +08:00
|
|
|
cache->full_stripe_len = btrfs_full_stripe_len(root,
|
|
|
|
&root->fs_info->mapping_tree,
|
|
|
|
chunk_offset);
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
|
2008-12-12 05:30:39 +08:00
|
|
|
atomic_set(&cache->count, 1);
|
2008-07-23 11:06:41 +08:00
|
|
|
spin_lock_init(&cache->lock);
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
INIT_LIST_HEAD(&cache->list);
|
2009-04-03 21:47:43 +08:00
|
|
|
INIT_LIST_HEAD(&cache->cluster_list);
|
2012-09-12 04:57:25 +08:00
|
|
|
INIT_LIST_HEAD(&cache->new_bg_list);
|
2008-05-25 02:04:53 +08:00
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
btrfs_init_free_space_ctl(cache);
|
|
|
|
|
2008-03-25 03:01:59 +08:00
|
|
|
btrfs_set_block_group_used(&cache->item, bytes_used);
|
|
|
|
btrfs_set_block_group_chunk_objectid(&cache->item, chunk_objectid);
|
|
|
|
cache->flags = type;
|
|
|
|
btrfs_set_block_group_flags(&cache->item, type);
|
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
cache->last_byte_to_unpin = (u64)-1;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
cache->cached = BTRFS_CACHE_FINISHED;
|
2013-03-20 00:13:25 +08:00
|
|
|
ret = exclude_super_stripes(root, cache);
|
|
|
|
if (ret) {
|
|
|
|
/*
|
|
|
|
* We may have excluded something, so call this just in
|
|
|
|
* case.
|
|
|
|
*/
|
|
|
|
free_excluded_extents(root, cache);
|
|
|
|
kfree(cache->free_space_ctl);
|
|
|
|
kfree(cache);
|
|
|
|
return ret;
|
|
|
|
}
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
add_new_free_space(cache, root->fs_info, chunk_offset,
|
|
|
|
chunk_offset + size);
|
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
free_excluded_extents(root, cache);
|
|
|
|
|
2008-03-25 03:01:59 +08:00
|
|
|
ret = update_space_info(root->fs_info, cache->flags, size, bytes_used,
|
|
|
|
&cache->space_info);
|
2012-03-12 23:03:00 +08:00
|
|
|
BUG_ON(ret); /* -ENOMEM */
|
2011-12-07 10:39:22 +08:00
|
|
|
update_global_block_rsv(root->fs_info);
|
2009-09-12 04:11:20 +08:00
|
|
|
|
|
|
|
spin_lock(&cache->space_info->lock);
|
2010-05-16 22:46:25 +08:00
|
|
|
cache->space_info->bytes_readonly += cache->bytes_super;
|
2009-09-12 04:11:20 +08:00
|
|
|
spin_unlock(&cache->space_info->lock);
|
|
|
|
|
2010-05-16 22:46:24 +08:00
|
|
|
__link_block_group(cache->space_info, cache);
|
2008-03-25 03:01:59 +08:00
|
|
|
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
ret = btrfs_add_block_group_cache(root->fs_info, cache);
|
2012-03-12 23:03:00 +08:00
|
|
|
BUG_ON(ret); /* Logic error */
|
2008-07-23 11:06:41 +08:00
|
|
|
|
2012-09-12 04:57:25 +08:00
|
|
|
list_add_tail(&cache->new_bg_list, &trans->new_bgs);
|
2008-03-25 03:01:59 +08:00
|
|
|
|
2008-04-05 03:40:00 +08:00
|
|
|
set_avail_alloc_bits(extent_root->fs_info, type);
|
2008-06-26 04:01:30 +08:00
|
|
|
|
2008-03-25 03:01:59 +08:00
|
|
|
return 0;
|
|
|
|
}
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
|
2012-01-17 04:04:47 +08:00
|
|
|
static void clear_avail_alloc_bits(struct btrfs_fs_info *fs_info, u64 flags)
|
|
|
|
{
|
2012-03-27 22:09:16 +08:00
|
|
|
u64 extra_flags = chunk_to_extended(flags) &
|
|
|
|
BTRFS_EXTENDED_PROFILE_MASK;
|
2012-01-17 04:04:47 +08:00
|
|
|
|
2013-01-29 18:13:12 +08:00
|
|
|
write_seqlock(&fs_info->profiles_lock);
|
2012-01-17 04:04:47 +08:00
|
|
|
if (flags & BTRFS_BLOCK_GROUP_DATA)
|
|
|
|
fs_info->avail_data_alloc_bits &= ~extra_flags;
|
|
|
|
if (flags & BTRFS_BLOCK_GROUP_METADATA)
|
|
|
|
fs_info->avail_metadata_alloc_bits &= ~extra_flags;
|
|
|
|
if (flags & BTRFS_BLOCK_GROUP_SYSTEM)
|
|
|
|
fs_info->avail_system_alloc_bits &= ~extra_flags;
|
2013-01-29 18:13:12 +08:00
|
|
|
write_sequnlock(&fs_info->profiles_lock);
|
2012-01-17 04:04:47 +08:00
|
|
|
}
|
|
|
|
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root, u64 group_start)
|
|
|
|
{
|
|
|
|
struct btrfs_path *path;
|
|
|
|
struct btrfs_block_group_cache *block_group;
|
2009-06-05 03:34:51 +08:00
|
|
|
struct btrfs_free_cluster *cluster;
|
2010-06-22 02:48:16 +08:00
|
|
|
struct btrfs_root *tree_root = root->fs_info->tree_root;
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
struct btrfs_key key;
|
2010-06-22 02:48:16 +08:00
|
|
|
struct inode *inode;
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
int ret;
|
2012-01-17 04:04:47 +08:00
|
|
|
int index;
|
2010-10-15 02:52:27 +08:00
|
|
|
int factor;
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
|
|
|
|
root = root->fs_info->extent_root;
|
|
|
|
|
|
|
|
block_group = btrfs_lookup_block_group(root->fs_info, group_start);
|
|
|
|
BUG_ON(!block_group);
|
2008-11-13 03:34:12 +08:00
|
|
|
BUG_ON(!block_group->ro);
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
|
2011-03-07 10:13:33 +08:00
|
|
|
/*
|
|
|
|
* Free the reserved super bytes from this block group before
|
|
|
|
* remove it.
|
|
|
|
*/
|
|
|
|
free_excluded_extents(root, block_group);
|
|
|
|
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
memcpy(&key, &block_group->key, sizeof(key));
|
2012-01-17 04:04:47 +08:00
|
|
|
index = get_block_group_index(block_group);
|
2010-10-15 02:52:27 +08:00
|
|
|
if (block_group->flags & (BTRFS_BLOCK_GROUP_DUP |
|
|
|
|
BTRFS_BLOCK_GROUP_RAID1 |
|
|
|
|
BTRFS_BLOCK_GROUP_RAID10))
|
|
|
|
factor = 2;
|
|
|
|
else
|
|
|
|
factor = 1;
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
|
2009-06-05 03:34:51 +08:00
|
|
|
/* make sure this block group isn't part of an allocation cluster */
|
|
|
|
cluster = &root->fs_info->data_alloc_cluster;
|
|
|
|
spin_lock(&cluster->refill_lock);
|
|
|
|
btrfs_return_cluster_to_free_space(block_group, cluster);
|
|
|
|
spin_unlock(&cluster->refill_lock);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* make sure this block group isn't part of a metadata
|
|
|
|
* allocation cluster
|
|
|
|
*/
|
|
|
|
cluster = &root->fs_info->meta_alloc_cluster;
|
|
|
|
spin_lock(&cluster->refill_lock);
|
|
|
|
btrfs_return_cluster_to_free_space(block_group, cluster);
|
|
|
|
spin_unlock(&cluster->refill_lock);
|
|
|
|
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
path = btrfs_alloc_path();
|
btrfs: don't BUG_ON btrfs_alloc_path() errors
This patch fixes many callers of btrfs_alloc_path() which BUG_ON allocation
failure. All the sites that are fixed in this patch were checked by me to
be fairly trivial to fix because of at least one of two criteria:
- Callers of the function catch errors from it already so bubbling the
error up will be handled.
- Callers of the function might BUG_ON any nonzero return code in which
case there is no behavior changed (but we still got to remove a BUG_ON)
The following functions were updated:
btrfs_lookup_extent, alloc_reserved_tree_block, btrfs_remove_block_group,
btrfs_lookup_csums_range, btrfs_csum_file_blocks, btrfs_mark_extent_written,
btrfs_inode_by_name, btrfs_new_inode, btrfs_symlink,
insert_reserved_file_extent, and run_delalloc_nocow
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2011-07-14 01:38:47 +08:00
|
|
|
if (!path) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
goto out;
|
|
|
|
}
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
|
2011-10-02 18:56:53 +08:00
|
|
|
inode = lookup_free_space_inode(tree_root, block_group, path);
|
2010-06-22 02:48:16 +08:00
|
|
|
if (!IS_ERR(inode)) {
|
2011-07-19 15:27:20 +08:00
|
|
|
ret = btrfs_orphan_add(trans, inode);
|
2012-03-12 23:03:00 +08:00
|
|
|
if (ret) {
|
|
|
|
btrfs_add_delayed_iput(inode);
|
|
|
|
goto out;
|
|
|
|
}
|
2010-06-22 02:48:16 +08:00
|
|
|
clear_nlink(inode);
|
|
|
|
/* One for the block groups ref */
|
|
|
|
spin_lock(&block_group->lock);
|
|
|
|
if (block_group->iref) {
|
|
|
|
block_group->iref = 0;
|
|
|
|
block_group->inode = NULL;
|
|
|
|
spin_unlock(&block_group->lock);
|
|
|
|
iput(inode);
|
|
|
|
} else {
|
|
|
|
spin_unlock(&block_group->lock);
|
|
|
|
}
|
|
|
|
/* One for our lookup ref */
|
2011-09-20 00:26:24 +08:00
|
|
|
btrfs_add_delayed_iput(inode);
|
2010-06-22 02:48:16 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
key.objectid = BTRFS_FREE_SPACE_OBJECTID;
|
|
|
|
key.offset = block_group->key.objectid;
|
|
|
|
key.type = 0;
|
|
|
|
|
|
|
|
ret = btrfs_search_slot(trans, tree_root, &key, path, -1, 1);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
if (ret > 0)
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
2010-06-22 02:48:16 +08:00
|
|
|
if (ret == 0) {
|
|
|
|
ret = btrfs_del_item(trans, tree_root, path);
|
|
|
|
if (ret)
|
|
|
|
goto out;
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
2010-06-22 02:48:16 +08:00
|
|
|
}
|
|
|
|
|
2009-01-21 23:49:16 +08:00
|
|
|
spin_lock(&root->fs_info->block_group_cache_lock);
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
rb_erase(&block_group->cache_node,
|
|
|
|
&root->fs_info->block_group_cache_tree);
|
2012-12-27 17:01:23 +08:00
|
|
|
|
|
|
|
if (root->fs_info->first_logical_byte == block_group->key.objectid)
|
|
|
|
root->fs_info->first_logical_byte = (u64)-1;
|
2009-01-21 23:49:16 +08:00
|
|
|
spin_unlock(&root->fs_info->block_group_cache_lock);
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
|
Btrfs: fix enospc when there is plenty of space
So there is an odd case where we can possibly return -ENOSPC when there is in
fact space to be had. It only happens with Metadata writes, and happens _very_
infrequently. What has to happen is we have to allocate have allocated out of
the first logical byte on the disk, which would set last_alloc to
first_logical_byte(root, 0), so search_start == orig_search_start. We then
need to allocate for normal metadata, so BTRFS_BLOCK_GROUP_METADATA |
BTRFS_BLOCK_GROUP_DUP. We will do a block lookup for the given search_start,
block_group_bits() won't match and we'll go to choose another block group.
However because search_start matches orig_search_start we go to see if we can
allocate a chunk.
If we are in the situation that we cannot allocate a chunk, we fail and ENOSPC.
This is kind of a big flaw of the way find_free_extent works, as it along with
find_free_space loop through _all_ of the block groups, not just the ones that
we want to allocate out of. This patch completely kills find_free_space and
rolls it into find_free_extent. I've introduced a sort of state machine into
this, which will make it easier to get cache miss information out of the
allocator, and will work well with my locking changes.
The basic flow is this: We have the variable loop which is 0, meaning we are
in the hint phase. We lookup the block group for the hint, and lookup the
space_info for what we want to allocate out of. If the block group we were
pointed at by the hint either isn't of the correct type, or just doesn't have
the space we need, we set head to space_info->block_groups, so we start at the
beginning of the block groups for this particular space info, and loop through.
This is also where we add the empty_cluster to total_needed. At this point
loop is set to 1 and we just loop through all of the block groups for this
particular space_info looking for the space we need, just as find_free_space
would have done, except we only hit the block groups we want and not _all_ of
the block groups. If we come full circle we see if we can allocate a chunk.
If we cannot of course we exit with -ENOSPC and we are good. If not we start
over at space_info->block_groups and loop through again, with loop == 2. If we
come full circle and haven't found what we need then we exit with -ENOSPC.
I've been running this for a couple of days now and it seems stable, and I
haven't yet hit a -ENOSPC when there was plenty of space left.
Also I've made a groups_sem to handle the group list for the space_info. This
is part of my locking changes, but is relatively safe and seems better than
holding the space_info spinlock over that entire search time. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
down_write(&block_group->space_info->groups_sem);
|
2009-06-05 03:34:51 +08:00
|
|
|
/*
|
|
|
|
* we must use list_del_init so people can check to see if they
|
|
|
|
* are still on the list after taking the semaphore
|
|
|
|
*/
|
|
|
|
list_del_init(&block_group->list);
|
2012-01-17 04:04:47 +08:00
|
|
|
if (list_empty(&block_group->space_info->block_groups[index]))
|
|
|
|
clear_avail_alloc_bits(root->fs_info, block_group->flags);
|
Btrfs: fix enospc when there is plenty of space
So there is an odd case where we can possibly return -ENOSPC when there is in
fact space to be had. It only happens with Metadata writes, and happens _very_
infrequently. What has to happen is we have to allocate have allocated out of
the first logical byte on the disk, which would set last_alloc to
first_logical_byte(root, 0), so search_start == orig_search_start. We then
need to allocate for normal metadata, so BTRFS_BLOCK_GROUP_METADATA |
BTRFS_BLOCK_GROUP_DUP. We will do a block lookup for the given search_start,
block_group_bits() won't match and we'll go to choose another block group.
However because search_start matches orig_search_start we go to see if we can
allocate a chunk.
If we are in the situation that we cannot allocate a chunk, we fail and ENOSPC.
This is kind of a big flaw of the way find_free_extent works, as it along with
find_free_space loop through _all_ of the block groups, not just the ones that
we want to allocate out of. This patch completely kills find_free_space and
rolls it into find_free_extent. I've introduced a sort of state machine into
this, which will make it easier to get cache miss information out of the
allocator, and will work well with my locking changes.
The basic flow is this: We have the variable loop which is 0, meaning we are
in the hint phase. We lookup the block group for the hint, and lookup the
space_info for what we want to allocate out of. If the block group we were
pointed at by the hint either isn't of the correct type, or just doesn't have
the space we need, we set head to space_info->block_groups, so we start at the
beginning of the block groups for this particular space info, and loop through.
This is also where we add the empty_cluster to total_needed. At this point
loop is set to 1 and we just loop through all of the block groups for this
particular space_info looking for the space we need, just as find_free_space
would have done, except we only hit the block groups we want and not _all_ of
the block groups. If we come full circle we see if we can allocate a chunk.
If we cannot of course we exit with -ENOSPC and we are good. If not we start
over at space_info->block_groups and loop through again, with loop == 2. If we
come full circle and haven't found what we need then we exit with -ENOSPC.
I've been running this for a couple of days now and it seems stable, and I
haven't yet hit a -ENOSPC when there was plenty of space left.
Also I've made a groups_sem to handle the group list for the space_info. This
is part of my locking changes, but is relatively safe and seems better than
holding the space_info spinlock over that entire search time. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
up_write(&block_group->space_info->groups_sem);
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
if (block_group->cached == BTRFS_CACHE_STARTED)
|
2009-09-12 04:11:19 +08:00
|
|
|
wait_block_group_cache_done(block_group);
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
|
|
|
|
btrfs_remove_free_space_cache(block_group);
|
|
|
|
|
2008-11-13 03:34:12 +08:00
|
|
|
spin_lock(&block_group->space_info->lock);
|
|
|
|
block_group->space_info->total_bytes -= block_group->key.offset;
|
|
|
|
block_group->space_info->bytes_readonly -= block_group->key.offset;
|
2010-10-15 02:52:27 +08:00
|
|
|
block_group->space_info->disk_total -= block_group->key.offset * factor;
|
2008-11-13 03:34:12 +08:00
|
|
|
spin_unlock(&block_group->space_info->lock);
|
2009-07-25 04:30:55 +08:00
|
|
|
|
2010-06-22 02:48:16 +08:00
|
|
|
memcpy(&key, &block_group->key, sizeof(key));
|
|
|
|
|
2009-07-25 04:30:55 +08:00
|
|
|
btrfs_clear_space_info_full(root->fs_info);
|
2008-11-13 03:34:12 +08:00
|
|
|
|
2009-04-03 21:47:43 +08:00
|
|
|
btrfs_put_block_group(block_group);
|
|
|
|
btrfs_put_block_group(block_group);
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
|
|
|
|
ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
|
|
|
|
if (ret > 0)
|
|
|
|
ret = -EIO;
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
ret = btrfs_del_item(trans, root, path);
|
|
|
|
out:
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return ret;
|
|
|
|
}
|
2011-01-06 19:30:25 +08:00
|
|
|
|
2011-03-07 10:13:14 +08:00
|
|
|
int btrfs_init_space_info(struct btrfs_fs_info *fs_info)
|
|
|
|
{
|
|
|
|
struct btrfs_space_info *space_info;
|
2011-04-08 16:44:37 +08:00
|
|
|
struct btrfs_super_block *disk_super;
|
|
|
|
u64 features;
|
|
|
|
u64 flags;
|
|
|
|
int mixed = 0;
|
2011-03-07 10:13:14 +08:00
|
|
|
int ret;
|
|
|
|
|
2011-04-13 21:41:04 +08:00
|
|
|
disk_super = fs_info->super_copy;
|
2011-04-08 16:44:37 +08:00
|
|
|
if (!btrfs_super_root(disk_super))
|
|
|
|
return 1;
|
2011-03-07 10:13:14 +08:00
|
|
|
|
2011-04-08 16:44:37 +08:00
|
|
|
features = btrfs_super_incompat_flags(disk_super);
|
|
|
|
if (features & BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS)
|
|
|
|
mixed = 1;
|
2011-03-07 10:13:14 +08:00
|
|
|
|
2011-04-08 16:44:37 +08:00
|
|
|
flags = BTRFS_BLOCK_GROUP_SYSTEM;
|
|
|
|
ret = update_space_info(fs_info, flags, 0, 0, &space_info);
|
2011-03-07 10:13:14 +08:00
|
|
|
if (ret)
|
2011-04-08 16:44:37 +08:00
|
|
|
goto out;
|
2011-03-07 10:13:14 +08:00
|
|
|
|
2011-04-08 16:44:37 +08:00
|
|
|
if (mixed) {
|
|
|
|
flags = BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_DATA;
|
|
|
|
ret = update_space_info(fs_info, flags, 0, 0, &space_info);
|
|
|
|
} else {
|
|
|
|
flags = BTRFS_BLOCK_GROUP_METADATA;
|
|
|
|
ret = update_space_info(fs_info, flags, 0, 0, &space_info);
|
|
|
|
if (ret)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
flags = BTRFS_BLOCK_GROUP_DATA;
|
|
|
|
ret = update_space_info(fs_info, flags, 0, 0, &space_info);
|
|
|
|
}
|
|
|
|
out:
|
2011-03-07 10:13:14 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2011-01-06 19:30:25 +08:00
|
|
|
int btrfs_error_unpin_extent_range(struct btrfs_root *root, u64 start, u64 end)
|
|
|
|
{
|
|
|
|
return unpin_extent_range(root, start, end);
|
|
|
|
}
|
|
|
|
|
|
|
|
int btrfs_error_discard_extent(struct btrfs_root *root, u64 bytenr,
|
2011-03-24 18:24:27 +08:00
|
|
|
u64 num_bytes, u64 *actual_bytes)
|
2011-01-06 19:30:25 +08:00
|
|
|
{
|
2011-03-24 18:24:27 +08:00
|
|
|
return btrfs_discard_extent(root, bytenr, num_bytes, actual_bytes);
|
2011-01-06 19:30:25 +08:00
|
|
|
}
|
2011-03-24 18:24:28 +08:00
|
|
|
|
|
|
|
int btrfs_trim_fs(struct btrfs_root *root, struct fstrim_range *range)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = root->fs_info;
|
|
|
|
struct btrfs_block_group_cache *cache = NULL;
|
|
|
|
u64 group_trimmed;
|
|
|
|
u64 start;
|
|
|
|
u64 end;
|
|
|
|
u64 trimmed = 0;
|
2012-02-09 18:17:41 +08:00
|
|
|
u64 total_bytes = btrfs_super_total_bytes(fs_info->super_copy);
|
2011-03-24 18:24:28 +08:00
|
|
|
int ret = 0;
|
|
|
|
|
2012-02-09 18:17:41 +08:00
|
|
|
/*
|
|
|
|
* try to trim all FS space, our block group may start from non-zero.
|
|
|
|
*/
|
|
|
|
if (range->len == total_bytes)
|
|
|
|
cache = btrfs_lookup_first_block_group(fs_info, range->start);
|
|
|
|
else
|
|
|
|
cache = btrfs_lookup_block_group(fs_info, range->start);
|
2011-03-24 18:24:28 +08:00
|
|
|
|
|
|
|
while (cache) {
|
|
|
|
if (cache->key.objectid >= (range->start + range->len)) {
|
|
|
|
btrfs_put_block_group(cache);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
start = max(range->start, cache->key.objectid);
|
|
|
|
end = min(range->start + range->len,
|
|
|
|
cache->key.objectid + cache->key.offset);
|
|
|
|
|
|
|
|
if (end - start >= range->minlen) {
|
|
|
|
if (!block_group_cache_done(cache)) {
|
2012-12-27 17:01:18 +08:00
|
|
|
ret = cache_block_group(cache, 0);
|
2011-03-24 18:24:28 +08:00
|
|
|
if (!ret)
|
|
|
|
wait_block_group_cache_done(cache);
|
|
|
|
}
|
|
|
|
ret = btrfs_trim_block_group(cache,
|
|
|
|
&group_trimmed,
|
|
|
|
start,
|
|
|
|
end,
|
|
|
|
range->minlen);
|
|
|
|
|
|
|
|
trimmed += group_trimmed;
|
|
|
|
if (ret) {
|
|
|
|
btrfs_put_block_group(cache);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
cache = next_block_group(fs_info->tree_root, cache);
|
|
|
|
}
|
|
|
|
|
|
|
|
range->len = trimmed;
|
|
|
|
return ret;
|
|
|
|
}
|