linux/fs/btrfs/delayed-ref.h

429 lines
12 KiB
C
Raw Normal View History

/* SPDX-License-Identifier: GPL-2.0 */
Btrfs: do extent allocation and reference count updates in the background The extent allocation tree maintains a reference count and full back reference information for every extent allocated in the filesystem. For subvolume and snapshot trees, every time a block goes through COW, the new copy of the block adds a reference on every block it points to. If a btree node points to 150 leaves, then the COW code needs to go and add backrefs on 150 different extents, which might be spread all over the extent allocation tree. These updates currently happen during btrfs_cow_block, and most COWs happen during btrfs_search_slot. btrfs_search_slot has locks held on both the parent and the node we are COWing, and so we really want to avoid IO during the COW if we can. This commit adds an rbtree of pending reference count updates and extent allocations. The tree is ordered by byte number of the extent and byte number of the parent for the back reference. The tree allows us to: 1) Modify back references in something close to disk order, reducing seeks 2) Significantly reduce the number of modifications made as block pointers are balanced around 3) Do all of the extent insertion and back reference modifications outside of the performance critical btrfs_search_slot code. #3 has the added benefit of greatly reducing the btrfs stack footprint. The extent allocation tree modifications are done without the deep (and somewhat recursive) call chains used in the past. These delayed back reference updates must be done before the transaction commits, and so the rbtree is tied to the transaction. Throttling is implemented to help keep the queue of backrefs at a reasonable size. Since there was a similar mechanism in place for the extent tree extents, that is removed and replaced by the delayed reference tree. Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-13 22:10:06 +08:00
/*
* Copyright (C) 2008 Oracle. All rights reserved.
*/
#ifndef BTRFS_DELAYED_REF_H
#define BTRFS_DELAYED_REF_H
Btrfs: do extent allocation and reference count updates in the background The extent allocation tree maintains a reference count and full back reference information for every extent allocated in the filesystem. For subvolume and snapshot trees, every time a block goes through COW, the new copy of the block adds a reference on every block it points to. If a btree node points to 150 leaves, then the COW code needs to go and add backrefs on 150 different extents, which might be spread all over the extent allocation tree. These updates currently happen during btrfs_cow_block, and most COWs happen during btrfs_search_slot. btrfs_search_slot has locks held on both the parent and the node we are COWing, and so we really want to avoid IO during the COW if we can. This commit adds an rbtree of pending reference count updates and extent allocations. The tree is ordered by byte number of the extent and byte number of the parent for the back reference. The tree allows us to: 1) Modify back references in something close to disk order, reducing seeks 2) Significantly reduce the number of modifications made as block pointers are balanced around 3) Do all of the extent insertion and back reference modifications outside of the performance critical btrfs_search_slot code. #3 has the added benefit of greatly reducing the btrfs stack footprint. The extent allocation tree modifications are done without the deep (and somewhat recursive) call chains used in the past. These delayed back reference updates must be done before the transaction commits, and so the rbtree is tied to the transaction. Throttling is implemented to help keep the queue of backrefs at a reasonable size. Since there was a similar mechanism in place for the extent tree extents, that is removed and replaced by the delayed reference tree. Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-13 22:10:06 +08:00
#include <linux/types.h>
#include <linux/refcount.h>
#include <linux/list.h>
#include <linux/rbtree.h>
#include <linux/mutex.h>
#include <linux/spinlock.h>
#include <linux/slab.h>
#include <uapi/linux/btrfs_tree.h>
struct btrfs_trans_handle;
struct btrfs_fs_info;
/* these are the possible values of struct btrfs_delayed_ref_node->action */
enum btrfs_delayed_ref_action {
/* Add one backref to the tree */
BTRFS_ADD_DELAYED_REF = 1,
/* Delete one backref from the tree */
BTRFS_DROP_DELAYED_REF,
/* Record a full extent allocation */
BTRFS_ADD_DELAYED_EXTENT,
/* Not changing ref count on head ref */
BTRFS_UPDATE_DELAYED_HEAD,
} __packed;
Btrfs: do extent allocation and reference count updates in the background The extent allocation tree maintains a reference count and full back reference information for every extent allocated in the filesystem. For subvolume and snapshot trees, every time a block goes through COW, the new copy of the block adds a reference on every block it points to. If a btree node points to 150 leaves, then the COW code needs to go and add backrefs on 150 different extents, which might be spread all over the extent allocation tree. These updates currently happen during btrfs_cow_block, and most COWs happen during btrfs_search_slot. btrfs_search_slot has locks held on both the parent and the node we are COWing, and so we really want to avoid IO during the COW if we can. This commit adds an rbtree of pending reference count updates and extent allocations. The tree is ordered by byte number of the extent and byte number of the parent for the back reference. The tree allows us to: 1) Modify back references in something close to disk order, reducing seeks 2) Significantly reduce the number of modifications made as block pointers are balanced around 3) Do all of the extent insertion and back reference modifications outside of the performance critical btrfs_search_slot code. #3 has the added benefit of greatly reducing the btrfs stack footprint. The extent allocation tree modifications are done without the deep (and somewhat recursive) call chains used in the past. These delayed back reference updates must be done before the transaction commits, and so the rbtree is tied to the transaction. Throttling is implemented to help keep the queue of backrefs at a reasonable size. Since there was a similar mechanism in place for the extent tree extents, that is removed and replaced by the delayed reference tree. Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-13 22:10:06 +08:00
struct btrfs_data_ref {
/* For EXTENT_DATA_REF */
/* Inode which refers to this data extent */
u64 objectid;
/*
* file_offset - extent_offset
*
* file_offset is the key.offset of the EXTENT_DATA key.
* extent_offset is btrfs_file_extent_offset() of the EXTENT_DATA data.
*/
u64 offset;
};
struct btrfs_tree_ref {
/*
* Level of this tree block.
*
* Shared for skinny (TREE_BLOCK_REF) and normal tree ref.
*/
int level;
/* For non-skinny metadata, no special member needed */
};
Btrfs: do extent allocation and reference count updates in the background The extent allocation tree maintains a reference count and full back reference information for every extent allocated in the filesystem. For subvolume and snapshot trees, every time a block goes through COW, the new copy of the block adds a reference on every block it points to. If a btree node points to 150 leaves, then the COW code needs to go and add backrefs on 150 different extents, which might be spread all over the extent allocation tree. These updates currently happen during btrfs_cow_block, and most COWs happen during btrfs_search_slot. btrfs_search_slot has locks held on both the parent and the node we are COWing, and so we really want to avoid IO during the COW if we can. This commit adds an rbtree of pending reference count updates and extent allocations. The tree is ordered by byte number of the extent and byte number of the parent for the back reference. The tree allows us to: 1) Modify back references in something close to disk order, reducing seeks 2) Significantly reduce the number of modifications made as block pointers are balanced around 3) Do all of the extent insertion and back reference modifications outside of the performance critical btrfs_search_slot code. #3 has the added benefit of greatly reducing the btrfs stack footprint. The extent allocation tree modifications are done without the deep (and somewhat recursive) call chains used in the past. These delayed back reference updates must be done before the transaction commits, and so the rbtree is tied to the transaction. Throttling is implemented to help keep the queue of backrefs at a reasonable size. Since there was a similar mechanism in place for the extent tree extents, that is removed and replaced by the delayed reference tree. Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-13 22:10:06 +08:00
struct btrfs_delayed_ref_node {
struct rb_node ref_node;
btrfs: improve delayed refs iterations This issue was found when I tried to delete a heavily reflinked file, when deleting such files, other transaction operation will not have a chance to make progress, for example, start_transaction() will blocked in wait_current_trans(root) for long time, sometimes it even triggers soft lockups, and the time taken to delete such heavily reflinked file is also very large, often hundreds of seconds. Using perf top, it reports that: PerfTop: 7416 irqs/sec kernel:99.8% exact: 0.0% [4000Hz cpu-clock], (all, 4 CPUs) --------------------------------------------------------------------------------------- 84.37% [btrfs] [k] __btrfs_run_delayed_refs.constprop.80 11.02% [kernel] [k] delay_tsc 0.79% [kernel] [k] _raw_spin_unlock_irq 0.78% [kernel] [k] _raw_spin_unlock_irqrestore 0.45% [kernel] [k] do_raw_spin_lock 0.18% [kernel] [k] __slab_alloc It seems __btrfs_run_delayed_refs() took most cpu time, after some debug work, I found it's select_delayed_ref() causing this issue, for a delayed head, in our case, it'll be full of BTRFS_DROP_DELAYED_REF nodes, but select_delayed_ref() will firstly try to iterate node list to find BTRFS_ADD_DELAYED_REF nodes, obviously it's a disaster in this case, and waste much time. To fix this issue, we introduce a new ref_add_list in struct btrfs_delayed_ref_head, then in select_delayed_ref(), if this list is not empty, we can directly use nodes in this list. With this patch, it just took about 10~15 seconds to delte the same file. Now using perf top, it reports that: PerfTop: 2734 irqs/sec kernel:99.5% exact: 0.0% [4000Hz cpu-clock], (all, 4 CPUs) ---------------------------------------------------------------------------------------- 20.74% [kernel] [k] _raw_spin_unlock_irqrestore 16.33% [kernel] [k] __slab_alloc 5.41% [kernel] [k] lock_acquired 4.42% [kernel] [k] lock_acquire 4.05% [kernel] [k] lock_release 3.37% [kernel] [k] _raw_spin_unlock_irq For normal files, this patch also gives help, at least we do not need to iterate whole list to found BTRFS_ADD_DELAYED_REF nodes. Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-10-26 18:07:33 +08:00
/*
* If action is BTRFS_ADD_DELAYED_REF, also link this node to
* ref_head->ref_add_list, then we do not need to iterate the
* whole ref_head->ref_list to find BTRFS_ADD_DELAYED_REF nodes.
*/
struct list_head add_list;
Btrfs: do extent allocation and reference count updates in the background The extent allocation tree maintains a reference count and full back reference information for every extent allocated in the filesystem. For subvolume and snapshot trees, every time a block goes through COW, the new copy of the block adds a reference on every block it points to. If a btree node points to 150 leaves, then the COW code needs to go and add backrefs on 150 different extents, which might be spread all over the extent allocation tree. These updates currently happen during btrfs_cow_block, and most COWs happen during btrfs_search_slot. btrfs_search_slot has locks held on both the parent and the node we are COWing, and so we really want to avoid IO during the COW if we can. This commit adds an rbtree of pending reference count updates and extent allocations. The tree is ordered by byte number of the extent and byte number of the parent for the back reference. The tree allows us to: 1) Modify back references in something close to disk order, reducing seeks 2) Significantly reduce the number of modifications made as block pointers are balanced around 3) Do all of the extent insertion and back reference modifications outside of the performance critical btrfs_search_slot code. #3 has the added benefit of greatly reducing the btrfs stack footprint. The extent allocation tree modifications are done without the deep (and somewhat recursive) call chains used in the past. These delayed back reference updates must be done before the transaction commits, and so the rbtree is tied to the transaction. Throttling is implemented to help keep the queue of backrefs at a reasonable size. Since there was a similar mechanism in place for the extent tree extents, that is removed and replaced by the delayed reference tree. Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-13 22:10:06 +08:00
/* the starting bytenr of the extent */
u64 bytenr;
/* the size of the extent */
u64 num_bytes;
/* seq number to keep track of insertion order */
u64 seq;
/* The ref_root for this ref */
u64 ref_root;
/*
* The parent for this ref, if this isn't set the ref_root is the
* reference owner.
*/
u64 parent;
Btrfs: do extent allocation and reference count updates in the background The extent allocation tree maintains a reference count and full back reference information for every extent allocated in the filesystem. For subvolume and snapshot trees, every time a block goes through COW, the new copy of the block adds a reference on every block it points to. If a btree node points to 150 leaves, then the COW code needs to go and add backrefs on 150 different extents, which might be spread all over the extent allocation tree. These updates currently happen during btrfs_cow_block, and most COWs happen during btrfs_search_slot. btrfs_search_slot has locks held on both the parent and the node we are COWing, and so we really want to avoid IO during the COW if we can. This commit adds an rbtree of pending reference count updates and extent allocations. The tree is ordered by byte number of the extent and byte number of the parent for the back reference. The tree allows us to: 1) Modify back references in something close to disk order, reducing seeks 2) Significantly reduce the number of modifications made as block pointers are balanced around 3) Do all of the extent insertion and back reference modifications outside of the performance critical btrfs_search_slot code. #3 has the added benefit of greatly reducing the btrfs stack footprint. The extent allocation tree modifications are done without the deep (and somewhat recursive) call chains used in the past. These delayed back reference updates must be done before the transaction commits, and so the rbtree is tied to the transaction. Throttling is implemented to help keep the queue of backrefs at a reasonable size. Since there was a similar mechanism in place for the extent tree extents, that is removed and replaced by the delayed reference tree. Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-13 22:10:06 +08:00
/* ref count on this data structure */
refcount_t refs;
Btrfs: do extent allocation and reference count updates in the background The extent allocation tree maintains a reference count and full back reference information for every extent allocated in the filesystem. For subvolume and snapshot trees, every time a block goes through COW, the new copy of the block adds a reference on every block it points to. If a btree node points to 150 leaves, then the COW code needs to go and add backrefs on 150 different extents, which might be spread all over the extent allocation tree. These updates currently happen during btrfs_cow_block, and most COWs happen during btrfs_search_slot. btrfs_search_slot has locks held on both the parent and the node we are COWing, and so we really want to avoid IO during the COW if we can. This commit adds an rbtree of pending reference count updates and extent allocations. The tree is ordered by byte number of the extent and byte number of the parent for the back reference. The tree allows us to: 1) Modify back references in something close to disk order, reducing seeks 2) Significantly reduce the number of modifications made as block pointers are balanced around 3) Do all of the extent insertion and back reference modifications outside of the performance critical btrfs_search_slot code. #3 has the added benefit of greatly reducing the btrfs stack footprint. The extent allocation tree modifications are done without the deep (and somewhat recursive) call chains used in the past. These delayed back reference updates must be done before the transaction commits, and so the rbtree is tied to the transaction. Throttling is implemented to help keep the queue of backrefs at a reasonable size. Since there was a similar mechanism in place for the extent tree extents, that is removed and replaced by the delayed reference tree. Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-13 22:10:06 +08:00
/*
* how many refs is this entry adding or deleting. For
* head refs, this may be a negative number because it is keeping
* track of the total mods done to the reference count.
* For individual refs, this will always be a positive number
*
* It may be more than one, since it is possible for a single
* parent to have more than one ref on an extent
*/
int ref_mod;
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE) This commit introduces a new kind of back reference for btrfs metadata. Once a filesystem has been mounted with this commit, IT WILL NO LONGER BE MOUNTABLE BY OLDER KERNELS. When a tree block in subvolume tree is cow'd, the reference counts of all extents it points to are increased by one. At transaction commit time, the old root of the subvolume is recorded in a "dead root" data structure, and the btree it points to is later walked, dropping reference counts and freeing any blocks where the reference count goes to 0. The increments done during cow and decrements done after commit cancel out, and the walk is a very expensive way to go about freeing the blocks that are no longer referenced by the new btree root. This commit reduces the transaction overhead by avoiding the need for dead root records. When a non-shared tree block is cow'd, we free the old block at once, and the new block inherits old block's references. When a tree block with reference count > 1 is cow'd, we increase the reference counts of all extents the new block points to by one, and decrease the old block's reference count by one. This dead tree avoidance code removes the need to modify the reference counts of lower level extents when a non-shared tree block is cow'd. But we still need to update back ref for all pointers in the block. This is because the location of the block is recorded in the back ref item. We can solve this by introducing a new type of back ref. The new back ref provides information about pointer's key, level and in which tree the pointer lives. This information allow us to find the pointer by searching the tree. The shortcoming of the new back ref is that it only works for pointers in tree blocks referenced by their owner trees. This is mostly a problem for snapshots, where resolving one of these fuzzy back references would be O(number_of_snapshots) and quite slow. The solution used here is to use the fuzzy back references in the common case where a given tree block is only referenced by one root, and use the full back references when multiple roots have a reference on a given block. This commit adds per subvolume red-black tree to keep trace of cached inodes. The red-black tree helps the balancing code to find cached inodes whose inode numbers within a given range. This commit improves the balancing code by introducing several data structures to keep the state of balancing. The most important one is the back ref cache. It caches how the upper level tree blocks are referenced. This greatly reduce the overhead of checking back ref. The improved balancing code scales significantly better with a large number of snapshots. This is a very large commit and was written in a number of pieces. But, they depend heavily on the disk format change and were squashed together to make sure git bisect didn't end up in a bad state wrt space balancing or the format change. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
unsigned int action:8;
unsigned int type:8;
union {
struct btrfs_tree_ref tree_ref;
struct btrfs_data_ref data_ref;
};
Btrfs: do extent allocation and reference count updates in the background The extent allocation tree maintains a reference count and full back reference information for every extent allocated in the filesystem. For subvolume and snapshot trees, every time a block goes through COW, the new copy of the block adds a reference on every block it points to. If a btree node points to 150 leaves, then the COW code needs to go and add backrefs on 150 different extents, which might be spread all over the extent allocation tree. These updates currently happen during btrfs_cow_block, and most COWs happen during btrfs_search_slot. btrfs_search_slot has locks held on both the parent and the node we are COWing, and so we really want to avoid IO during the COW if we can. This commit adds an rbtree of pending reference count updates and extent allocations. The tree is ordered by byte number of the extent and byte number of the parent for the back reference. The tree allows us to: 1) Modify back references in something close to disk order, reducing seeks 2) Significantly reduce the number of modifications made as block pointers are balanced around 3) Do all of the extent insertion and back reference modifications outside of the performance critical btrfs_search_slot code. #3 has the added benefit of greatly reducing the btrfs stack footprint. The extent allocation tree modifications are done without the deep (and somewhat recursive) call chains used in the past. These delayed back reference updates must be done before the transaction commits, and so the rbtree is tied to the transaction. Throttling is implemented to help keep the queue of backrefs at a reasonable size. Since there was a similar mechanism in place for the extent tree extents, that is removed and replaced by the delayed reference tree. Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-13 22:10:06 +08:00
};
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE) This commit introduces a new kind of back reference for btrfs metadata. Once a filesystem has been mounted with this commit, IT WILL NO LONGER BE MOUNTABLE BY OLDER KERNELS. When a tree block in subvolume tree is cow'd, the reference counts of all extents it points to are increased by one. At transaction commit time, the old root of the subvolume is recorded in a "dead root" data structure, and the btree it points to is later walked, dropping reference counts and freeing any blocks where the reference count goes to 0. The increments done during cow and decrements done after commit cancel out, and the walk is a very expensive way to go about freeing the blocks that are no longer referenced by the new btree root. This commit reduces the transaction overhead by avoiding the need for dead root records. When a non-shared tree block is cow'd, we free the old block at once, and the new block inherits old block's references. When a tree block with reference count > 1 is cow'd, we increase the reference counts of all extents the new block points to by one, and decrease the old block's reference count by one. This dead tree avoidance code removes the need to modify the reference counts of lower level extents when a non-shared tree block is cow'd. But we still need to update back ref for all pointers in the block. This is because the location of the block is recorded in the back ref item. We can solve this by introducing a new type of back ref. The new back ref provides information about pointer's key, level and in which tree the pointer lives. This information allow us to find the pointer by searching the tree. The shortcoming of the new back ref is that it only works for pointers in tree blocks referenced by their owner trees. This is mostly a problem for snapshots, where resolving one of these fuzzy back references would be O(number_of_snapshots) and quite slow. The solution used here is to use the fuzzy back references in the common case where a given tree block is only referenced by one root, and use the full back references when multiple roots have a reference on a given block. This commit adds per subvolume red-black tree to keep trace of cached inodes. The red-black tree helps the balancing code to find cached inodes whose inode numbers within a given range. This commit improves the balancing code by introducing several data structures to keep the state of balancing. The most important one is the back ref cache. It caches how the upper level tree blocks are referenced. This greatly reduce the overhead of checking back ref. The improved balancing code scales significantly better with a large number of snapshots. This is a very large commit and was written in a number of pieces. But, they depend heavily on the disk format change and were squashed together to make sure git bisect didn't end up in a bad state wrt space balancing or the format change. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
struct btrfs_delayed_extent_op {
struct btrfs_disk_key key;
u8 level;
bool update_key;
bool update_flags;
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE) This commit introduces a new kind of back reference for btrfs metadata. Once a filesystem has been mounted with this commit, IT WILL NO LONGER BE MOUNTABLE BY OLDER KERNELS. When a tree block in subvolume tree is cow'd, the reference counts of all extents it points to are increased by one. At transaction commit time, the old root of the subvolume is recorded in a "dead root" data structure, and the btree it points to is later walked, dropping reference counts and freeing any blocks where the reference count goes to 0. The increments done during cow and decrements done after commit cancel out, and the walk is a very expensive way to go about freeing the blocks that are no longer referenced by the new btree root. This commit reduces the transaction overhead by avoiding the need for dead root records. When a non-shared tree block is cow'd, we free the old block at once, and the new block inherits old block's references. When a tree block with reference count > 1 is cow'd, we increase the reference counts of all extents the new block points to by one, and decrease the old block's reference count by one. This dead tree avoidance code removes the need to modify the reference counts of lower level extents when a non-shared tree block is cow'd. But we still need to update back ref for all pointers in the block. This is because the location of the block is recorded in the back ref item. We can solve this by introducing a new type of back ref. The new back ref provides information about pointer's key, level and in which tree the pointer lives. This information allow us to find the pointer by searching the tree. The shortcoming of the new back ref is that it only works for pointers in tree blocks referenced by their owner trees. This is mostly a problem for snapshots, where resolving one of these fuzzy back references would be O(number_of_snapshots) and quite slow. The solution used here is to use the fuzzy back references in the common case where a given tree block is only referenced by one root, and use the full back references when multiple roots have a reference on a given block. This commit adds per subvolume red-black tree to keep trace of cached inodes. The red-black tree helps the balancing code to find cached inodes whose inode numbers within a given range. This commit improves the balancing code by introducing several data structures to keep the state of balancing. The most important one is the back ref cache. It caches how the upper level tree blocks are referenced. This greatly reduce the overhead of checking back ref. The improved balancing code scales significantly better with a large number of snapshots. This is a very large commit and was written in a number of pieces. But, they depend heavily on the disk format change and were squashed together to make sure git bisect didn't end up in a bad state wrt space balancing or the format change. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
u64 flags_to_set;
};
Btrfs: do extent allocation and reference count updates in the background The extent allocation tree maintains a reference count and full back reference information for every extent allocated in the filesystem. For subvolume and snapshot trees, every time a block goes through COW, the new copy of the block adds a reference on every block it points to. If a btree node points to 150 leaves, then the COW code needs to go and add backrefs on 150 different extents, which might be spread all over the extent allocation tree. These updates currently happen during btrfs_cow_block, and most COWs happen during btrfs_search_slot. btrfs_search_slot has locks held on both the parent and the node we are COWing, and so we really want to avoid IO during the COW if we can. This commit adds an rbtree of pending reference count updates and extent allocations. The tree is ordered by byte number of the extent and byte number of the parent for the back reference. The tree allows us to: 1) Modify back references in something close to disk order, reducing seeks 2) Significantly reduce the number of modifications made as block pointers are balanced around 3) Do all of the extent insertion and back reference modifications outside of the performance critical btrfs_search_slot code. #3 has the added benefit of greatly reducing the btrfs stack footprint. The extent allocation tree modifications are done without the deep (and somewhat recursive) call chains used in the past. These delayed back reference updates must be done before the transaction commits, and so the rbtree is tied to the transaction. Throttling is implemented to help keep the queue of backrefs at a reasonable size. Since there was a similar mechanism in place for the extent tree extents, that is removed and replaced by the delayed reference tree. Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-13 22:10:06 +08:00
/*
* the head refs are used to hold a lock on a given extent, which allows us
* to make sure that only one process is running the delayed refs
* at a time for a single extent. They also store the sum of all the
* reference count modifications we've queued up.
*/
struct btrfs_delayed_ref_head {
u64 bytenr;
u64 num_bytes;
btrfs: reorder some members of struct btrfs_delayed_ref_head Currently struct delayed_ref_head has its 'bytenr' and 'href_node' members in different cache lines (even on a release, non-debug, kernel). This is not optimal because when iterating the red black tree of delayed ref heads for inserting a new delayed ref head (htree_insert()) we have to pull in 2 cache lines of delayed ref heads we find in a patch, one for the tree node (struct rb_node) and another one for the 'bytenr' field. The same applies when searching for an existing delayed ref head (find_ref_head()). On a release (non-debug) kernel, the structure also has two 4 bytes holes, which makes it 8 bytes longer than necessary. Its current layout is the following: struct btrfs_delayed_ref_head { u64 bytenr; /* 0 8 */ u64 num_bytes; /* 8 8 */ refcount_t refs; /* 16 4 */ /* XXX 4 bytes hole, try to pack */ struct mutex mutex; /* 24 32 */ spinlock_t lock; /* 56 4 */ /* XXX 4 bytes hole, try to pack */ /* --- cacheline 1 boundary (64 bytes) --- */ struct rb_root_cached ref_tree; /* 64 16 */ struct list_head ref_add_list; /* 80 16 */ struct rb_node href_node __attribute__((__aligned__(8))); /* 96 24 */ struct btrfs_delayed_extent_op * extent_op; /* 120 8 */ /* --- cacheline 2 boundary (128 bytes) --- */ int total_ref_mod; /* 128 4 */ int ref_mod; /* 132 4 */ unsigned int must_insert_reserved:1; /* 136: 0 4 */ unsigned int is_data:1; /* 136: 1 4 */ unsigned int is_system:1; /* 136: 2 4 */ unsigned int processing:1; /* 136: 3 4 */ /* size: 144, cachelines: 3, members: 15 */ /* sum members: 128, holes: 2, sum holes: 8 */ /* sum bitfield members: 4 bits (0 bytes) */ /* padding: 4 */ /* bit_padding: 28 bits */ /* forced alignments: 1 */ /* last cacheline: 16 bytes */ } __attribute__((__aligned__(8))); This change reorders the 'href_node' and 'refs' members so that we have the 'href_node' in the same cache line as the 'bytenr' field, while also eliminating the two holes and reducing the structure size from 144 bytes down to 136 bytes, so we can now have 30 ref heads per 4K page (on x86_64) instead of 28. The new structure layout after this change is now: struct btrfs_delayed_ref_head { u64 bytenr; /* 0 8 */ u64 num_bytes; /* 8 8 */ struct rb_node href_node __attribute__((__aligned__(8))); /* 16 24 */ struct mutex mutex; /* 40 32 */ /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */ refcount_t refs; /* 72 4 */ spinlock_t lock; /* 76 4 */ struct rb_root_cached ref_tree; /* 80 16 */ struct list_head ref_add_list; /* 96 16 */ struct btrfs_delayed_extent_op * extent_op; /* 112 8 */ int total_ref_mod; /* 120 4 */ int ref_mod; /* 124 4 */ /* --- cacheline 2 boundary (128 bytes) --- */ unsigned int must_insert_reserved:1; /* 128: 0 4 */ unsigned int is_data:1; /* 128: 1 4 */ unsigned int is_system:1; /* 128: 2 4 */ unsigned int processing:1; /* 128: 3 4 */ /* size: 136, cachelines: 3, members: 15 */ /* padding: 4 */ /* bit_padding: 28 bits */ /* forced alignments: 1 */ /* last cacheline: 8 bytes */ } __attribute__((__aligned__(8))); Running the following fs_mark test shows some significant improvement. $ cat test.sh #!/bin/bash # 15G null block device DEV=/dev/nullb0 MNT=/mnt/nullb0 FILES=100000 THREADS=$(nproc --all) FILE_SIZE=0 echo "performance" | \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor mkfs.btrfs -f $DEV mount -o ssd $DEV $MNT OPTS="-S 0 -L 5 -n $FILES -s $FILE_SIZE -t $THREADS -k" for ((i = 1; i <= $THREADS; i++)); do OPTS="$OPTS -d $MNT/d$i" done fs_mark $OPTS umount $MNT Before this change: FSUse% Count Size Files/sec App Overhead 10 1200000 0 112631.3 11928055 16 2400000 0 189943.8 12140777 23 3600000 0 150719.2 13178480 50 4800000 0 99137.3 12504293 53 6000000 0 111733.9 12670836 Total files/sec: 664165.5 After this change: FSUse% Count Size Files/sec App Overhead 10 1200000 0 148589.5 11565889 16 2400000 0 227743.8 11561596 23 3600000 0 191590.5 12550755 30 4800000 0 179812.3 12629610 53 6000000 0 92471.4 12352383 Total files/sec: 840207.5 Measuring the execution times of htree_insert(), in nanoseconds, during those fs_mark runs: Before this change: Range: 0.000 - 940647.000; Mean: 619.733; Median: 548.000; Stddev: 1834.231 Percentiles: 90th: 980.000; 95th: 1208.000; 99th: 2090.000 0.000 - 6.384: 257 | 6.384 - 26.259: 977 | 26.259 - 99.635: 4963 | 99.635 - 370.526: 136800 ############# 370.526 - 1370.603: 566110 ##################################################### 1370.603 - 5062.704: 24945 ## 5062.704 - 18693.248: 944 | 18693.248 - 69014.670: 211 | 69014.670 - 254791.959: 30 | 254791.959 - 940647.000: 4 | After this change: Range: 0.000 - 299200.000; Mean: 587.754; Median: 542.000; Stddev: 1030.422 Percentiles: 90th: 918.000; 95th: 1113.000; 99th: 1987.000 0.000 - 5.585: 163 | 5.585 - 20.678: 452 | 20.678 - 70.369: 1806 | 70.369 - 233.965: 26268 #### 233.965 - 772.564: 333519 ##################################################### 772.564 - 2545.771: 91820 ############### 2545.771 - 8383.615: 2238 | 8383.615 - 27603.280: 170 | 27603.280 - 90879.297: 68 | 90879.297 - 299200.000: 12 | Mean, percentiles, maximum times are all better, as well as a lower standard deviation. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-05-29 23:16:57 +08:00
/*
* For insertion into struct btrfs_delayed_ref_root::href_root.
* Keep it in the same cache line as 'bytenr' for more efficient
* searches in the rbtree.
*/
struct rb_node href_node;
Btrfs: do extent allocation and reference count updates in the background The extent allocation tree maintains a reference count and full back reference information for every extent allocated in the filesystem. For subvolume and snapshot trees, every time a block goes through COW, the new copy of the block adds a reference on every block it points to. If a btree node points to 150 leaves, then the COW code needs to go and add backrefs on 150 different extents, which might be spread all over the extent allocation tree. These updates currently happen during btrfs_cow_block, and most COWs happen during btrfs_search_slot. btrfs_search_slot has locks held on both the parent and the node we are COWing, and so we really want to avoid IO during the COW if we can. This commit adds an rbtree of pending reference count updates and extent allocations. The tree is ordered by byte number of the extent and byte number of the parent for the back reference. The tree allows us to: 1) Modify back references in something close to disk order, reducing seeks 2) Significantly reduce the number of modifications made as block pointers are balanced around 3) Do all of the extent insertion and back reference modifications outside of the performance critical btrfs_search_slot code. #3 has the added benefit of greatly reducing the btrfs stack footprint. The extent allocation tree modifications are done without the deep (and somewhat recursive) call chains used in the past. These delayed back reference updates must be done before the transaction commits, and so the rbtree is tied to the transaction. Throttling is implemented to help keep the queue of backrefs at a reasonable size. Since there was a similar mechanism in place for the extent tree extents, that is removed and replaced by the delayed reference tree. Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-13 22:10:06 +08:00
/*
* the mutex is held while running the refs, and it is also
* held when checking the sum of reference modifications.
*/
struct mutex mutex;
btrfs: reorder some members of struct btrfs_delayed_ref_head Currently struct delayed_ref_head has its 'bytenr' and 'href_node' members in different cache lines (even on a release, non-debug, kernel). This is not optimal because when iterating the red black tree of delayed ref heads for inserting a new delayed ref head (htree_insert()) we have to pull in 2 cache lines of delayed ref heads we find in a patch, one for the tree node (struct rb_node) and another one for the 'bytenr' field. The same applies when searching for an existing delayed ref head (find_ref_head()). On a release (non-debug) kernel, the structure also has two 4 bytes holes, which makes it 8 bytes longer than necessary. Its current layout is the following: struct btrfs_delayed_ref_head { u64 bytenr; /* 0 8 */ u64 num_bytes; /* 8 8 */ refcount_t refs; /* 16 4 */ /* XXX 4 bytes hole, try to pack */ struct mutex mutex; /* 24 32 */ spinlock_t lock; /* 56 4 */ /* XXX 4 bytes hole, try to pack */ /* --- cacheline 1 boundary (64 bytes) --- */ struct rb_root_cached ref_tree; /* 64 16 */ struct list_head ref_add_list; /* 80 16 */ struct rb_node href_node __attribute__((__aligned__(8))); /* 96 24 */ struct btrfs_delayed_extent_op * extent_op; /* 120 8 */ /* --- cacheline 2 boundary (128 bytes) --- */ int total_ref_mod; /* 128 4 */ int ref_mod; /* 132 4 */ unsigned int must_insert_reserved:1; /* 136: 0 4 */ unsigned int is_data:1; /* 136: 1 4 */ unsigned int is_system:1; /* 136: 2 4 */ unsigned int processing:1; /* 136: 3 4 */ /* size: 144, cachelines: 3, members: 15 */ /* sum members: 128, holes: 2, sum holes: 8 */ /* sum bitfield members: 4 bits (0 bytes) */ /* padding: 4 */ /* bit_padding: 28 bits */ /* forced alignments: 1 */ /* last cacheline: 16 bytes */ } __attribute__((__aligned__(8))); This change reorders the 'href_node' and 'refs' members so that we have the 'href_node' in the same cache line as the 'bytenr' field, while also eliminating the two holes and reducing the structure size from 144 bytes down to 136 bytes, so we can now have 30 ref heads per 4K page (on x86_64) instead of 28. The new structure layout after this change is now: struct btrfs_delayed_ref_head { u64 bytenr; /* 0 8 */ u64 num_bytes; /* 8 8 */ struct rb_node href_node __attribute__((__aligned__(8))); /* 16 24 */ struct mutex mutex; /* 40 32 */ /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */ refcount_t refs; /* 72 4 */ spinlock_t lock; /* 76 4 */ struct rb_root_cached ref_tree; /* 80 16 */ struct list_head ref_add_list; /* 96 16 */ struct btrfs_delayed_extent_op * extent_op; /* 112 8 */ int total_ref_mod; /* 120 4 */ int ref_mod; /* 124 4 */ /* --- cacheline 2 boundary (128 bytes) --- */ unsigned int must_insert_reserved:1; /* 128: 0 4 */ unsigned int is_data:1; /* 128: 1 4 */ unsigned int is_system:1; /* 128: 2 4 */ unsigned int processing:1; /* 128: 3 4 */ /* size: 136, cachelines: 3, members: 15 */ /* padding: 4 */ /* bit_padding: 28 bits */ /* forced alignments: 1 */ /* last cacheline: 8 bytes */ } __attribute__((__aligned__(8))); Running the following fs_mark test shows some significant improvement. $ cat test.sh #!/bin/bash # 15G null block device DEV=/dev/nullb0 MNT=/mnt/nullb0 FILES=100000 THREADS=$(nproc --all) FILE_SIZE=0 echo "performance" | \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor mkfs.btrfs -f $DEV mount -o ssd $DEV $MNT OPTS="-S 0 -L 5 -n $FILES -s $FILE_SIZE -t $THREADS -k" for ((i = 1; i <= $THREADS; i++)); do OPTS="$OPTS -d $MNT/d$i" done fs_mark $OPTS umount $MNT Before this change: FSUse% Count Size Files/sec App Overhead 10 1200000 0 112631.3 11928055 16 2400000 0 189943.8 12140777 23 3600000 0 150719.2 13178480 50 4800000 0 99137.3 12504293 53 6000000 0 111733.9 12670836 Total files/sec: 664165.5 After this change: FSUse% Count Size Files/sec App Overhead 10 1200000 0 148589.5 11565889 16 2400000 0 227743.8 11561596 23 3600000 0 191590.5 12550755 30 4800000 0 179812.3 12629610 53 6000000 0 92471.4 12352383 Total files/sec: 840207.5 Measuring the execution times of htree_insert(), in nanoseconds, during those fs_mark runs: Before this change: Range: 0.000 - 940647.000; Mean: 619.733; Median: 548.000; Stddev: 1834.231 Percentiles: 90th: 980.000; 95th: 1208.000; 99th: 2090.000 0.000 - 6.384: 257 | 6.384 - 26.259: 977 | 26.259 - 99.635: 4963 | 99.635 - 370.526: 136800 ############# 370.526 - 1370.603: 566110 ##################################################### 1370.603 - 5062.704: 24945 ## 5062.704 - 18693.248: 944 | 18693.248 - 69014.670: 211 | 69014.670 - 254791.959: 30 | 254791.959 - 940647.000: 4 | After this change: Range: 0.000 - 299200.000; Mean: 587.754; Median: 542.000; Stddev: 1030.422 Percentiles: 90th: 918.000; 95th: 1113.000; 99th: 1987.000 0.000 - 5.585: 163 | 5.585 - 20.678: 452 | 20.678 - 70.369: 1806 | 70.369 - 233.965: 26268 #### 233.965 - 772.564: 333519 ##################################################### 772.564 - 2545.771: 91820 ############### 2545.771 - 8383.615: 2238 | 8383.615 - 27603.280: 170 | 27603.280 - 90879.297: 68 | 90879.297 - 299200.000: 12 | Mean, percentiles, maximum times are all better, as well as a lower standard deviation. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-05-29 23:16:57 +08:00
refcount_t refs;
/* Protects 'ref_tree' and 'ref_add_list'. */
spinlock_t lock;
struct rb_root_cached ref_tree;
btrfs: improve delayed refs iterations This issue was found when I tried to delete a heavily reflinked file, when deleting such files, other transaction operation will not have a chance to make progress, for example, start_transaction() will blocked in wait_current_trans(root) for long time, sometimes it even triggers soft lockups, and the time taken to delete such heavily reflinked file is also very large, often hundreds of seconds. Using perf top, it reports that: PerfTop: 7416 irqs/sec kernel:99.8% exact: 0.0% [4000Hz cpu-clock], (all, 4 CPUs) --------------------------------------------------------------------------------------- 84.37% [btrfs] [k] __btrfs_run_delayed_refs.constprop.80 11.02% [kernel] [k] delay_tsc 0.79% [kernel] [k] _raw_spin_unlock_irq 0.78% [kernel] [k] _raw_spin_unlock_irqrestore 0.45% [kernel] [k] do_raw_spin_lock 0.18% [kernel] [k] __slab_alloc It seems __btrfs_run_delayed_refs() took most cpu time, after some debug work, I found it's select_delayed_ref() causing this issue, for a delayed head, in our case, it'll be full of BTRFS_DROP_DELAYED_REF nodes, but select_delayed_ref() will firstly try to iterate node list to find BTRFS_ADD_DELAYED_REF nodes, obviously it's a disaster in this case, and waste much time. To fix this issue, we introduce a new ref_add_list in struct btrfs_delayed_ref_head, then in select_delayed_ref(), if this list is not empty, we can directly use nodes in this list. With this patch, it just took about 10~15 seconds to delte the same file. Now using perf top, it reports that: PerfTop: 2734 irqs/sec kernel:99.5% exact: 0.0% [4000Hz cpu-clock], (all, 4 CPUs) ---------------------------------------------------------------------------------------- 20.74% [kernel] [k] _raw_spin_unlock_irqrestore 16.33% [kernel] [k] __slab_alloc 5.41% [kernel] [k] lock_acquired 4.42% [kernel] [k] lock_acquire 4.05% [kernel] [k] lock_release 3.37% [kernel] [k] _raw_spin_unlock_irq For normal files, this patch also gives help, at least we do not need to iterate whole list to found BTRFS_ADD_DELAYED_REF nodes. Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-10-26 18:07:33 +08:00
/* accumulate add BTRFS_ADD_DELAYED_REF nodes to this ref_add_list. */
struct list_head ref_add_list;
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE) This commit introduces a new kind of back reference for btrfs metadata. Once a filesystem has been mounted with this commit, IT WILL NO LONGER BE MOUNTABLE BY OLDER KERNELS. When a tree block in subvolume tree is cow'd, the reference counts of all extents it points to are increased by one. At transaction commit time, the old root of the subvolume is recorded in a "dead root" data structure, and the btree it points to is later walked, dropping reference counts and freeing any blocks where the reference count goes to 0. The increments done during cow and decrements done after commit cancel out, and the walk is a very expensive way to go about freeing the blocks that are no longer referenced by the new btree root. This commit reduces the transaction overhead by avoiding the need for dead root records. When a non-shared tree block is cow'd, we free the old block at once, and the new block inherits old block's references. When a tree block with reference count > 1 is cow'd, we increase the reference counts of all extents the new block points to by one, and decrease the old block's reference count by one. This dead tree avoidance code removes the need to modify the reference counts of lower level extents when a non-shared tree block is cow'd. But we still need to update back ref for all pointers in the block. This is because the location of the block is recorded in the back ref item. We can solve this by introducing a new type of back ref. The new back ref provides information about pointer's key, level and in which tree the pointer lives. This information allow us to find the pointer by searching the tree. The shortcoming of the new back ref is that it only works for pointers in tree blocks referenced by their owner trees. This is mostly a problem for snapshots, where resolving one of these fuzzy back references would be O(number_of_snapshots) and quite slow. The solution used here is to use the fuzzy back references in the common case where a given tree block is only referenced by one root, and use the full back references when multiple roots have a reference on a given block. This commit adds per subvolume red-black tree to keep trace of cached inodes. The red-black tree helps the balancing code to find cached inodes whose inode numbers within a given range. This commit improves the balancing code by introducing several data structures to keep the state of balancing. The most important one is the back ref cache. It caches how the upper level tree blocks are referenced. This greatly reduce the overhead of checking back ref. The improved balancing code scales significantly better with a large number of snapshots. This is a very large commit and was written in a number of pieces. But, they depend heavily on the disk format change and were squashed together to make sure git bisect didn't end up in a bad state wrt space balancing or the format change. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
struct btrfs_delayed_extent_op *extent_op;
/*
* This is used to track the final ref_mod from all the refs associated
* with this head ref, this is not adjusted as delayed refs are run,
* this is meant to track if we need to do the csum accounting or not.
*/
int total_ref_mod;
/*
* This is the current outstanding mod references for this bytenr. This
* is used with lookup_extent_info to get an accurate reference count
* for a bytenr, so it is adjusted as delayed refs are run so that any
* on disk reference count + ref_mod is accurate.
*/
int ref_mod;
/*
* The root that triggered the allocation when must_insert_reserved is
* set to true.
*/
u64 owning_root;
/*
* Track reserved bytes when setting must_insert_reserved. On success
* or cleanup, we will need to free the reservation.
*/
u64 reserved_bytes;
Btrfs: do extent allocation and reference count updates in the background The extent allocation tree maintains a reference count and full back reference information for every extent allocated in the filesystem. For subvolume and snapshot trees, every time a block goes through COW, the new copy of the block adds a reference on every block it points to. If a btree node points to 150 leaves, then the COW code needs to go and add backrefs on 150 different extents, which might be spread all over the extent allocation tree. These updates currently happen during btrfs_cow_block, and most COWs happen during btrfs_search_slot. btrfs_search_slot has locks held on both the parent and the node we are COWing, and so we really want to avoid IO during the COW if we can. This commit adds an rbtree of pending reference count updates and extent allocations. The tree is ordered by byte number of the extent and byte number of the parent for the back reference. The tree allows us to: 1) Modify back references in something close to disk order, reducing seeks 2) Significantly reduce the number of modifications made as block pointers are balanced around 3) Do all of the extent insertion and back reference modifications outside of the performance critical btrfs_search_slot code. #3 has the added benefit of greatly reducing the btrfs stack footprint. The extent allocation tree modifications are done without the deep (and somewhat recursive) call chains used in the past. These delayed back reference updates must be done before the transaction commits, and so the rbtree is tied to the transaction. Throttling is implemented to help keep the queue of backrefs at a reasonable size. Since there was a similar mechanism in place for the extent tree extents, that is removed and replaced by the delayed reference tree. Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-13 22:10:06 +08:00
/*
* when a new extent is allocated, it is just reserved in memory
* The actual extent isn't inserted into the extent allocation tree
* until the delayed ref is processed. must_insert_reserved is
* used to flag a delayed ref so the accounting can be updated
* when a full insert is done.
*
* It is possible the extent will be freed before it is ever
* inserted into the extent allocation tree. In this case
* we need to update the in ram accounting to properly reflect
* the free has happened.
*/
bool must_insert_reserved;
bool is_data;
bool is_system;
bool processing;
Btrfs: do extent allocation and reference count updates in the background The extent allocation tree maintains a reference count and full back reference information for every extent allocated in the filesystem. For subvolume and snapshot trees, every time a block goes through COW, the new copy of the block adds a reference on every block it points to. If a btree node points to 150 leaves, then the COW code needs to go and add backrefs on 150 different extents, which might be spread all over the extent allocation tree. These updates currently happen during btrfs_cow_block, and most COWs happen during btrfs_search_slot. btrfs_search_slot has locks held on both the parent and the node we are COWing, and so we really want to avoid IO during the COW if we can. This commit adds an rbtree of pending reference count updates and extent allocations. The tree is ordered by byte number of the extent and byte number of the parent for the back reference. The tree allows us to: 1) Modify back references in something close to disk order, reducing seeks 2) Significantly reduce the number of modifications made as block pointers are balanced around 3) Do all of the extent insertion and back reference modifications outside of the performance critical btrfs_search_slot code. #3 has the added benefit of greatly reducing the btrfs stack footprint. The extent allocation tree modifications are done without the deep (and somewhat recursive) call chains used in the past. These delayed back reference updates must be done before the transaction commits, and so the rbtree is tied to the transaction. Throttling is implemented to help keep the queue of backrefs at a reasonable size. Since there was a similar mechanism in place for the extent tree extents, that is removed and replaced by the delayed reference tree. Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-13 22:10:06 +08:00
};
btrfs: only let one thread pre-flush delayed refs in commit I've been running a stress test that runs 20 workers in their own subvolume, which are running an fsstress instance with 4 threads per worker, which is 80 total fsstress threads. In addition to this I'm running balance in the background as well as creating and deleting snapshots. This test takes around 12 hours to run normally, going slower and slower as the test goes on. The reason for this is because fsstress is running fsync sometimes, and because we're messing with block groups we often fall through to btrfs_commit_transaction, so will often have 20-30 threads all calling btrfs_commit_transaction at the same time. These all get stuck contending on the extent tree while they try to run delayed refs during the initial part of the commit. This is suboptimal, really because the extent tree is a single point of failure we only want one thread acting on that tree at once to reduce lock contention. Fix this by making the flushing mechanism a bit operation, to make it easy to use test_and_set_bit() in order to make sure only one task does this initial flush. Once we're into the transaction commit we only have one thread doing delayed ref running, it's just this initial pre-flush that is problematic. With this patch my stress test takes around 90 minutes to run, instead of 12 hours. The memory barrier is not necessary for the flushing bit as it's ordered, unlike plain int. The transaction state accessed in btrfs_should_end_transaction could be affected by that too as it's not always used under transaction lock. Upon Nikolay's analysis in [1] it's not necessary: In should_end_transaction it's read without holding any locks. (U) It's modified in btrfs_cleanup_transaction without holding the fs_info->trans_lock (U), but the STATE_ERROR flag is going to be set. set in cleanup_transaction under fs_info->trans_lock (L) set in btrfs_commit_trans to COMMIT_START under fs_info->trans_lock.(L) set in btrfs_commit_trans to COMMIT_DOING under fs_info->trans_lock.(L) set in btrfs_commit_trans to COMMIT_UNBLOCK under fs_info->trans_lock.(L) set in btrfs_commit_trans to COMMIT_COMPLETED without locks but at this point the transaction is finished and fs_info->running_trans is NULL (U but irrelevant). So by the looks of it we can have a concurrent READ race with a WRITE, due to reads not taking a lock. In this case what we want to ensure is we either see new or old state. I consulted with Will Deacon and he said that in such a case we'd want to annotate the accesses to ->state with (READ|WRITE)_ONCE so as to avoid a theoretical tear, in this case I don't think this could happen but I imagine at some point KCSAN would flag such an access as racy (which it is). [1] https://lore.kernel.org/linux-btrfs/e1fd5cc1-0f28-f670-69f4-e9958b4964e6@suse.com Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> [ add comments regarding memory barrier ] Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-19 03:24:20 +08:00
enum btrfs_delayed_ref_flags {
/* Indicate that we are flushing delayed refs for the commit */
BTRFS_DELAYED_REFS_FLUSHING,
};
Btrfs: do extent allocation and reference count updates in the background The extent allocation tree maintains a reference count and full back reference information for every extent allocated in the filesystem. For subvolume and snapshot trees, every time a block goes through COW, the new copy of the block adds a reference on every block it points to. If a btree node points to 150 leaves, then the COW code needs to go and add backrefs on 150 different extents, which might be spread all over the extent allocation tree. These updates currently happen during btrfs_cow_block, and most COWs happen during btrfs_search_slot. btrfs_search_slot has locks held on both the parent and the node we are COWing, and so we really want to avoid IO during the COW if we can. This commit adds an rbtree of pending reference count updates and extent allocations. The tree is ordered by byte number of the extent and byte number of the parent for the back reference. The tree allows us to: 1) Modify back references in something close to disk order, reducing seeks 2) Significantly reduce the number of modifications made as block pointers are balanced around 3) Do all of the extent insertion and back reference modifications outside of the performance critical btrfs_search_slot code. #3 has the added benefit of greatly reducing the btrfs stack footprint. The extent allocation tree modifications are done without the deep (and somewhat recursive) call chains used in the past. These delayed back reference updates must be done before the transaction commits, and so the rbtree is tied to the transaction. Throttling is implemented to help keep the queue of backrefs at a reasonable size. Since there was a similar mechanism in place for the extent tree extents, that is removed and replaced by the delayed reference tree. Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-13 22:10:06 +08:00
struct btrfs_delayed_ref_root {
/* head ref rbtree */
Btrfs: delayed-refs: use rb_first_cached for href_root rb_first_cached() trades an extra pointer "leftmost" for doing the same job as rb_first() but in O(1). Functions manipulating href_root need to get the first entry, this converts href_root to use rb_first_cached(). This patch is first in the sequenct of similar updates to other rbtrees and this is analysis of the expected behaviour and improvements. There's a common pattern: while (node = rb_first) { entry = rb_entry(node) next = rb_next(node) rb_erase(node) cleanup(entry) } rb_first needs to traverse the tree up to logN depth, rb_erase can completely reshuffle the tree. With the caching we'll skip the traversal in rb_first. That's a cached memory access vs looped pointer dereference trade-off that IMHO has a clear winner. Measurements show there's not much difference in a sample tree with 10000 nodes: 4.5s / rb_first and 4.8s / rb_first_cached. Real effects of caching and pointer chasing are unpredictable though. Further optimzations can be done to avoid the expensive rb_erase step. In some cases it's ok to process the nodes in any order, so the tree can be traversed in post-order, not rebalancing the children nodes and just calling free. Care must be taken regarding the next node. Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog from mail discussions ] Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-23 03:51:49 +08:00
struct rb_root_cached href_root;
/* dirty extent records */
struct rb_root dirty_extent_root;
Btrfs: do extent allocation and reference count updates in the background The extent allocation tree maintains a reference count and full back reference information for every extent allocated in the filesystem. For subvolume and snapshot trees, every time a block goes through COW, the new copy of the block adds a reference on every block it points to. If a btree node points to 150 leaves, then the COW code needs to go and add backrefs on 150 different extents, which might be spread all over the extent allocation tree. These updates currently happen during btrfs_cow_block, and most COWs happen during btrfs_search_slot. btrfs_search_slot has locks held on both the parent and the node we are COWing, and so we really want to avoid IO during the COW if we can. This commit adds an rbtree of pending reference count updates and extent allocations. The tree is ordered by byte number of the extent and byte number of the parent for the back reference. The tree allows us to: 1) Modify back references in something close to disk order, reducing seeks 2) Significantly reduce the number of modifications made as block pointers are balanced around 3) Do all of the extent insertion and back reference modifications outside of the performance critical btrfs_search_slot code. #3 has the added benefit of greatly reducing the btrfs stack footprint. The extent allocation tree modifications are done without the deep (and somewhat recursive) call chains used in the past. These delayed back reference updates must be done before the transaction commits, and so the rbtree is tied to the transaction. Throttling is implemented to help keep the queue of backrefs at a reasonable size. Since there was a similar mechanism in place for the extent tree extents, that is removed and replaced by the delayed reference tree. Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-13 22:10:06 +08:00
/* this spin lock protects the rbtree and the entries inside */
spinlock_t lock;
/* how many delayed ref updates we've queued, used by the
* throttling code
*/
atomic_t num_entries;
Btrfs: do extent allocation and reference count updates in the background The extent allocation tree maintains a reference count and full back reference information for every extent allocated in the filesystem. For subvolume and snapshot trees, every time a block goes through COW, the new copy of the block adds a reference on every block it points to. If a btree node points to 150 leaves, then the COW code needs to go and add backrefs on 150 different extents, which might be spread all over the extent allocation tree. These updates currently happen during btrfs_cow_block, and most COWs happen during btrfs_search_slot. btrfs_search_slot has locks held on both the parent and the node we are COWing, and so we really want to avoid IO during the COW if we can. This commit adds an rbtree of pending reference count updates and extent allocations. The tree is ordered by byte number of the extent and byte number of the parent for the back reference. The tree allows us to: 1) Modify back references in something close to disk order, reducing seeks 2) Significantly reduce the number of modifications made as block pointers are balanced around 3) Do all of the extent insertion and back reference modifications outside of the performance critical btrfs_search_slot code. #3 has the added benefit of greatly reducing the btrfs stack footprint. The extent allocation tree modifications are done without the deep (and somewhat recursive) call chains used in the past. These delayed back reference updates must be done before the transaction commits, and so the rbtree is tied to the transaction. Throttling is implemented to help keep the queue of backrefs at a reasonable size. Since there was a similar mechanism in place for the extent tree extents, that is removed and replaced by the delayed reference tree. Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-13 22:10:06 +08:00
/* total number of head nodes in tree */
unsigned long num_heads;
/* total number of head nodes ready for processing */
unsigned long num_heads_ready;
u64 pending_csums;
btrfs: only let one thread pre-flush delayed refs in commit I've been running a stress test that runs 20 workers in their own subvolume, which are running an fsstress instance with 4 threads per worker, which is 80 total fsstress threads. In addition to this I'm running balance in the background as well as creating and deleting snapshots. This test takes around 12 hours to run normally, going slower and slower as the test goes on. The reason for this is because fsstress is running fsync sometimes, and because we're messing with block groups we often fall through to btrfs_commit_transaction, so will often have 20-30 threads all calling btrfs_commit_transaction at the same time. These all get stuck contending on the extent tree while they try to run delayed refs during the initial part of the commit. This is suboptimal, really because the extent tree is a single point of failure we only want one thread acting on that tree at once to reduce lock contention. Fix this by making the flushing mechanism a bit operation, to make it easy to use test_and_set_bit() in order to make sure only one task does this initial flush. Once we're into the transaction commit we only have one thread doing delayed ref running, it's just this initial pre-flush that is problematic. With this patch my stress test takes around 90 minutes to run, instead of 12 hours. The memory barrier is not necessary for the flushing bit as it's ordered, unlike plain int. The transaction state accessed in btrfs_should_end_transaction could be affected by that too as it's not always used under transaction lock. Upon Nikolay's analysis in [1] it's not necessary: In should_end_transaction it's read without holding any locks. (U) It's modified in btrfs_cleanup_transaction without holding the fs_info->trans_lock (U), but the STATE_ERROR flag is going to be set. set in cleanup_transaction under fs_info->trans_lock (L) set in btrfs_commit_trans to COMMIT_START under fs_info->trans_lock.(L) set in btrfs_commit_trans to COMMIT_DOING under fs_info->trans_lock.(L) set in btrfs_commit_trans to COMMIT_UNBLOCK under fs_info->trans_lock.(L) set in btrfs_commit_trans to COMMIT_COMPLETED without locks but at this point the transaction is finished and fs_info->running_trans is NULL (U but irrelevant). So by the looks of it we can have a concurrent READ race with a WRITE, due to reads not taking a lock. In this case what we want to ensure is we either see new or old state. I consulted with Will Deacon and he said that in such a case we'd want to annotate the accesses to ->state with (READ|WRITE)_ONCE so as to avoid a theoretical tear, in this case I don't think this could happen but I imagine at some point KCSAN would flag such an access as racy (which it is). [1] https://lore.kernel.org/linux-btrfs/e1fd5cc1-0f28-f670-69f4-e9958b4964e6@suse.com Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> [ add comments regarding memory barrier ] Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-19 03:24:20 +08:00
unsigned long flags;
u64 run_delayed_start;
/*
* To make qgroup to skip given root.
* This is for snapshot, as btrfs_qgroup_inherit() will manually
* modify counters for snapshot and its source, so we should skip
* the snapshot in new_root/old_roots or it will get calculated twice
*/
u64 qgroup_to_skip;
Btrfs: do extent allocation and reference count updates in the background The extent allocation tree maintains a reference count and full back reference information for every extent allocated in the filesystem. For subvolume and snapshot trees, every time a block goes through COW, the new copy of the block adds a reference on every block it points to. If a btree node points to 150 leaves, then the COW code needs to go and add backrefs on 150 different extents, which might be spread all over the extent allocation tree. These updates currently happen during btrfs_cow_block, and most COWs happen during btrfs_search_slot. btrfs_search_slot has locks held on both the parent and the node we are COWing, and so we really want to avoid IO during the COW if we can. This commit adds an rbtree of pending reference count updates and extent allocations. The tree is ordered by byte number of the extent and byte number of the parent for the back reference. The tree allows us to: 1) Modify back references in something close to disk order, reducing seeks 2) Significantly reduce the number of modifications made as block pointers are balanced around 3) Do all of the extent insertion and back reference modifications outside of the performance critical btrfs_search_slot code. #3 has the added benefit of greatly reducing the btrfs stack footprint. The extent allocation tree modifications are done without the deep (and somewhat recursive) call chains used in the past. These delayed back reference updates must be done before the transaction commits, and so the rbtree is tied to the transaction. Throttling is implemented to help keep the queue of backrefs at a reasonable size. Since there was a similar mechanism in place for the extent tree extents, that is removed and replaced by the delayed reference tree. Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-13 22:10:06 +08:00
};
btrfs: delayed-ref: Introduce better documented delayed ref structures Current delayed ref interface has several problems: - Longer and longer parameter lists bytenr num_bytes parent ---------- so far so good ref_root owner offset ---------- I don't feel good now - Different interpretation of the same parameter Above @owner for data ref is inode number (u64), while for tree ref, it's level (int). They are even in different size range. For level we only need 0 ~ 8, while for ino it's BTRFS_FIRST_FREE_OBJECTID ~ BTRFS_LAST_FREE_OBJECTID. And @offset doesn't even make sense for tree ref. Such parameter reuse may look clever as an hidden union, but it destroys code readability. To solve both problems, we introduce a new structure, btrfs_ref to solve them: - Structure instead of long parameter list This makes later expansion easier, and is better documented. - Use btrfs_ref::type to distinguish data and tree ref - Use proper union to store data/tree ref specific structures. - Use separate functions to fill data/tree ref data, with a common generic function to fill common bytenr/num_bytes members. All parameters will find its place in btrfs_ref, and an extra member, @real_root, inspired by ref-verify code, is newly introduced for later qgroup code, to record which tree is triggered by this extent modification. This patch doesn't touch any code, but provides the basis for further refactoring. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-04 14:45:29 +08:00
enum btrfs_ref_type {
BTRFS_REF_NOT_SET,
BTRFS_REF_DATA,
BTRFS_REF_METADATA,
BTRFS_REF_LAST,
} __packed;
btrfs: delayed-ref: Introduce better documented delayed ref structures Current delayed ref interface has several problems: - Longer and longer parameter lists bytenr num_bytes parent ---------- so far so good ref_root owner offset ---------- I don't feel good now - Different interpretation of the same parameter Above @owner for data ref is inode number (u64), while for tree ref, it's level (int). They are even in different size range. For level we only need 0 ~ 8, while for ino it's BTRFS_FIRST_FREE_OBJECTID ~ BTRFS_LAST_FREE_OBJECTID. And @offset doesn't even make sense for tree ref. Such parameter reuse may look clever as an hidden union, but it destroys code readability. To solve both problems, we introduce a new structure, btrfs_ref to solve them: - Structure instead of long parameter list This makes later expansion easier, and is better documented. - Use btrfs_ref::type to distinguish data and tree ref - Use proper union to store data/tree ref specific structures. - Use separate functions to fill data/tree ref data, with a common generic function to fill common bytenr/num_bytes members. All parameters will find its place in btrfs_ref, and an extra member, @real_root, inspired by ref-verify code, is newly introduced for later qgroup code, to record which tree is triggered by this extent modification. This patch doesn't touch any code, but provides the basis for further refactoring. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-04 14:45:29 +08:00
struct btrfs_ref {
enum btrfs_ref_type type;
enum btrfs_delayed_ref_action action;
btrfs: delayed-ref: Introduce better documented delayed ref structures Current delayed ref interface has several problems: - Longer and longer parameter lists bytenr num_bytes parent ---------- so far so good ref_root owner offset ---------- I don't feel good now - Different interpretation of the same parameter Above @owner for data ref is inode number (u64), while for tree ref, it's level (int). They are even in different size range. For level we only need 0 ~ 8, while for ino it's BTRFS_FIRST_FREE_OBJECTID ~ BTRFS_LAST_FREE_OBJECTID. And @offset doesn't even make sense for tree ref. Such parameter reuse may look clever as an hidden union, but it destroys code readability. To solve both problems, we introduce a new structure, btrfs_ref to solve them: - Structure instead of long parameter list This makes later expansion easier, and is better documented. - Use btrfs_ref::type to distinguish data and tree ref - Use proper union to store data/tree ref specific structures. - Use separate functions to fill data/tree ref data, with a common generic function to fill common bytenr/num_bytes members. All parameters will find its place in btrfs_ref, and an extra member, @real_root, inspired by ref-verify code, is newly introduced for later qgroup code, to record which tree is triggered by this extent modification. This patch doesn't touch any code, but provides the basis for further refactoring. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-04 14:45:29 +08:00
/*
* Whether this extent should go through qgroup record.
*
* Normally false, but for certain cases like delayed subtree scan,
* setting this flag can hugely reduce qgroup overhead.
*/
bool skip_qgroup;
#ifdef CONFIG_BTRFS_FS_REF_VERIFY
/* Through which root is this modification. */
btrfs: delayed-ref: Introduce better documented delayed ref structures Current delayed ref interface has several problems: - Longer and longer parameter lists bytenr num_bytes parent ---------- so far so good ref_root owner offset ---------- I don't feel good now - Different interpretation of the same parameter Above @owner for data ref is inode number (u64), while for tree ref, it's level (int). They are even in different size range. For level we only need 0 ~ 8, while for ino it's BTRFS_FIRST_FREE_OBJECTID ~ BTRFS_LAST_FREE_OBJECTID. And @offset doesn't even make sense for tree ref. Such parameter reuse may look clever as an hidden union, but it destroys code readability. To solve both problems, we introduce a new structure, btrfs_ref to solve them: - Structure instead of long parameter list This makes later expansion easier, and is better documented. - Use btrfs_ref::type to distinguish data and tree ref - Use proper union to store data/tree ref specific structures. - Use separate functions to fill data/tree ref data, with a common generic function to fill common bytenr/num_bytes members. All parameters will find its place in btrfs_ref, and an extra member, @real_root, inspired by ref-verify code, is newly introduced for later qgroup code, to record which tree is triggered by this extent modification. This patch doesn't touch any code, but provides the basis for further refactoring. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-04 14:45:29 +08:00
u64 real_root;
#endif
btrfs: delayed-ref: Introduce better documented delayed ref structures Current delayed ref interface has several problems: - Longer and longer parameter lists bytenr num_bytes parent ---------- so far so good ref_root owner offset ---------- I don't feel good now - Different interpretation of the same parameter Above @owner for data ref is inode number (u64), while for tree ref, it's level (int). They are even in different size range. For level we only need 0 ~ 8, while for ino it's BTRFS_FIRST_FREE_OBJECTID ~ BTRFS_LAST_FREE_OBJECTID. And @offset doesn't even make sense for tree ref. Such parameter reuse may look clever as an hidden union, but it destroys code readability. To solve both problems, we introduce a new structure, btrfs_ref to solve them: - Structure instead of long parameter list This makes later expansion easier, and is better documented. - Use btrfs_ref::type to distinguish data and tree ref - Use proper union to store data/tree ref specific structures. - Use separate functions to fill data/tree ref data, with a common generic function to fill common bytenr/num_bytes members. All parameters will find its place in btrfs_ref, and an extra member, @real_root, inspired by ref-verify code, is newly introduced for later qgroup code, to record which tree is triggered by this extent modification. This patch doesn't touch any code, but provides the basis for further refactoring. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-04 14:45:29 +08:00
u64 bytenr;
u64 num_bytes;
u64 owning_root;
btrfs: delayed-ref: Introduce better documented delayed ref structures Current delayed ref interface has several problems: - Longer and longer parameter lists bytenr num_bytes parent ---------- so far so good ref_root owner offset ---------- I don't feel good now - Different interpretation of the same parameter Above @owner for data ref is inode number (u64), while for tree ref, it's level (int). They are even in different size range. For level we only need 0 ~ 8, while for ino it's BTRFS_FIRST_FREE_OBJECTID ~ BTRFS_LAST_FREE_OBJECTID. And @offset doesn't even make sense for tree ref. Such parameter reuse may look clever as an hidden union, but it destroys code readability. To solve both problems, we introduce a new structure, btrfs_ref to solve them: - Structure instead of long parameter list This makes later expansion easier, and is better documented. - Use btrfs_ref::type to distinguish data and tree ref - Use proper union to store data/tree ref specific structures. - Use separate functions to fill data/tree ref data, with a common generic function to fill common bytenr/num_bytes members. All parameters will find its place in btrfs_ref, and an extra member, @real_root, inspired by ref-verify code, is newly introduced for later qgroup code, to record which tree is triggered by this extent modification. This patch doesn't touch any code, but provides the basis for further refactoring. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-04 14:45:29 +08:00
/*
* The root that owns the reference for this reference, this will be set
* or ->parent will be set, depending on what type of reference this is.
*/
u64 ref_root;
btrfs: delayed-ref: Introduce better documented delayed ref structures Current delayed ref interface has several problems: - Longer and longer parameter lists bytenr num_bytes parent ---------- so far so good ref_root owner offset ---------- I don't feel good now - Different interpretation of the same parameter Above @owner for data ref is inode number (u64), while for tree ref, it's level (int). They are even in different size range. For level we only need 0 ~ 8, while for ino it's BTRFS_FIRST_FREE_OBJECTID ~ BTRFS_LAST_FREE_OBJECTID. And @offset doesn't even make sense for tree ref. Such parameter reuse may look clever as an hidden union, but it destroys code readability. To solve both problems, we introduce a new structure, btrfs_ref to solve them: - Structure instead of long parameter list This makes later expansion easier, and is better documented. - Use btrfs_ref::type to distinguish data and tree ref - Use proper union to store data/tree ref specific structures. - Use separate functions to fill data/tree ref data, with a common generic function to fill common bytenr/num_bytes members. All parameters will find its place in btrfs_ref, and an extra member, @real_root, inspired by ref-verify code, is newly introduced for later qgroup code, to record which tree is triggered by this extent modification. This patch doesn't touch any code, but provides the basis for further refactoring. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-04 14:45:29 +08:00
/* Bytenr of the parent tree block */
u64 parent;
union {
struct btrfs_data_ref data_ref;
struct btrfs_tree_ref tree_ref;
};
};
extern struct kmem_cache *btrfs_delayed_ref_head_cachep;
extern struct kmem_cache *btrfs_delayed_ref_node_cachep;
extern struct kmem_cache *btrfs_delayed_extent_op_cachep;
int __init btrfs_delayed_ref_init(void);
void __cold btrfs_delayed_ref_exit(void);
static inline u64 btrfs_calc_delayed_ref_bytes(const struct btrfs_fs_info *fs_info,
int num_delayed_refs)
{
u64 num_bytes;
num_bytes = btrfs_calc_insert_metadata_size(fs_info, num_delayed_refs);
/*
* We have to check the mount option here because we could be enabling
* the free space tree for the first time and don't have the compat_ro
* option set yet.
*
* We need extra reservations if we have the free space tree because
* we'll have to modify that tree as well.
*/
if (btrfs_test_opt(fs_info, FREE_SPACE_TREE))
num_bytes *= 2;
return num_bytes;
}
btrfs: stop doing excessive space reservation for csum deletion Currently when reserving space for deleting the csum items for a data extent, when adding or updating a delayed ref head, we determine how many leaves of csum items we can have and then pass that number to the helper btrfs_calc_delayed_ref_bytes(). This helper is used for calculating space for all tree modifications we need when running delayed references, however the amount of space it computes is excessive for deleting csum items because: 1) It uses btrfs_calc_insert_metadata_size() which is excessive because we only need to delete csum items from the csum tree, we don't need to insert any items, so btrfs_calc_metadata_size() is all we need (as it computes space needed to delete an item); 2) If the free space tree is enabled, it doubles the amount of space, which is pointless for csum deletion since we don't need to touch the free space tree or any other tree other than the csum tree. So improve on this by tracking how many csum deletions we have and using a new helper to calculate space for csum deletions (just a wrapper around btrfs_calc_metadata_size() with a comment). This reduces the amount of space we need to reserve for csum deletions by a factor of 4, and it helps reduce the number of times we have to block space reservations and have the reclaim task enter the space flushing algorithm (flush delayed items, flush delayed refs, etc) in order to satisfy tickets. For example this results in a total time decrease when unlinking (or truncating) files with many extents, as we end up having to block on space metadata reservations less often. Example test: $ cat test.sh #!/bin/bash DEV=/dev/nullb0 MNT=/mnt/test umount $DEV &> /dev/null mkfs.btrfs -f $DEV # Use compression to quickly create files with a lot of extents # (each with a size of 128K). mount -o compress=lzo $DEV $MNT # 100G gives at least 983040 extents with a size of 128K. xfs_io -f -c "pwrite -S 0xab -b 1M 0 120G" $MNT/foobar # Flush all delalloc and clear all metadata from memory. umount $MNT mount -o compress=lzo $DEV $MNT start=$(date +%s%N) rm -f $MNT/foobar end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "rm took $dur milliseconds" umount $MNT Before this change rm took: 7504 milliseconds After this change rm took: 6574 milliseconds (-12.4%) Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-09 01:20:37 +08:00
static inline u64 btrfs_calc_delayed_ref_csum_bytes(const struct btrfs_fs_info *fs_info,
int num_csum_items)
{
/*
* Deleting csum items does not result in new nodes/leaves and does not
* require changing the free space tree, only the csum tree, so this is
* all we need.
*/
return btrfs_calc_metadata_size(fs_info, num_csum_items);
}
void btrfs_init_tree_ref(struct btrfs_ref *generic_ref, int level, u64 mod_root,
bool skip_qgroup);
void btrfs_init_data_ref(struct btrfs_ref *generic_ref, u64 ino, u64 offset,
u64 mod_root, bool skip_qgroup);
btrfs: delayed-ref: Introduce better documented delayed ref structures Current delayed ref interface has several problems: - Longer and longer parameter lists bytenr num_bytes parent ---------- so far so good ref_root owner offset ---------- I don't feel good now - Different interpretation of the same parameter Above @owner for data ref is inode number (u64), while for tree ref, it's level (int). They are even in different size range. For level we only need 0 ~ 8, while for ino it's BTRFS_FIRST_FREE_OBJECTID ~ BTRFS_LAST_FREE_OBJECTID. And @offset doesn't even make sense for tree ref. Such parameter reuse may look clever as an hidden union, but it destroys code readability. To solve both problems, we introduce a new structure, btrfs_ref to solve them: - Structure instead of long parameter list This makes later expansion easier, and is better documented. - Use btrfs_ref::type to distinguish data and tree ref - Use proper union to store data/tree ref specific structures. - Use separate functions to fill data/tree ref data, with a common generic function to fill common bytenr/num_bytes members. All parameters will find its place in btrfs_ref, and an extra member, @real_root, inspired by ref-verify code, is newly introduced for later qgroup code, to record which tree is triggered by this extent modification. This patch doesn't touch any code, but provides the basis for further refactoring. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-04 14:45:29 +08:00
static inline struct btrfs_delayed_extent_op *
btrfs_alloc_delayed_extent_op(void)
{
return kmem_cache_alloc(btrfs_delayed_extent_op_cachep, GFP_NOFS);
}
static inline void
btrfs_free_delayed_extent_op(struct btrfs_delayed_extent_op *op)
{
if (op)
kmem_cache_free(btrfs_delayed_extent_op_cachep, op);
}
void btrfs_put_delayed_ref(struct btrfs_delayed_ref_node *ref);
Btrfs: do extent allocation and reference count updates in the background The extent allocation tree maintains a reference count and full back reference information for every extent allocated in the filesystem. For subvolume and snapshot trees, every time a block goes through COW, the new copy of the block adds a reference on every block it points to. If a btree node points to 150 leaves, then the COW code needs to go and add backrefs on 150 different extents, which might be spread all over the extent allocation tree. These updates currently happen during btrfs_cow_block, and most COWs happen during btrfs_search_slot. btrfs_search_slot has locks held on both the parent and the node we are COWing, and so we really want to avoid IO during the COW if we can. This commit adds an rbtree of pending reference count updates and extent allocations. The tree is ordered by byte number of the extent and byte number of the parent for the back reference. The tree allows us to: 1) Modify back references in something close to disk order, reducing seeks 2) Significantly reduce the number of modifications made as block pointers are balanced around 3) Do all of the extent insertion and back reference modifications outside of the performance critical btrfs_search_slot code. #3 has the added benefit of greatly reducing the btrfs stack footprint. The extent allocation tree modifications are done without the deep (and somewhat recursive) call chains used in the past. These delayed back reference updates must be done before the transaction commits, and so the rbtree is tied to the transaction. Throttling is implemented to help keep the queue of backrefs at a reasonable size. Since there was a similar mechanism in place for the extent tree extents, that is removed and replaced by the delayed reference tree. Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-13 22:10:06 +08:00
static inline u64 btrfs_ref_head_to_space_flags(
struct btrfs_delayed_ref_head *head_ref)
{
if (head_ref->is_data)
return BTRFS_BLOCK_GROUP_DATA;
else if (head_ref->is_system)
return BTRFS_BLOCK_GROUP_SYSTEM;
return BTRFS_BLOCK_GROUP_METADATA;
}
static inline void btrfs_put_delayed_ref_head(struct btrfs_delayed_ref_head *head)
{
if (refcount_dec_and_test(&head->refs))
kmem_cache_free(btrfs_delayed_ref_head_cachep, head);
}
int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans,
struct btrfs_ref *generic_ref,
struct btrfs_delayed_extent_op *extent_op);
int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
struct btrfs_ref *generic_ref,
u64 reserved);
int btrfs_add_delayed_extent_op(struct btrfs_trans_handle *trans,
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE) This commit introduces a new kind of back reference for btrfs metadata. Once a filesystem has been mounted with this commit, IT WILL NO LONGER BE MOUNTABLE BY OLDER KERNELS. When a tree block in subvolume tree is cow'd, the reference counts of all extents it points to are increased by one. At transaction commit time, the old root of the subvolume is recorded in a "dead root" data structure, and the btree it points to is later walked, dropping reference counts and freeing any blocks where the reference count goes to 0. The increments done during cow and decrements done after commit cancel out, and the walk is a very expensive way to go about freeing the blocks that are no longer referenced by the new btree root. This commit reduces the transaction overhead by avoiding the need for dead root records. When a non-shared tree block is cow'd, we free the old block at once, and the new block inherits old block's references. When a tree block with reference count > 1 is cow'd, we increase the reference counts of all extents the new block points to by one, and decrease the old block's reference count by one. This dead tree avoidance code removes the need to modify the reference counts of lower level extents when a non-shared tree block is cow'd. But we still need to update back ref for all pointers in the block. This is because the location of the block is recorded in the back ref item. We can solve this by introducing a new type of back ref. The new back ref provides information about pointer's key, level and in which tree the pointer lives. This information allow us to find the pointer by searching the tree. The shortcoming of the new back ref is that it only works for pointers in tree blocks referenced by their owner trees. This is mostly a problem for snapshots, where resolving one of these fuzzy back references would be O(number_of_snapshots) and quite slow. The solution used here is to use the fuzzy back references in the common case where a given tree block is only referenced by one root, and use the full back references when multiple roots have a reference on a given block. This commit adds per subvolume red-black tree to keep trace of cached inodes. The red-black tree helps the balancing code to find cached inodes whose inode numbers within a given range. This commit improves the balancing code by introducing several data structures to keep the state of balancing. The most important one is the back ref cache. It caches how the upper level tree blocks are referenced. This greatly reduce the overhead of checking back ref. The improved balancing code scales significantly better with a large number of snapshots. This is a very large commit and was written in a number of pieces. But, they depend heavily on the disk format change and were squashed together to make sure git bisect didn't end up in a bad state wrt space balancing or the format change. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
u64 bytenr, u64 num_bytes,
struct btrfs_delayed_extent_op *extent_op);
void btrfs_merge_delayed_refs(struct btrfs_fs_info *fs_info,
Btrfs: allow delayed refs to be merged Daniel Blueman reported a bug with fio+balance on a ramdisk setup. Basically what happens is the balance relocates a tree block which will drop the implicit refs for all of its children and adds a full backref. Once the block is relocated we have to add the implicit refs back, so when we cow the block again we add the implicit refs for its children back. The problem comes when the original drop ref doesn't get run before we add the implicit refs back. The delayed ref stuff will specifically prefer ADD operations over DROP to keep us from freeing up an extent that will have references to it, so we try to add the implicit ref before it is actually removed and we panic. This worked fine before because the add would have just canceled the drop out and we would have been fine. But the backref walking work needs to be able to freeze the delayed ref stuff in time so we have this ever increasing sequence number that gets attached to all new delayed ref updates which makes us not merge refs and we run into this issue. So to fix this we need to merge delayed refs. So everytime we run a clustered ref we need to try and merge all of its delayed refs. The backref walking stuff locks the delayed ref head before processing, so if we have it locked we are safe to merge any refs inside of the sequence number. If there is no sequence number we can merge all refs. Doing this not only fixes our bug but keeps the delayed ref code from adding and removing useless refs and batching together multiple refs into one search instead of one search per delayed ref, which will really help our commit times. I ran this with Daniels test and 276 and I haven't seen any problems. Thanks, Reported-by: Daniel J Blueman <daniel@quora.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-08-08 04:00:32 +08:00
struct btrfs_delayed_ref_root *delayed_refs,
struct btrfs_delayed_ref_head *head);
Btrfs: do extent allocation and reference count updates in the background The extent allocation tree maintains a reference count and full back reference information for every extent allocated in the filesystem. For subvolume and snapshot trees, every time a block goes through COW, the new copy of the block adds a reference on every block it points to. If a btree node points to 150 leaves, then the COW code needs to go and add backrefs on 150 different extents, which might be spread all over the extent allocation tree. These updates currently happen during btrfs_cow_block, and most COWs happen during btrfs_search_slot. btrfs_search_slot has locks held on both the parent and the node we are COWing, and so we really want to avoid IO during the COW if we can. This commit adds an rbtree of pending reference count updates and extent allocations. The tree is ordered by byte number of the extent and byte number of the parent for the back reference. The tree allows us to: 1) Modify back references in something close to disk order, reducing seeks 2) Significantly reduce the number of modifications made as block pointers are balanced around 3) Do all of the extent insertion and back reference modifications outside of the performance critical btrfs_search_slot code. #3 has the added benefit of greatly reducing the btrfs stack footprint. The extent allocation tree modifications are done without the deep (and somewhat recursive) call chains used in the past. These delayed back reference updates must be done before the transaction commits, and so the rbtree is tied to the transaction. Throttling is implemented to help keep the queue of backrefs at a reasonable size. Since there was a similar mechanism in place for the extent tree extents, that is removed and replaced by the delayed reference tree. Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-13 22:10:06 +08:00
struct btrfs_delayed_ref_head *
btrfs_find_delayed_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
u64 bytenr);
int btrfs_delayed_ref_lock(struct btrfs_delayed_ref_root *delayed_refs,
struct btrfs_delayed_ref_head *head);
static inline void btrfs_delayed_ref_unlock(struct btrfs_delayed_ref_head *head)
{
mutex_unlock(&head->mutex);
}
void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
struct btrfs_delayed_ref_head *head);
struct btrfs_delayed_ref_head *btrfs_select_ref_head(
struct btrfs_delayed_ref_root *delayed_refs);
int btrfs_check_delayed_seq(struct btrfs_fs_info *fs_info, u64 seq);
btrfs: stop doing excessive space reservation for csum deletion Currently when reserving space for deleting the csum items for a data extent, when adding or updating a delayed ref head, we determine how many leaves of csum items we can have and then pass that number to the helper btrfs_calc_delayed_ref_bytes(). This helper is used for calculating space for all tree modifications we need when running delayed references, however the amount of space it computes is excessive for deleting csum items because: 1) It uses btrfs_calc_insert_metadata_size() which is excessive because we only need to delete csum items from the csum tree, we don't need to insert any items, so btrfs_calc_metadata_size() is all we need (as it computes space needed to delete an item); 2) If the free space tree is enabled, it doubles the amount of space, which is pointless for csum deletion since we don't need to touch the free space tree or any other tree other than the csum tree. So improve on this by tracking how many csum deletions we have and using a new helper to calculate space for csum deletions (just a wrapper around btrfs_calc_metadata_size() with a comment). This reduces the amount of space we need to reserve for csum deletions by a factor of 4, and it helps reduce the number of times we have to block space reservations and have the reclaim task enter the space flushing algorithm (flush delayed items, flush delayed refs, etc) in order to satisfy tickets. For example this results in a total time decrease when unlinking (or truncating) files with many extents, as we end up having to block on space metadata reservations less often. Example test: $ cat test.sh #!/bin/bash DEV=/dev/nullb0 MNT=/mnt/test umount $DEV &> /dev/null mkfs.btrfs -f $DEV # Use compression to quickly create files with a lot of extents # (each with a size of 128K). mount -o compress=lzo $DEV $MNT # 100G gives at least 983040 extents with a size of 128K. xfs_io -f -c "pwrite -S 0xab -b 1M 0 120G" $MNT/foobar # Flush all delalloc and clear all metadata from memory. umount $MNT mount -o compress=lzo $DEV $MNT start=$(date +%s%N) rm -f $MNT/foobar end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "rm took $dur milliseconds" umount $MNT Before this change rm took: 7504 milliseconds After this change rm took: 6574 milliseconds (-12.4%) Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-09 01:20:37 +08:00
void btrfs_delayed_refs_rsv_release(struct btrfs_fs_info *fs_info, int nr_refs, int nr_csums);
void btrfs_update_delayed_refs_rsv(struct btrfs_trans_handle *trans);
void btrfs_inc_delayed_refs_rsv_bg_inserts(struct btrfs_fs_info *fs_info);
void btrfs_dec_delayed_refs_rsv_bg_inserts(struct btrfs_fs_info *fs_info);
void btrfs_inc_delayed_refs_rsv_bg_updates(struct btrfs_fs_info *fs_info);
void btrfs_dec_delayed_refs_rsv_bg_updates(struct btrfs_fs_info *fs_info);
int btrfs_delayed_refs_rsv_refill(struct btrfs_fs_info *fs_info,
enum btrfs_reserve_flush_enum flush);
void btrfs_migrate_to_delayed_refs_rsv(struct btrfs_fs_info *fs_info,
u64 num_bytes);
bool btrfs_check_space_for_delayed_refs(struct btrfs_fs_info *fs_info);
static inline u64 btrfs_delayed_ref_owner(struct btrfs_delayed_ref_node *node)
{
if (node->type == BTRFS_EXTENT_DATA_REF_KEY ||
node->type == BTRFS_SHARED_DATA_REF_KEY)
return node->data_ref.objectid;
return node->tree_ref.level;
}
static inline u64 btrfs_delayed_ref_offset(struct btrfs_delayed_ref_node *node)
{
if (node->type == BTRFS_EXTENT_DATA_REF_KEY ||
node->type == BTRFS_SHARED_DATA_REF_KEY)
return node->data_ref.offset;
return 0;
}
static inline u8 btrfs_ref_type(struct btrfs_ref *ref)
{
ASSERT(ref->type == BTRFS_REF_DATA || ref->type == BTRFS_REF_METADATA);
if (ref->type == BTRFS_REF_DATA) {
if (ref->parent)
return BTRFS_SHARED_DATA_REF_KEY;
else
return BTRFS_EXTENT_DATA_REF_KEY;
} else {
if (ref->parent)
return BTRFS_SHARED_BLOCK_REF_KEY;
else
return BTRFS_TREE_BLOCK_REF_KEY;
}
return 0;
}
Btrfs: do extent allocation and reference count updates in the background The extent allocation tree maintains a reference count and full back reference information for every extent allocated in the filesystem. For subvolume and snapshot trees, every time a block goes through COW, the new copy of the block adds a reference on every block it points to. If a btree node points to 150 leaves, then the COW code needs to go and add backrefs on 150 different extents, which might be spread all over the extent allocation tree. These updates currently happen during btrfs_cow_block, and most COWs happen during btrfs_search_slot. btrfs_search_slot has locks held on both the parent and the node we are COWing, and so we really want to avoid IO during the COW if we can. This commit adds an rbtree of pending reference count updates and extent allocations. The tree is ordered by byte number of the extent and byte number of the parent for the back reference. The tree allows us to: 1) Modify back references in something close to disk order, reducing seeks 2) Significantly reduce the number of modifications made as block pointers are balanced around 3) Do all of the extent insertion and back reference modifications outside of the performance critical btrfs_search_slot code. #3 has the added benefit of greatly reducing the btrfs stack footprint. The extent allocation tree modifications are done without the deep (and somewhat recursive) call chains used in the past. These delayed back reference updates must be done before the transaction commits, and so the rbtree is tied to the transaction. Throttling is implemented to help keep the queue of backrefs at a reasonable size. Since there was a similar mechanism in place for the extent tree extents, that is removed and replaced by the delayed reference tree. Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-13 22:10:06 +08:00
#endif