commit 55ffaabe23
Author: Filipe Manana <fdmanana@suse.com>

Btrfs: avoid unnecessary splits when setting bits on an extent io tree
When attempting to set bits on a range of an extent io tree that already
has those bits set, we can end up splitting an extent state record: we use
the preallocated extent state record, insert it into the red black tree,
do another search on the red black tree, merge the preallocated extent
state record with the previous extent state record, remove that previous
record from the red black tree and then free it. This is all unnecessary
work that consumes time.

This happens specifically in the following branch of __set_extent_bit():

  $ cat -n fs/btrfs/extent_io.c
   957  static int __must_check
   958  __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
  (...)
  1044          /*
  1045           *     | ---- desired range ---- |
  1046           * | state |
  1047           *   or
  1048           * | ------------- state -------------- |
  1049           *
  (...)
  1060          if (state->start < start) {
  1061                  if (state->state & exclusive_bits) {
  1062                          *failed_start = start;
  1063                          err = -EEXIST;
  1064                          goto out;
  1065                  }
  1066
  1067                  prealloc = alloc_extent_state_atomic(prealloc);
  1068                  BUG_ON(!prealloc);
  1069                  err = split_state(tree, state, prealloc, start);
  1070                  if (err)
  1071                          extent_io_tree_panic(tree, err);
  1072
  1073                  prealloc = NULL;

For example, if our extent state record represents the range from 0 to
1MiB, we want to set bits in the range from 128KiB to 256KiB, and the
record already has all those bits set, we end up splitting it, leaving
the tree with extent state records for the ranges from 0 to 128KiB and
from 128KiB to 1MiB. This is temporary, because a subsequent iteration
in that function will end up merging the two records again.

The splitting requires using the preallocated extent state record, so
a future iteration that needs to do another split will need to allocate
another extent state record in an atomic context, something not ideal
that we try to avoid as much as possible. The splitting also requires
an insertion in the red black tree, and a subsequent merge will require
a deletion from the red black tree and freeing an extent state record.

This change simply skips the split of an extent state record when it
already has all the bits we need to set.
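
The idea is roughly the following (a minimal sketch of the new check,
placed before the split in the branch quoted above; 'state', 'bits',
'start' and 'cached_state' are the variables from the quoted function,
and the actual patch may differ in details):

    /*
     * If the existing extent state record already has all the bits we
     * want to set, there is nothing to split or change, so just skip
     * over it and continue with the next record.
     */
    if ((state->state & bits) == bits) {
            start = state->end + 1;
            cache_state(state, cached_state);
            goto search_again;
    }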

Setting a bit that is already set for a range is very common. It happens,
for example, in the inode's 'file_extent_tree' extent io tree, where we
keep setting the EXTENT_DIRTY bit every time we replace an extent.
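
For reference, one such path, as described in the example further below
(a sketch of the call chain based on this message, with simplified
arguments):

  btrfs_insert_clone_extent()
    btrfs_inode_set_file_extent_range()   /* mark the file range dirty */
      set_extent_bits(&inode->file_extent_tree, start, end, EXTENT_DIRTY)
        __set_extent_bit()                /* the code quoted above */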

This change also fixes a bug that surfaces with the recent patchset from
Josef to avoid having implicit holes after a power failure when not using
the NO_HOLES feature, more specifically with the patch with the subject:

  "btrfs: introduce the inode->file_extent_tree"

This patch introduced an extent io tree per inode to keep track of
completed ordered extents and to determine, at any time, the safe value
for the inode's disk_i_size. This assumes that for contiguous
ranges in a file we always end up with a single extent state record in
the io tree, but that is not the case, as there is a short time window
where we can have two extent state records representing contiguous
ranges. When this happens we end up setting an incorrect value for the
inode's disk_i_size, resulting in data loss after a clean unmount of the
filesystem. The following example explains how this can happen.

Suppose we have an inode with an i_size and a disk_i_size of 1MiB, so in
the inode's file_extent_tree we have a single extent state record that
represents the range [0, 1MiB) with the EXTENT_DIRTY bit set. Then the
following steps happen:

1) A buffered write against file range [512KiB, 768KiB) is made. At this
   point delalloc was not flushed yet;

2) Deduplication from some other inode into this inode's range
   [128KiB, 256KiB) is made. This causes btrfs_inode_set_file_extent_range()
   to be called, from btrfs_insert_clone_extent(), to mark the range
   [128KiB, 256KiB) with EXTENT_DIRTY in the inode's file_extent_tree;

3) When btrfs_inode_set_file_extent_range() calls set_extent_bits(), we
   end up at __set_extent_bit(). In the first iteration of that function's
   loop we end up in the following branch:

   $ cat -n fs/btrfs/extent_io.c
    957  static int __must_check
    958  __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
   (...)
   1044          /*
   1045           *     | ---- desired range ---- |
   1046           * | state |
   1047           *   or
   1048           * | ------------- state -------------- |
   1049           *
   (...)
   1060          if (state->start < start) {
   1061                  if (state->state & exclusive_bits) {
   1062                          *failed_start = start;
   1063                          err = -EEXIST;
   1064                          goto out;
   1065                  }
   1066
   1067                  prealloc = alloc_extent_state_atomic(prealloc);
   1068                  BUG_ON(!prealloc);
   1069                  err = split_state(tree, state, prealloc, start);
   1070                  if (err)
   1071                          extent_io_tree_panic(tree, err);
   1072
   1073                  prealloc = NULL;
   (...)
   1089                  goto search_again;

   This splits the state record into two, one for range [0, 128KiB) and
   another for the range [128KiB, 1MiB). Both already have the EXTENT_DIRTY
   bit set. Then we jump to the 'search_again' label, where we unlock the
   spinlock protecting the extent io tree before jumping to the 'again'
   label to perform the next iteration;

4) Meanwhile, delalloc is flushed, the ordered extent for the range
   [512KiB, 768KiB) is created and when it completes, at
   btrfs_finish_ordered_io(), it calls btrfs_inode_safe_disk_i_size_write()
   with a value of 0 for its 'new_size' argument;

5) Before the deduplication task currently at __set_extent_bit() moves to
   the next iteration, the task finishing the ordered extent calls
   find_first_extent_bit() through btrfs_inode_safe_disk_i_size_write()
   and gets 'start' set to 0 and 'end' set to 128KiB - 1, because at this
   moment the io tree has two extent state records, one representing the
   range [0, 128KiB) and another representing the range [128KiB, 1MiB),
   both with EXTENT_DIRTY set. Then we set 'isize' to:

   isize = min(isize, end + 1)
         = min(1MiB, 128KiB - 1 + 1)
         = 128KiB

   Then we set the inode's disk_i_size to 128KiB (isize), as shown in the
   sketch after this example.

   After a clean unmount of the filesystem and mounting it again, the file
   ends up with a size of 128KiB, and we have effectively lost all the data
   it had before in the range from 128KiB to 1MiB.
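
The update described in steps 4) and 5) boils down to something like the
following (a simplified sketch of the logic in
btrfs_inode_safe_disk_i_size_write() as described above, not the exact
kernel code; it assumes that, with a 'new_size' of 0, the current i_size
of 1MiB is used as the starting value):

  u64 start, end;
  u64 isize = i_size_read(&inode->vfs_inode);   /* 1MiB */
  int ret;

  /*
   * With the io tree temporarily holding the two records [0, 128KiB)
   * and [128KiB, 1MiB), this returns only the first one, so we get
   * start == 0 and end == 128KiB - 1.
   */
  ret = find_first_extent_bit(&inode->file_extent_tree, 0, &start, &end,
                              EXTENT_DIRTY, NULL);
  if (!ret && start == 0)
          isize = min(isize, end + 1);   /* min(1MiB, 128KiB) = 128KiB */
  else
          isize = 0;

  inode->disk_i_size = isize;   /* 128KiB, data beyond it is lost */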

This change fixes that issue too, as we never end up splitting extent
state records when they already have all the bits we want set.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>