2018-04-04 01:16:55 +08:00
|
|
|
/* SPDX-License-Identifier: GPL-2.0 */
|
2007-06-12 21:07:21 +08:00
|
|
|
/*
|
|
|
|
* Copyright (C) 2007 Oracle. All rights reserved.
|
|
|
|
*/
|
|
|
|
|
2018-04-04 01:16:55 +08:00
|
|
|
#ifndef BTRFS_CTREE_H
|
|
|
|
#define BTRFS_CTREE_H
|
2007-02-02 22:18:22 +08:00
|
|
|
|
2007-10-16 04:18:55 +08:00
|
|
|
#include <linux/mm.h>
|
2017-02-03 02:15:33 +08:00
|
|
|
#include <linux/sched/signal.h>
|
2007-10-16 04:18:55 +08:00
|
|
|
#include <linux/highmem.h>
|
2007-03-23 00:13:20 +08:00
|
|
|
#include <linux/fs.h>
|
2011-03-08 21:14:00 +08:00
|
|
|
#include <linux/rwsem.h>
|
2013-08-15 23:11:21 +08:00
|
|
|
#include <linux/semaphore.h>
|
2007-08-30 03:47:34 +08:00
|
|
|
#include <linux/completion.h>
|
2008-03-26 22:28:07 +08:00
|
|
|
#include <linux/backing-dev.h>
|
2008-07-18 00:53:50 +08:00
|
|
|
#include <linux/wait.h>
|
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the followings.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 16:04:11 +08:00
|
|
|
#include <linux/slab.h>
|
Btrfs: add initial tracepoint support for btrfs
Tracepoints can provide insight into why btrfs hits bugs and be greatly
helpful for debugging, e.g
dd-7822 [000] 2121.641088: btrfs_inode_request: root = 5(FS_TREE), gen = 4, ino = 256, blocks = 8, disk_i_size = 0, last_trans = 8, logged_trans = 0
dd-7822 [000] 2121.641100: btrfs_inode_new: root = 5(FS_TREE), gen = 8, ino = 257, blocks = 0, disk_i_size = 0, last_trans = 0, logged_trans = 0
btrfs-transacti-7804 [001] 2146.935420: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29368320 (orig_level = 0), cow_buf = 29388800 (cow_level = 0)
btrfs-transacti-7804 [001] 2146.935473: btrfs_cow_block: root = 1(ROOT_TREE), refs = 2, orig_buf = 29364224 (orig_level = 0), cow_buf = 29392896 (cow_level = 0)
btrfs-transacti-7804 [001] 2146.972221: btrfs_transaction_commit: root = 1(ROOT_TREE), gen = 8
flush-btrfs-2-7821 [001] 2155.824210: btrfs_chunk_alloc: root = 3(CHUNK_TREE), offset = 1103101952, size = 1073741824, num_stripes = 1, sub_stripes = 0, type = DATA
flush-btrfs-2-7821 [001] 2155.824241: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29388800 (orig_level = 0), cow_buf = 29396992 (cow_level = 0)
flush-btrfs-2-7821 [001] 2155.824255: btrfs_cow_block: root = 4(DEV_TREE), refs = 2, orig_buf = 29372416 (orig_level = 0), cow_buf = 29401088 (cow_level = 0)
flush-btrfs-2-7821 [000] 2155.824329: btrfs_cow_block: root = 3(CHUNK_TREE), refs = 2, orig_buf = 20971520 (orig_level = 0), cow_buf = 20975616 (cow_level = 0)
btrfs-endio-wri-7800 [001] 2155.898019: btrfs_cow_block: root = 5(FS_TREE), refs = 2, orig_buf = 29384704 (orig_level = 0), cow_buf = 29405184 (cow_level = 0)
btrfs-endio-wri-7800 [001] 2155.898043: btrfs_cow_block: root = 7(CSUM_TREE), refs = 2, orig_buf = 29376512 (orig_level = 0), cow_buf = 29409280 (cow_level = 0)
Here is what I have added:
1) ordere_extent:
btrfs_ordered_extent_add
btrfs_ordered_extent_remove
btrfs_ordered_extent_start
btrfs_ordered_extent_put
These provide critical information to understand how ordered_extents are
updated.
2) extent_map:
btrfs_get_extent
extent_map is used in both read and write cases, and it is useful for tracking
how btrfs specific IO is running.
3) writepage:
__extent_writepage
btrfs_writepage_end_io_hook
Pages are cirtical resourses and produce a lot of corner cases during writeback,
so it is valuable to know how page is written to disk.
4) inode:
btrfs_inode_new
btrfs_inode_request
btrfs_inode_evict
These can show where and when a inode is created, when a inode is evicted.
5) sync:
btrfs_sync_file
btrfs_sync_fs
These show sync arguments.
6) transaction:
btrfs_transaction_commit
In transaction based filesystem, it will be useful to know the generation and
who does commit.
7) back reference and cow:
btrfs_delayed_tree_ref
btrfs_delayed_data_ref
btrfs_delayed_ref_head
btrfs_cow_block
Btrfs natively supports back references, these tracepoints are helpful on
understanding btrfs's COW mechanism.
8) chunk:
btrfs_chunk_alloc
btrfs_chunk_free
Chunk is a link between physical offset and logical offset, and stands for space
infomation in btrfs, and these are helpful on tracing space things.
9) reserved_extent:
btrfs_reserved_extent_alloc
btrfs_reserved_extent_free
These can show how btrfs uses its space.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-24 19:18:59 +08:00
|
|
|
#include <trace/events/btrfs.h>
|
2019-05-22 16:18:59 +08:00
|
|
|
#include <asm/unaligned.h>
|
2011-09-22 03:05:58 +08:00
|
|
|
#include <linux/pagemap.h>
|
2013-01-29 14:04:50 +08:00
|
|
|
#include <linux/btrfs.h>
|
2016-04-02 04:14:29 +08:00
|
|
|
#include <linux/btrfs_tree.h>
|
Btrfs: reclaim the reserved metadata space at background
Before applying this patch, the task had to reclaim the metadata space
by itself if the metadata space was not enough. And When the task started
the space reclamation, all the other tasks which wanted to reserve the
metadata space were blocked. At some cases, they would be blocked for
a long time, it made the performance fluctuate wildly.
So we introduce the background metadata space reclamation, when the space
is about to be exhausted, we insert a reclaim work into the workqueue, the
worker of the workqueue helps us to reclaim the reserved space at the
background. By this way, the tasks needn't reclaim the space by themselves at
most cases, and even if the tasks have to reclaim the space or are blocked
for the space reclamation, they will get enough space more quickly.
Here is my test result(Tested by compilebench):
Memory: 2GB
CPU: 2Cores * 1CPU
Partition: 40GB(SSD)
Test command:
# compilebench -D <mnt> -m
Without this patch:
intial create total runs 30 avg 54.36 MB/s (user 0.52s sys 2.44s)
compile total runs 30 avg 123.72 MB/s (user 0.13s sys 1.17s)
read compiled tree total runs 3 avg 81.15 MB/s (user 0.74s sys 4.89s)
delete compiled tree total runs 30 avg 5.32 seconds (user 0.35s sys 4.37s)
With this patch:
intial create total runs 30 avg 59.80 MB/s (user 0.52s sys 2.53s)
compile total runs 30 avg 151.44 MB/s (user 0.13s sys 1.11s)
read compiled tree total runs 3 avg 83.25 MB/s (user 0.76s sys 4.91s)
delete compiled tree total runs 30 avg 5.29 seconds (user 0.34s sys 4.34s)
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-05-14 08:29:04 +08:00
|
|
|
#include <linux/workqueue.h>
|
2014-09-23 13:40:08 +08:00
|
|
|
#include <linux/security.h>
|
2015-12-15 00:42:10 +08:00
|
|
|
#include <linux/sizes.h>
|
2016-09-01 11:55:33 +08:00
|
|
|
#include <linux/dynamic_debug.h>
|
2017-03-03 16:55:14 +08:00
|
|
|
#include <linux/refcount.h>
|
btrfs: Remove custom crc32c init code
The custom crc32 init code was introduced in
14a958e678cd ("Btrfs: fix btrfs boot when compiled as built-in") to
enable using btrfs as a built-in. However, later as pointed out by
60efa5eb2e88 ("Btrfs: use late_initcall instead of module_init") this
wasn't enough and finally btrfs was switched to late_initcall which
comes after the generic crc32c implementation is initiliased. The
latter commit superseeded the former. Now that we don't have to
maintain our own code let's just remove it and switch to using the
generic implementation.
Despite touching a lot of files the patch is really simple. Here is the gist of
the changes:
1. Select LIBCRC32C rather than the low-level modules.
2. s/btrfs_crc32c/crc32c/g
3. replace hash.h with linux/crc32c.h
4. Move the btrfs namehash funcs to ctree.h and change the tree accordingly.
I've tested this with btrfs being both a module and a built-in and xfstest
doesn't complain.
Does seem to fix the longstanding problem of not automatically selectiong
the crc32c module when btrfs is used. Possibly there is a workaround in
dracut.
The modinfo confirms that now all the module dependencies are there:
before:
depends: zstd_compress,zstd_decompress,raid6_pq,xor,zlib_deflate
after:
depends: libcrc32c,zstd_compress,zstd_decompress,raid6_pq,xor,zlib_deflate
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add more info to changelog from mails ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-08 17:45:05 +08:00
|
|
|
#include <linux/crc32c.h>
|
2020-09-25 00:39:12 +08:00
|
|
|
#include <linux/iomap.h>
|
2022-10-21 00:58:26 +08:00
|
|
|
#include <linux/fscrypt.h>
|
2019-09-23 22:05:19 +08:00
|
|
|
#include "extent-io-tree.h"
|
2008-01-25 05:13:08 +08:00
|
|
|
#include "extent_io.h"
|
2007-10-16 04:14:19 +08:00
|
|
|
#include "extent_map.h"
|
2008-06-12 04:50:36 +08:00
|
|
|
#include "async-thread.h"
|
2019-06-20 01:47:17 +08:00
|
|
|
#include "block-rsv.h"
|
2020-01-30 20:59:44 +08:00
|
|
|
#include "locking.h"
|
2022-09-09 23:27:45 +08:00
|
|
|
#include "misc.h"
|
2007-03-23 00:13:20 +08:00
|
|
|
|
2007-03-17 04:20:31 +08:00
|
|
|
struct btrfs_trans_handle;
|
2007-03-23 03:59:16 +08:00
|
|
|
struct btrfs_transaction;
|
2010-05-16 22:48:46 +08:00
|
|
|
struct btrfs_pending_snapshot;
|
2018-11-22 03:05:41 +08:00
|
|
|
struct btrfs_delayed_ref_root;
|
2019-06-19 04:09:16 +08:00
|
|
|
struct btrfs_space_info;
|
2019-10-30 02:20:18 +08:00
|
|
|
struct btrfs_block_group;
|
2008-07-18 00:53:50 +08:00
|
|
|
struct btrfs_ordered_sum;
|
2019-04-04 14:45:35 +08:00
|
|
|
struct btrfs_ref;
|
2021-09-15 15:17:18 +08:00
|
|
|
struct btrfs_bio;
|
btrfs: add BTRFS_IOC_ENCODED_READ ioctl
There are 4 main cases:
1. Inline extents: we copy the data straight out of the extent buffer.
2. Hole/preallocated extents: we fill in zeroes.
3. Regular, uncompressed extents: we read the sectors we need directly
from disk.
4. Regular, compressed extents: we read the entire compressed extent
from disk and indicate what subset of the decompressed extent is in
the file.
This initial implementation simplifies a few things that can be improved
in the future:
- Cases 1, 3, and 4 allocate temporary memory to read into before
copying out to userspace.
- We don't do read repair, because it turns out that read repair is
currently broken for compressed data.
- We hold the inode lock during the operation.
Note that we don't need to hold the mmap lock. We may race with
btrfs_page_mkwrite() and read the old data from before the page was
dirtied:
btrfs_page_mkwrite btrfs_encoded_read
---------------------------------------------------
(enter) (enter)
btrfs_wait_ordered_range
lock_extent_bits
btrfs_page_set_dirty
unlock_extent_cached
(exit)
lock_extent_bits
read extent (dirty page hasn't been flushed,
so this is the old data)
unlock_extent_cached
(exit)
we read the old data from before the page was dirtied. But, that's true
even if we were to hold the mmap lock:
btrfs_page_mkwrite btrfs_encoded_read
-------------------------------------------------------------------
(enter) (enter)
btrfs_inode_lock(BTRFS_ILOCK_MMAP)
down_read(i_mmap_lock) (blocked)
btrfs_wait_ordered_range
lock_extent_bits
read extent (page hasn't been dirtied,
so this is the old data)
unlock_extent_cached
btrfs_inode_unlock(BTRFS_ILOCK_MMAP)
down_read(i_mmap_lock) returns
lock_extent_bits
btrfs_page_set_dirty
unlock_extent_cached
In other words, this is inherently racy, so it's fine that we return the
old data in this tiny window.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-10-10 08:59:07 +08:00
|
|
|
struct btrfs_ioctl_encoded_io_args;
|
2022-09-15 07:04:44 +08:00
|
|
|
struct btrfs_device;
|
|
|
|
struct btrfs_fs_devices;
|
|
|
|
struct btrfs_balance_control;
|
|
|
|
struct btrfs_delayed_root;
|
|
|
|
struct reloc_control;
|
2007-03-17 04:20:31 +08:00
|
|
|
|
2018-03-07 17:29:18 +08:00
|
|
|
#define BTRFS_OLDEST_GENERATION 0ULL
|
|
|
|
|
2007-12-13 03:38:19 +08:00
|
|
|
#define BTRFS_EMPTY_DIR_SIZE 0
|
2007-03-30 03:15:27 +08:00
|
|
|
|
2015-12-15 00:42:10 +08:00
|
|
|
#define BTRFS_DIRTY_METADATA_THRESH SZ_32M
|
2013-01-29 18:09:20 +08:00
|
|
|
|
2015-12-15 00:42:10 +08:00
|
|
|
#define BTRFS_MAX_EXTENT_SIZE SZ_128M
|
2015-02-12 04:08:59 +08:00
|
|
|
|
2008-03-25 03:01:56 +08:00
|
|
|
static inline unsigned long btrfs_chunk_item_size(int num_stripes)
|
|
|
|
{
|
|
|
|
BUG_ON(num_stripes == 0);
|
|
|
|
return sizeof(struct btrfs_chunk) +
|
|
|
|
sizeof(struct btrfs_stripe) * (num_stripes - 1);
|
|
|
|
}
|
|
|
|
|
2021-10-21 07:44:47 +08:00
|
|
|
#define BTRFS_SUPER_INFO_OFFSET SZ_64K
|
|
|
|
#define BTRFS_SUPER_INFO_SIZE 4096
|
2022-09-14 23:06:28 +08:00
|
|
|
static_assert(sizeof(struct btrfs_super_block) == BTRFS_SUPER_INFO_SIZE);
|
2021-10-21 07:44:47 +08:00
|
|
|
|
2022-06-13 15:06:34 +08:00
|
|
|
/*
|
|
|
|
* The reserved space at the beginning of each device.
|
|
|
|
* It covers the primary super block and leaves space for potential use by other
|
|
|
|
* tools like bootloaders or to lower potential damage of accidental overwrite.
|
|
|
|
*/
|
|
|
|
#define BTRFS_DEVICE_RANGE_RESERVED (SZ_1M)
|
|
|
|
|
2021-03-31 18:56:21 +08:00
|
|
|
/* Read ahead values for struct btrfs_path.reada */
|
|
|
|
enum {
|
|
|
|
READA_NONE,
|
|
|
|
READA_BACK,
|
|
|
|
READA_FORWARD,
|
|
|
|
/*
|
|
|
|
* Similar to READA_FORWARD but unlike it:
|
|
|
|
*
|
|
|
|
* 1) It will trigger readahead even for leaves that are not close to
|
|
|
|
* each other on disk;
|
|
|
|
* 2) It also triggers readahead for nodes;
|
|
|
|
* 3) During a search, even when a node or leaf is already in memory, it
|
|
|
|
* will still trigger readahead for other nodes and leaves that follow
|
|
|
|
* it.
|
|
|
|
*
|
|
|
|
* This is meant to be used only when we know we are iterating over the
|
|
|
|
* entire tree or a very large part of it.
|
|
|
|
*/
|
|
|
|
READA_FORWARD_ALWAYS,
|
|
|
|
};
|
|
|
|
|
2007-02-26 23:40:21 +08:00
|
|
|
/*
|
2007-03-13 22:46:10 +08:00
|
|
|
* btrfs_paths remember the path taken from the root down to the leaf.
|
|
|
|
* level 0 is always the leaf, and nodes[1...BTRFS_MAX_LEVEL] will point
|
2007-02-26 23:40:21 +08:00
|
|
|
* to any other levels that are present.
|
|
|
|
*
|
|
|
|
* The slots array records the index of the item or block pointer
|
|
|
|
* used while walking the tree.
|
|
|
|
*/
|
2007-03-13 22:46:10 +08:00
|
|
|
struct btrfs_path {
|
2007-10-16 04:14:19 +08:00
|
|
|
struct extent_buffer *nodes[BTRFS_MAX_LEVEL];
|
2007-03-13 22:46:10 +08:00
|
|
|
int slots[BTRFS_MAX_LEVEL];
|
2008-06-26 04:01:30 +08:00
|
|
|
/* if there is real range locking, this locks field will change */
|
2015-11-27 23:31:45 +08:00
|
|
|
u8 locks[BTRFS_MAX_LEVEL];
|
2015-11-27 23:31:38 +08:00
|
|
|
u8 reada;
|
2008-06-26 04:01:30 +08:00
|
|
|
/* keep some upper locks as we walk down */
|
2015-11-27 23:31:42 +08:00
|
|
|
u8 lowest_level;
|
2008-12-10 22:10:46 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* set by btrfs_split_item, tells search_slot to keep all locks
|
|
|
|
* and to force calls to keep space in the nodes
|
|
|
|
*/
|
2009-03-13 23:00:37 +08:00
|
|
|
unsigned int search_for_split:1;
|
|
|
|
unsigned int keep_locks:1;
|
|
|
|
unsigned int skip_locking:1;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
unsigned int search_commit_root:1;
|
2014-03-29 05:16:01 +08:00
|
|
|
unsigned int need_commit_sem:1;
|
2014-11-09 16:38:39 +08:00
|
|
|
unsigned int skip_release_on_error:1;
|
btrfs: correctly calculate item size used when item key collision happens
Item key collision is allowed for some item types, like dir item and
inode refs, but the overall item size is limited by the nodesize.
item size(ins_len) passed from btrfs_insert_empty_items to
btrfs_search_slot already contains size of btrfs_item.
When btrfs_search_slot reaches leaf, we'll see if we need to split leaf.
The check incorrectly reports that split leaf is required, because
it treats the space required by the newly inserted item as
btrfs_item + item data. But in item key collision case, only item data
is actually needed, the newly inserted item could merge into the existing
one. No new btrfs_item will be inserted.
And split_leaf return EOVERFLOW from following code:
if (extend && data_size + btrfs_item_size_nr(l, slot) +
sizeof(struct btrfs_item) > BTRFS_LEAF_DATA_SIZE(fs_info))
return -EOVERFLOW;
In most cases, when callers receive EOVERFLOW, they either return
this error or handle in different ways. For example, in normal dir item
creation the userspace will get errno EOVERFLOW; in inode ref case
INODE_EXTREF is used instead.
However, this is not the case for rename. To avoid the unrecoverable
situation in rename, btrfs_check_dir_item_collision is called in
early phase of rename. In this function, when item key collision is
detected leaf space is checked:
data_size = sizeof(*di) + name_len;
if (data_size + btrfs_item_size_nr(leaf, slot) +
sizeof(struct btrfs_item) > BTRFS_LEAF_DATA_SIZE(root->fs_info))
the sizeof(struct btrfs_item) + btrfs_item_size_nr(leaf, slot) here
refers to existing item size, the condition here correctly calculates
the needed size for collision case rather than the wrong case above.
The consequence of inconsistent condition check between
btrfs_check_dir_item_collision and btrfs_search_slot when item key
collision happens is that we might pass check here but fail
later at btrfs_search_slot. Rename fails and volume is forced readonly
[436149.586170] ------------[ cut here ]------------
[436149.586173] BTRFS: Transaction aborted (error -75)
[436149.586196] WARNING: CPU: 0 PID: 16733 at fs/btrfs/inode.c:9870 btrfs_rename2+0x1938/0x1b70 [btrfs]
[436149.586227] CPU: 0 PID: 16733 Comm: python Tainted: G D 4.18.0-rc5+ #1
[436149.586228] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016
[436149.586238] RIP: 0010:btrfs_rename2+0x1938/0x1b70 [btrfs]
[436149.586254] RSP: 0018:ffffa327043a7ce0 EFLAGS: 00010286
[436149.586255] RAX: 0000000000000000 RBX: ffff8d8a17d13340 RCX: 0000000000000006
[436149.586256] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff8d8a7fc164b0
[436149.586257] RBP: ffffa327043a7da0 R08: 0000000000000560 R09: 7265282064657472
[436149.586258] R10: 0000000000000000 R11: 6361736e61725420 R12: ffff8d8a0d4c8b08
[436149.586258] R13: ffff8d8a17d13340 R14: ffff8d8a33e0a540 R15: 00000000000001fe
[436149.586260] FS: 00007fa313933740(0000) GS:ffff8d8a7fc00000(0000) knlGS:0000000000000000
[436149.586261] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[436149.586262] CR2: 000055d8d9c9a720 CR3: 000000007aae0003 CR4: 00000000003606f0
[436149.586295] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[436149.586296] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[436149.586296] Call Trace:
[436149.586311] vfs_rename+0x383/0x920
[436149.586313] ? vfs_rename+0x383/0x920
[436149.586315] do_renameat2+0x4ca/0x590
[436149.586317] __x64_sys_rename+0x20/0x30
[436149.586324] do_syscall_64+0x5a/0x120
[436149.586330] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[436149.586332] RIP: 0033:0x7fa3133b1d37
[436149.586348] RSP: 002b:00007fffd3e43908 EFLAGS: 00000246 ORIG_RAX: 0000000000000052
[436149.586349] RAX: ffffffffffffffda RBX: 00007fa3133b1d30 RCX: 00007fa3133b1d37
[436149.586350] RDX: 000055d8da06b5e0 RSI: 000055d8da225d60 RDI: 000055d8da2c4da0
[436149.586351] RBP: 000055d8da2252f0 R08: 00007fa313782000 R09: 00000000000177e0
[436149.586351] R10: 000055d8da010680 R11: 0000000000000246 R12: 00007fa313840b00
Thanks to Hans van Kranenburg for information about crc32 hash collision
tools, I was able to reproduce the dir item collision with following
python script.
https://github.com/wutzuchieh/misc_tools/blob/master/crc32_forge.py Run
it under a btrfs volume will trigger the abort transaction. It simply
creates files and rename them to forged names that leads to
hash collision.
There are two ways to fix this. One is to simply revert the patch
878f2d2cb355 ("Btrfs: fix max dir item size calculation") to make the
condition consistent although that patch is correct about the size.
The other way is to handle the leaf space check correctly when
collision happens. I prefer the second one since it correct leaf
space check in collision case. This fix will not account
sizeof(struct btrfs_item) when the item already exists.
There are two places where ins_len doesn't contain
sizeof(struct btrfs_item), however.
1. extent-tree.c: lookup_inline_extent_backref
2. file-item.c: btrfs_csum_file_blocks
to make the logic of btrfs_search_slot more clear, we add a flag
search_for_extension in btrfs_path.
This flag indicates that ins_len passed to btrfs_search_slot doesn't
contain sizeof(struct btrfs_item). When key exists, btrfs_search_slot
will use the actual size needed to calculate the required leaf space.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: ethanwu <ethanwu@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-01 17:25:12 +08:00
|
|
|
/*
|
|
|
|
* Indicate that new item (btrfs_search_slot) is extending already
|
|
|
|
* existing item and ins_len contains only the data size and not item
|
|
|
|
* header (ie. sizeof(struct btrfs_item) is not included).
|
|
|
|
*/
|
|
|
|
unsigned int search_for_extension:1;
|
2022-09-13 03:27:42 +08:00
|
|
|
/* Stop search if any locks need to be taken (for read) */
|
|
|
|
unsigned int nowait:1;
|
2007-02-02 22:18:22 +08:00
|
|
|
};
|
2022-09-15 07:04:46 +08:00
|
|
|
|
2012-11-06 00:26:40 +08:00
|
|
|
struct btrfs_dev_replace {
|
|
|
|
u64 replace_state; /* see #define above */
|
2018-06-12 19:48:25 +08:00
|
|
|
time64_t time_started; /* seconds since 1-Jan-1970 */
|
|
|
|
time64_t time_stopped; /* seconds since 1-Jan-1970 */
|
2012-11-06 00:26:40 +08:00
|
|
|
atomic64_t num_write_errors;
|
|
|
|
atomic64_t num_uncorrectable_read_errors;
|
|
|
|
|
|
|
|
u64 cursor_left;
|
|
|
|
u64 committed_cursor_left;
|
|
|
|
u64 cursor_left_last_write_of_item;
|
|
|
|
u64 cursor_right;
|
|
|
|
|
|
|
|
u64 cont_reading_from_srcdev_mode; /* see #define above */
|
|
|
|
|
|
|
|
int is_valid;
|
|
|
|
int item_needs_writeback;
|
|
|
|
struct btrfs_device *srcdev;
|
|
|
|
struct btrfs_device *tgtdev;
|
|
|
|
|
|
|
|
struct mutex lock_finishing_cancel_unmount;
|
2018-04-05 07:29:24 +08:00
|
|
|
struct rw_semaphore rwsem;
|
2012-11-06 00:26:40 +08:00
|
|
|
|
|
|
|
struct btrfs_scrub_progress scrub_progress;
|
2018-04-05 07:04:49 +08:00
|
|
|
|
|
|
|
struct percpu_counter bio_counter;
|
|
|
|
wait_queue_head_t replace_wait;
|
2012-11-06 00:26:40 +08:00
|
|
|
};
|
|
|
|
|
2009-04-03 21:47:43 +08:00
|
|
|
/*
|
|
|
|
* free clusters are used to claim free space in relatively large chunks,
|
btrfs: Do not use data_alloc_cluster in ssd mode
This patch provides a band aid to improve the 'out of the box'
behaviour of btrfs for disks that are detected as being an ssd. In a
general purpose mixed workload scenario, the current ssd mode causes
overallocation of available raw disk space for data, while leaving
behind increasing amounts of unused fragmented free space. This
situation leads to early ENOSPC problems which are harming user
experience and adoption of btrfs as a general purpose filesystem.
This patch modifies the data extent allocation behaviour of the ssd mode
to make it behave identical to nossd mode. The metadata behaviour and
additional ssd_spread option stay untouched so far.
Recommendations for future development are to reconsider the current
oversimplified nossd / ssd distinction and the broken detection
mechanism based on the rotational attribute in sysfs and provide
experienced users with a more flexible way to choose allocator behaviour
for data and metadata, optimized for certain use cases, while keeping
sane 'out of the box' default settings. The internals of the current
btrfs code have more potential than what currently gets exposed to the
user to choose from.
The SSD story...
In the first year of btrfs development, around early 2008, btrfs
gained a mount option which enables specific functionality for
filesystems on solid state devices. The first occurance of this
functionality is in commit e18e4809, labeled "Add mount -o ssd, which
includes optimizations for seek free storage".
The effect on allocating free space for doing (data) writes is to
'cluster' writes together, writing them out in contiguous space, as
opposed to a 'tetris' way of putting all separate writes into any free
space fragment that fits (which is what the -o nossd behaviour does).
A somewhat simplified explanation of what happens is that, when for
example, the 'cluster' size is set to 2MiB, when we do some writes, the
data allocator will search for a free space block that is 2MiB big, and
put the writes in there. The ssd mode itself might allow a 2MiB cluster
to be composed of multiple free space extents with some existing data in
between, while the additional ssd_spread mount option kills off this
option and requires fully free space.
The idea behind this is (commit 536ac8ae): "The [...] clusters make it
more likely a given IO will completely overwrite the ssd block, so it
doesn't have to do an internal rwm cycle."; ssd block meaning nand erase
block. So, effectively this means applying a "locality based algorithm"
and trying to outsmart the actual ssd.
Since then, various changes have been made to the involved code, but the
basic idea is still present, and gets activated whenever the ssd mount
option is active. This also happens by default, when the rotational flag
as seen at /sys/block/<device>/queue/rotational is set to 0.
However, there's a number of problems with this approach.
First, what the optimization is trying to do is outsmart the ssd by
assuming there is a relation between the physical address space of the
block device as seen by btrfs and the actual physical storage of the
ssd, and then adjusting data placement. However, since the introduction
of the Flash Translation Layer (FTL) which is a part of the internal
controller of an ssd, these attempts are futile. The use of good quality
FTL in consumer ssd products might have been limited in 2008, but this
situation has changed drastically soon after that time. Today, even the
flash memory in your automatic cat feeding machine or your grandma's
wheelchair has a full featured one.
Second, the behaviour as described above results in the filesystem being
filled up with badly fragmented free space extents because of relatively
small pieces of space that are freed up by deletes, but not selected
again as part of a 'cluster'. Since the algorithm prefers allocating a
new chunk over going back to tetris mode, the end result is a filesystem
in which all raw space is allocated, but which is composed of
underutilized chunks with a 'shotgun blast' pattern of fragmented free
space. Usually, the next problematic thing that happens is the
filesystem wanting to allocate new space for metadata, which causes the
filesystem to fail in spectacular ways.
Third, the default mount options you get for an ssd ('ssd' mode enabled,
'discard' not enabled), in combination with spreading out writes over
the full address space and ignoring freed up space leads to worst case
behaviour in providing information to the ssd itself, since it will
never learn that all the free space left behind is actually free. There
are two ways to let an ssd know previously written data does not have to
be preserved, which are sending explicit signals using discard or
fstrim, or by simply overwriting the space with new data. The worst
case behaviour is the btrfs ssd_spread mount option in combination with
not having discard enabled. It has a side effect of minimizing the reuse
of free space previously written in.
Fourth, the rotational flag in /sys/ does not reliably indicate if the
device is a locally attached ssd. For example, iSCSI or NBD displays as
non-rotational, while a loop device on an ssd shows up as rotational.
The combination of the second and third problem effectively means that
despite all the good intentions, the btrfs ssd mode reliably causes the
ssd hardware and the filesystem structures and performance to be choked
to death. The clickbait version of the title of this story would have
been "Btrfs ssd optimizations considered harmful for ssds".
The current nossd 'tetris' mode (even still without discard) allows a
pattern of overwriting much more previously used space, causing many
more implicit discards to happen because of the overwrite information
the ssd gets. The actual location in the physical address space, as seen
from the point of view of btrfs is irrelevant, because the actual writes
to the low level flash are reordered anyway thanks to the FTL.
Changes made in the code
1. Make ssd mode data allocation identical to tetris mode, like nossd.
2. Adjust and clean up filesystem mount messages so that we can easily
identify if a kernel has this patch applied or not, when providing
support to end users. Also, make better use of the *_and_info helpers to
only trigger messages on actual state changes.
Backporting notes
Notes for whoever wants to backport this patch to their 4.9 LTS kernel:
* First apply commit 951e7966 "btrfs: drop the nossd flag when
remounting with -o ssd", or fixup the differences manually.
* The rest of the conflicts are because of the fs_info refactoring. So,
for example, instead of using fs_info, it's root->fs_info in
extent-tree.c
Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-07-28 14:31:28 +08:00
|
|
|
* allowing us to do less seeky writes. They are used for all metadata
|
|
|
|
* allocations. In ssd_spread mode they are also used for data allocations.
|
2009-04-03 21:47:43 +08:00
|
|
|
*/
|
|
|
|
struct btrfs_free_cluster {
|
|
|
|
spinlock_t lock;
|
|
|
|
spinlock_t refill_lock;
|
|
|
|
struct rb_root root;
|
|
|
|
|
|
|
|
/* largest extent in this cluster */
|
|
|
|
u64 max_size;
|
|
|
|
|
|
|
|
/* first extent starting offset */
|
|
|
|
u64 window_start;
|
|
|
|
|
2015-10-03 03:25:10 +08:00
|
|
|
/* We did a full search and couldn't create a cluster */
|
|
|
|
bool fragmented;
|
|
|
|
|
2019-10-30 02:20:18 +08:00
|
|
|
struct btrfs_block_group *block_group;
|
2009-04-03 21:47:43 +08:00
|
|
|
/*
|
|
|
|
* when a cluster is allocated from a block group, we put the
|
|
|
|
* cluster onto a list in the block group so that it can
|
|
|
|
* be freed before the block group is freed.
|
|
|
|
*/
|
|
|
|
struct list_head block_group_list;
|
2008-03-25 03:01:59 +08:00
|
|
|
};
|
|
|
|
|
2019-12-14 08:22:14 +08:00
|
|
|
/* Discard control. */
|
|
|
|
/*
|
|
|
|
* Async discard uses multiple lists to differentiate the discard filter
|
btrfs: handle empty block_group removal for async discard
block_group removal is a little tricky. It can race with the extent
allocator, the cleaner thread, and balancing. The current path is for a
block_group to be added to the unused_bgs list. Then, when the cleaner
thread comes around, it starts a transaction and then proceeds with
removing the block_group. Extents that are pinned are subsequently
removed from the pinned trees and then eventually a discard is issued
for the entire block_group.
Async discard introduces another player into the game, the discard
workqueue. While it has none of the racing issues, the new problem is
ensuring we don't leave free space untrimmed prior to forgetting the
block_group. This is handled by placing fully free block_groups on a
separate discard queue. This is necessary to maintain discarding order
as in the future we will slowly trim even fully free block_groups. The
ordering helps us make progress on the same block_group rather than say
the last fully freed block_group or needing to search through the fully
freed block groups at the beginning of a list and insert after.
The new order of events is a fully freed block group gets placed on the
unused discard queue first. Once it's processed, it will be placed on
the unusued_bgs list and then the original sequence of events will
happen, just without the final whole block_group discard.
The mount flags can change when processing unused_bgs, so when flipping
from DISCARD to DISCARD_ASYNC, the unused_bgs must be punted to the
discard_list to be trimmed. If we flip off DISCARD_ASYNC, we punt
free block groups on the discard_list to the unused_bg queue which will
do the final discard for us.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-14 08:22:15 +08:00
|
|
|
* parameters. Index 0 is for completely free block groups where we need to
|
|
|
|
* ensure the entire block group is trimmed without being lossy. Indices
|
|
|
|
* afterwards represent monotonically decreasing discard filter sizes to
|
|
|
|
* prioritize what should be discarded next.
|
2019-12-14 08:22:14 +08:00
|
|
|
*/
|
2020-01-03 05:26:39 +08:00
|
|
|
#define BTRFS_NR_DISCARD_LISTS 3
|
btrfs: handle empty block_group removal for async discard
block_group removal is a little tricky. It can race with the extent
allocator, the cleaner thread, and balancing. The current path is for a
block_group to be added to the unused_bgs list. Then, when the cleaner
thread comes around, it starts a transaction and then proceeds with
removing the block_group. Extents that are pinned are subsequently
removed from the pinned trees and then eventually a discard is issued
for the entire block_group.
Async discard introduces another player into the game, the discard
workqueue. While it has none of the racing issues, the new problem is
ensuring we don't leave free space untrimmed prior to forgetting the
block_group. This is handled by placing fully free block_groups on a
separate discard queue. This is necessary to maintain discarding order
as in the future we will slowly trim even fully free block_groups. The
ordering helps us make progress on the same block_group rather than say
the last fully freed block_group or needing to search through the fully
freed block groups at the beginning of a list and insert after.
The new order of events is a fully freed block group gets placed on the
unused discard queue first. Once it's processed, it will be placed on
the unusued_bgs list and then the original sequence of events will
happen, just without the final whole block_group discard.
The mount flags can change when processing unused_bgs, so when flipping
from DISCARD to DISCARD_ASYNC, the unused_bgs must be punted to the
discard_list to be trimmed. If we flip off DISCARD_ASYNC, we punt
free block groups on the discard_list to the unused_bg queue which will
do the final discard for us.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-14 08:22:15 +08:00
|
|
|
#define BTRFS_DISCARD_INDEX_UNUSED 0
|
|
|
|
#define BTRFS_DISCARD_INDEX_START 1
|
2019-12-14 08:22:14 +08:00
|
|
|
|
|
|
|
struct btrfs_discard_ctl {
|
|
|
|
struct workqueue_struct *discard_workers;
|
|
|
|
struct delayed_work work;
|
|
|
|
spinlock_t lock;
|
|
|
|
struct btrfs_block_group *block_group;
|
|
|
|
struct list_head discard_list[BTRFS_NR_DISCARD_LISTS];
|
2020-01-03 05:26:36 +08:00
|
|
|
u64 prev_discard;
|
2020-11-04 17:45:53 +08:00
|
|
|
u64 prev_discard_time;
|
2019-12-14 08:22:20 +08:00
|
|
|
atomic_t discardable_extents;
|
2019-12-14 08:22:21 +08:00
|
|
|
atomic64_t discardable_bytes;
|
2020-01-03 05:26:38 +08:00
|
|
|
u64 max_discard_size;
|
2020-11-04 17:45:52 +08:00
|
|
|
u64 delay_ms;
|
2020-01-03 05:26:35 +08:00
|
|
|
u32 iops_limit;
|
2020-01-03 05:26:36 +08:00
|
|
|
u32 kbps_limit;
|
2020-01-03 05:26:41 +08:00
|
|
|
u64 discard_extent_bytes;
|
|
|
|
u64 discard_bitmap_bytes;
|
|
|
|
atomic64_t discard_bytes_saved;
|
2019-12-14 08:22:14 +08:00
|
|
|
};
|
|
|
|
|
2020-08-25 23:02:32 +08:00
|
|
|
/*
|
|
|
|
* Exclusive operations (device replace, resize, device add/remove, balance)
|
|
|
|
*/
|
|
|
|
enum btrfs_exclusive_operation {
|
|
|
|
BTRFS_EXCLOP_NONE,
|
2021-11-25 17:14:41 +08:00
|
|
|
BTRFS_EXCLOP_BALANCE_PAUSED,
|
2020-08-25 23:02:32 +08:00
|
|
|
BTRFS_EXCLOP_BALANCE,
|
|
|
|
BTRFS_EXCLOP_DEV_ADD,
|
|
|
|
BTRFS_EXCLOP_DEV_REMOVE,
|
|
|
|
BTRFS_EXCLOP_DEV_REPLACE,
|
|
|
|
BTRFS_EXCLOP_RESIZE,
|
|
|
|
BTRFS_EXCLOP_SWAP_ACTIVATE,
|
|
|
|
};
|
|
|
|
|
2022-06-15 06:22:32 +08:00
|
|
|
/* Store data about transaction commits, exported via sysfs. */
|
|
|
|
struct btrfs_commit_stats {
|
|
|
|
/* Total number of commits */
|
|
|
|
u64 commit_count;
|
|
|
|
/* The maximum commit duration so far in ns */
|
|
|
|
u64 max_commit_dur;
|
|
|
|
/* The last commit duration in ns */
|
|
|
|
u64 last_commit_dur;
|
|
|
|
/* The total commit duration in ns */
|
|
|
|
u64 total_commit_dur;
|
|
|
|
};
|
|
|
|
|
2007-03-21 02:38:32 +08:00
|
|
|
struct btrfs_fs_info {
|
2008-04-16 03:41:47 +08:00
|
|
|
u8 chunk_tree_uuid[BTRFS_UUID_SIZE];
|
2016-09-03 03:40:02 +08:00
|
|
|
unsigned long flags;
|
2007-03-16 00:56:47 +08:00
|
|
|
struct btrfs_root *tree_root;
|
2008-03-25 03:01:56 +08:00
|
|
|
struct btrfs_root *chunk_root;
|
|
|
|
struct btrfs_root *dev_root;
|
2008-11-18 10:02:50 +08:00
|
|
|
struct btrfs_root *fs_root;
|
2011-09-13 18:56:09 +08:00
|
|
|
struct btrfs_root *quota_root;
|
2013-08-15 23:11:19 +08:00
|
|
|
struct btrfs_root *uuid_root;
|
2020-05-15 14:01:42 +08:00
|
|
|
struct btrfs_root *data_reloc_root;
|
2021-12-16 04:40:07 +08:00
|
|
|
struct btrfs_root *block_group_root;
|
2008-09-06 04:13:11 +08:00
|
|
|
|
|
|
|
/* the log root tree is a directory of all the other log roots */
|
|
|
|
struct btrfs_root *log_root_tree;
|
2009-09-22 03:56:00 +08:00
|
|
|
|
2021-11-06 04:45:51 +08:00
|
|
|
/* The tree that holds the global roots (csum, extent, etc) */
|
|
|
|
rwlock_t global_root_lock;
|
|
|
|
struct rb_root global_root_tree;
|
|
|
|
|
2022-07-15 19:59:21 +08:00
|
|
|
spinlock_t fs_roots_radix_lock;
|
|
|
|
struct radix_tree_root fs_roots_radix;
|
2007-10-16 04:15:26 +08:00
|
|
|
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
/* block group cache stuff */
|
btrfs: use a read/write lock for protecting the block groups tree
Currently we use a spin lock to protect the red black tree that we use to
track block groups. Most accesses to that tree are actually read only and
for large filesystems, with thousands of block groups, it actually has
a bad impact on performance, as concurrent read only searches on the tree
are serialized.
Read only searches on the tree are very frequent and done when:
1) Pinning and unpinning extents, as we need to lookup the respective
block group from the tree;
2) Freeing the last reference of a tree block, regardless if we pin the
underlying extent or add it back to free space cache/tree;
3) During NOCOW writes, both buffered IO and direct IO, we need to check
if the block group that contains an extent is read only or not and to
increment the number of NOCOW writers in the block group. For those
operations we need to search for the block group in the tree.
Similarly, after creating the ordered extent for the NOCOW write, we
need to decrement the number of NOCOW writers from the same block
group, which requires searching for it in the tree;
4) Decreasing the number of extent reservations in a block group;
5) When allocating extents and freeing reserved extents;
6) Adding and removing free space to the free space tree;
7) When releasing delalloc bytes during ordered extent completion;
8) When relocating a block group;
9) During fitrim, to iterate over the block groups;
10) etc;
Write accesses to the tree, to add or remove block groups, are much less
frequent as they happen only when allocating a new block group or when
deleting a block group.
We also use the same spin lock to protect the list of currently caching
block groups. Additions to this list are made when we need to cache a
block group, because we don't have a free space cache for it (or we have
but it's invalid), and removals from this list are done when caching of
the block group's free space finishes. These cases are also not very
common, but when they happen, they happen only once when the filesystem
is mounted.
So switch the lock that protects the tree of block groups from a spinning
lock to a read/write lock.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-13 23:20:41 +08:00
|
|
|
rwlock_t block_group_cache_lock;
|
2022-04-13 23:20:40 +08:00
|
|
|
struct rb_root_cached block_group_cache_tree;
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
|
2011-09-27 05:12:22 +08:00
|
|
|
/* keep track of unallocated space */
|
2017-05-11 14:17:46 +08:00
|
|
|
atomic64_t free_chunk_space;
|
2011-09-27 05:12:22 +08:00
|
|
|
|
2020-01-20 22:09:18 +08:00
|
|
|
/* Track ranges which are used by log trees blocks/logged data extents */
|
|
|
|
struct extent_io_tree excluded_extents;
|
2007-10-16 04:15:26 +08:00
|
|
|
|
2008-03-25 03:01:56 +08:00
|
|
|
/* logical->physical extent mapping */
|
2019-05-17 17:43:17 +08:00
|
|
|
struct extent_map_tree mapping_tree;
|
2008-03-25 03:01:56 +08:00
|
|
|
|
btrfs: implement delayed inode items operation
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
root's radix tree, and letting btrfs inodes go.
Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which is spotted by
Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which is reported by
Itaru Kitayama.
Changelog V3 -> V4:
- Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
inode in time.
Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
balance, which is reported by Tsutomu Itoh.
- Modify the patch address David Sterba's comment.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason
Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
which is created for every directory and file, and used to manage the
delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.
Compare with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. the reason is that btrfs must do a lot of b+ tree insertions,
such as inode item, directory name item, directory name index and so on.
If we can do some delayed b+ tree insertion or deletion, we can improve the
performance, so we made this patch which implemented delayed directory name
index insertion/deletion and delayed inode update.
Implementation:
- introduce a delayed root object into the filesystem, that use two lists to
manage the delayed nodes which are created for every file/directory.
One is used to manage all the delayed nodes that have delayed items. And the
other is used to manage the delayed nodes which is waiting to be dealt with
by the work thread.
- Every delayed node has two rb-tree, one is used to manage the directory name
index which is going to be inserted into b+ tree, and the other is used to
manage the directory name index which is going to be deleted from b+ tree.
- introduce a worker to deal with the delayed operation. This worker is used
to deal with the works of the delayed directory name index items insertion
and deletion and the delayed inode update.
When the delayed items is beyond the lower limit, we create works for some
delayed nodes and insert them into the work queue of the worker, and then
go back.
When the delayed items is beyond the upper bound, we create works for all
the delayed nodes that haven't been dealt with, and insert them into the work
queue of the worker, and then wait for that the untreated items is below some
threshold value.
- When we want to insert a directory name index into b+ tree, we just add the
information into the delayed inserting rb-tree.
And then we check the number of the delayed items and do delayed items
balance. (The balance policy is above.)
- When we want to delete a directory name index from the b+ tree, we search it
in the inserting rb-tree at first. If we look it up, just drop it. If not,
add the key of it into the delayed deleting rb-tree.
Similar to the delayed inserting rb-tree, we also check the number of the
delayed items and do delayed items balance.
(The same to inserting manipulation)
- When we want to update the metadata of some inode, we cached the data of the
inode into the delayed node. the worker will flush it into the b+ tree after
dealing with the delayed insertion and deletion.
- We will move the delayed node to the tail of the list after we access the
delayed node, By this way, we can cache more delayed items and merge more
inode updates.
- If we want to commit transaction, we will deal with all the delayed node.
- the delayed node will be freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
and the delayed inode update.
I did a quick test by the benchmark tool[1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.
Before applying this patch:
Create files:
Total files: 50000
Total time: 1.096108
Average time: 0.000022
Delete files:
Total files: 50000
Total time: 1.510403
Average time: 0.000030
After applying this patch:
Create files:
Total files: 50000
Total time: 0.932899
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.215732
Average time: 0.000024
[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
Many thanks for Kitayama-san's help!
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-22 18:12:22 +08:00
|
|
|
/*
|
|
|
|
* block reservation for extent, checksum, root tree and
|
|
|
|
* delayed dir index item
|
|
|
|
*/
|
2010-05-16 22:46:25 +08:00
|
|
|
struct btrfs_block_rsv global_block_rsv;
|
|
|
|
/* block reservation for metadata operations */
|
|
|
|
struct btrfs_block_rsv trans_block_rsv;
|
|
|
|
/* block reservation for chunk tree */
|
|
|
|
struct btrfs_block_rsv chunk_block_rsv;
|
2011-11-04 10:54:25 +08:00
|
|
|
/* block reservation for delayed operations */
|
|
|
|
struct btrfs_block_rsv delayed_block_rsv;
|
btrfs: introduce delayed_refs_rsv
Traditionally we've had voodoo in btrfs to account for the space that
delayed refs may take up by having a global_block_rsv. This works most
of the time, except when it doesn't. We've had issues reported and seen
in production where sometimes the global reserve is exhausted during
transaction commit before we can run all of our delayed refs, resulting
in an aborted transaction. Because of this voodoo we have equally
dubious flushing semantics around throttling delayed refs which we often
get wrong.
So instead give them their own block_rsv. This way we can always know
exactly how much outstanding space we need for delayed refs. This
allows us to make sure we are constantly filling that reservation up
with space, and allows us to put more precise pressure on the enospc
system. Instead of doing math to see if its a good time to throttle,
the normal enospc code will be invoked if we have a lot of delayed refs
pending, and they will be run via the normal flushing mechanism.
For now the delayed_refs_rsv will hold the reservations for the delayed
refs, the block group updates, and deleting csums. We could have a
separate rsv for the block group updates, but the csum deletion stuff is
still handled via the delayed_refs so that will stay there.
Historical background:
The global reserve has grown to cover everything we don't reserve space
explicitly for, and we've grown a lot of weird ad-hoc heuristics to know
if we're running short on space and when it's time to force a commit. A
failure rate of 20-40 file systems when we run hundreds of thousands of
them isn't super high, but cleaning up this code will make things less
ugly and more predictible.
Thus the delayed refs rsv. We always know how many delayed refs we have
outstanding, and although running them generates more we can use the
global reserve for that spill over, which fits better into it's desired
use than a full blown reservation. This first approach is to simply
take how many times we're reserving space for and multiply that by 2 in
order to save enough space for the delayed refs that could be generated.
This is a niave approach and will probably evolve, but for now it works.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com> # high-level review
[ added background notes from the cover letter ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-03 23:20:33 +08:00
|
|
|
/* block reservation for delayed refs */
|
|
|
|
struct btrfs_block_rsv delayed_refs_rsv;
|
2010-05-16 22:46:25 +08:00
|
|
|
|
|
|
|
struct btrfs_block_rsv empty_block_rsv;
|
|
|
|
|
2007-03-21 03:57:25 +08:00
|
|
|
u64 generation;
|
2007-08-11 04:22:09 +08:00
|
|
|
u64 last_trans_committed;
|
btrfs: make send work with concurrent block group relocation
We don't allow send and balance/relocation to run in parallel in order
to prevent send failing or silently producing some bad stream. This is
because while send is using an extent (specially metadata) or about to
read a metadata extent and expecting it belongs to a specific parent
node, relocation can run, the transaction used for the relocation is
committed and the extent gets reallocated while send is still using the
extent, so it ends up with a different content than expected. This can
result in just failing to read a metadata extent due to failure of the
validation checks (parent transid, level, etc), failure to find a
backreference for a data extent, and other unexpected failures. Besides
reallocation, there's also a similar problem of an extent getting
discarded when it's unpinned after the transaction used for block group
relocation is committed.
The restriction between balance and send was added in commit 9e967495e0e0
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.
Both send and relocation can be very long running operations. Relocation
because it has to do a lot of IO and expensive backreference lookups in
case there are many snapshots, and send due to read IO when operating on
very large trees. This makes it inconvenient for users and tools to deal
with scheduling both operations.
For zoned filesystem we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it or send can end up
delaying the block group relocation for too long. In the future we might
also get the automatic block group relocation for non zoned filesystems.
This change makes it possible for send and relocation to run in parallel.
This is achieved the following way:
1) For all tree searches, send acquires a read lock on the commit root
semaphore;
2) After each tree search, and before releasing the commit root semaphore,
the leaf is cloned and placed in the search path (struct btrfs_path);
3) After releasing the commit root semaphore, the changed_cb() callback
is invoked, which operates on the leaf and writes commands to the pipe
(or file in case send/receive is not used with a pipe). It's important
here to not hold a lock on the commit root semaphore, because if we did
we could deadlock when sending and receiving to the same filesystem
using a pipe - the send task blocks on the pipe because it's full, the
receive task, which is the only consumer of the pipe, triggers a
transaction commit when attempting to create a subvolume or reserve
space for a write operation for example, but the transaction commit
blocks trying to write lock the commit root semaphore, resulting in a
deadlock;
4) Before moving to the next key, or advancing to the next change in case
of an incremental send, check if a transaction used for relocation was
committed (or is about to finish its commit). If so, release the search
path(s) and restart the search, to where we were before, so that we
don't operate on stale extent buffers. The search restarts are always
possible because both the send and parent roots are RO, and no one can
add, remove of update keys (change their offset) in RO trees - the
only exception is deduplication, but that is still not allowed to run
in parallel with send;
5) Periodically check if there is contention on the commit root semaphore,
which means there is a transaction commit trying to write lock it, and
release the semaphore and reschedule if there is contention, so as to
avoid causing any significant delays to transaction commits.
This leaves some room for optimizations for send to have less path
releases and re searching the trees when there's relocation running, but
for now it's kept simple as it performs quite well (on very large trees
with resulting send streams in the order of a few hundred gigabytes).
Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.
A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-11-22 20:03:38 +08:00
|
|
|
/*
|
|
|
|
* Generation of the last transaction used for block group relocation
|
|
|
|
* since the filesystem was last mounted (or 0 if none happened yet).
|
|
|
|
* Must be written and read while holding btrfs_fs_info::commit_root_sem.
|
|
|
|
*/
|
|
|
|
u64 last_reloc_trans;
|
2014-01-23 23:54:11 +08:00
|
|
|
u64 avg_delayed_ref_runtime;
|
2009-03-24 22:24:20 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* this is updated to the current trans every time a full commit
|
|
|
|
* is required instead of the faster short fsync log commits
|
|
|
|
*/
|
|
|
|
u64 last_trans_log_full_commit;
|
2012-03-30 19:58:32 +08:00
|
|
|
unsigned long mount_opt;
|
2022-10-19 22:50:56 +08:00
|
|
|
|
2010-12-17 14:21:50 +08:00
|
|
|
unsigned long compress_type:4;
|
2017-09-15 23:36:57 +08:00
|
|
|
unsigned int compress_level;
|
2018-02-13 17:50:46 +08:00
|
|
|
u32 commit_interval;
|
2013-01-29 18:05:05 +08:00
|
|
|
/*
|
|
|
|
* It is a suggestive number, the read side is safe even it gets a
|
|
|
|
* wrong number because we will write out the data into a regular
|
|
|
|
* extent. The write side(mount/remount) is under ->s_umount lock,
|
|
|
|
* so it is also safe.
|
|
|
|
*/
|
2008-01-30 05:03:38 +08:00
|
|
|
u64 max_inline;
|
2017-06-15 07:30:06 +08:00
|
|
|
|
2007-03-23 03:59:16 +08:00
|
|
|
struct btrfs_transaction *running_transaction;
|
2008-07-18 00:53:50 +08:00
|
|
|
wait_queue_head_t transaction_throttle;
|
2008-07-18 00:54:14 +08:00
|
|
|
wait_queue_head_t transaction_wait;
|
2010-10-30 03:37:34 +08:00
|
|
|
wait_queue_head_t transaction_blocked_wait;
|
2008-11-07 11:02:51 +08:00
|
|
|
wait_queue_head_t async_submit_wait;
|
2008-09-06 04:13:11 +08:00
|
|
|
|
2013-04-11 18:30:16 +08:00
|
|
|
/*
|
|
|
|
* Used to protect the incompat_flags, compat_flags, compat_ro_flags
|
|
|
|
* when they are updated.
|
|
|
|
*
|
|
|
|
* Because we do not clear the flags for ever, so we needn't use
|
|
|
|
* the lock on the read side.
|
|
|
|
*
|
|
|
|
* We also needn't use the lock when we mount the fs, because
|
|
|
|
* there is no other task which will update the flag.
|
|
|
|
*/
|
|
|
|
spinlock_t super_lock;
|
2011-04-13 21:41:04 +08:00
|
|
|
struct btrfs_super_block *super_copy;
|
|
|
|
struct btrfs_super_block *super_for_commit;
|
2007-03-23 00:13:20 +08:00
|
|
|
struct super_block *sb;
|
2007-03-29 01:57:48 +08:00
|
|
|
struct inode *btree_inode;
|
2008-09-06 04:13:11 +08:00
|
|
|
struct mutex tree_log_mutex;
|
2008-06-26 04:01:31 +08:00
|
|
|
struct mutex transaction_kthread_mutex;
|
|
|
|
struct mutex cleaner_mutex;
|
2008-06-26 04:01:30 +08:00
|
|
|
struct mutex chunk_mutex;
|
2013-01-30 07:40:14 +08:00
|
|
|
|
2015-04-07 03:46:08 +08:00
|
|
|
/*
|
|
|
|
* this is taken to make sure we don't set block groups ro after
|
|
|
|
* the free space cache has been allocated on them
|
|
|
|
*/
|
|
|
|
struct mutex ro_block_group_mutex;
|
|
|
|
|
2013-01-30 07:40:14 +08:00
|
|
|
/* this is used during read/modify/write to make sure
|
|
|
|
* no two ios are trying to mod the same stripe at the same
|
|
|
|
* time
|
|
|
|
*/
|
|
|
|
struct btrfs_stripe_hash_table *stripe_hash_table;
|
|
|
|
|
2009-04-01 01:27:11 +08:00
|
|
|
/*
|
|
|
|
* this protects the ordered operations list only while we are
|
|
|
|
* processing all of the entries on it. This way we make
|
|
|
|
* sure the commit code doesn't find the list temporarily empty
|
|
|
|
* because another function happens to be doing non-waiting preflush
|
|
|
|
* before jumping into the main commit.
|
|
|
|
*/
|
|
|
|
struct mutex ordered_operations_mutex;
|
2013-08-14 23:33:56 +08:00
|
|
|
|
2014-03-14 03:42:13 +08:00
|
|
|
struct rw_semaphore commit_root_sem;
|
2009-04-01 01:27:11 +08:00
|
|
|
|
2009-11-12 17:34:40 +08:00
|
|
|
struct rw_semaphore cleanup_work_sem;
|
2009-09-22 04:00:26 +08:00
|
|
|
|
2009-11-12 17:34:40 +08:00
|
|
|
struct rw_semaphore subvol_sem;
|
2009-09-22 04:00:26 +08:00
|
|
|
|
2011-04-12 05:25:13 +08:00
|
|
|
spinlock_t trans_lock;
|
2011-06-14 08:00:16 +08:00
|
|
|
/*
|
|
|
|
* the reloc mutex goes with the trans lock, it is taken
|
|
|
|
* during commit to protect us from the relocation code
|
|
|
|
*/
|
|
|
|
struct mutex reloc_mutex;
|
|
|
|
|
2007-04-20 09:01:03 +08:00
|
|
|
struct list_head trans_list;
|
2007-06-09 06:11:48 +08:00
|
|
|
struct list_head dead_roots;
|
2009-09-12 04:11:19 +08:00
|
|
|
struct list_head caching_block_groups;
|
2008-09-06 04:13:11 +08:00
|
|
|
|
2009-11-12 17:36:34 +08:00
|
|
|
spinlock_t delayed_iput_lock;
|
|
|
|
struct list_head delayed_iputs;
|
2018-12-04 00:06:52 +08:00
|
|
|
atomic_t nr_delayed_iputs;
|
|
|
|
wait_queue_head_t delayed_iputs_wait;
|
2009-11-12 17:36:34 +08:00
|
|
|
|
2013-04-25 00:57:33 +08:00
|
|
|
atomic64_t tree_mod_seq;
|
2012-05-16 23:55:38 +08:00
|
|
|
|
Btrfs: fix race between adding and putting tree mod seq elements and nodes
There is a race between adding and removing elements to the tree mod log
list and rbtree that can lead to use-after-free problems.
Consider the following example that explains how/why the problems happens:
1) Task A has mod log element with sequence number 200. It currently is
the only element in the mod log list;
2) Task A calls btrfs_put_tree_mod_seq() because it no longer needs to
access the tree mod log. When it enters the function, it initializes
'min_seq' to (u64)-1. Then it acquires the lock 'tree_mod_seq_lock'
before checking if there are other elements in the mod seq list.
Since the list it empty, 'min_seq' remains set to (u64)-1. Then it
unlocks the lock 'tree_mod_seq_lock';
3) Before task A acquires the lock 'tree_mod_log_lock', task B adds
itself to the mod seq list through btrfs_get_tree_mod_seq() and gets a
sequence number of 201;
4) Some other task, name it task C, modifies a btree and because there
elements in the mod seq list, it adds a tree mod elem to the tree
mod log rbtree. That node added to the mod log rbtree is assigned
a sequence number of 202;
5) Task B, which is doing fiemap and resolving indirect back references,
calls btrfs get_old_root(), with 'time_seq' == 201, which in turn
calls tree_mod_log_search() - the search returns the mod log node
from the rbtree with sequence number 202, created by task C;
6) Task A now acquires the lock 'tree_mod_log_lock', starts iterating
the mod log rbtree and finds the node with sequence number 202. Since
202 is less than the previously computed 'min_seq', (u64)-1, it
removes the node and frees it;
7) Task B still has a pointer to the node with sequence number 202, and
it dereferences the pointer itself and through the call to
__tree_mod_log_rewind(), resulting in a use-after-free problem.
This issue can be triggered sporadically with the test case generic/561
from fstests, and it happens more frequently with a higher number of
duperemove processes. When it happens to me, it either freezes the VM or
it produces a trace like the following before crashing:
[ 1245.321140] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
[ 1245.321200] CPU: 1 PID: 26997 Comm: pool Not tainted 5.5.0-rc6-btrfs-next-52 #1
[ 1245.321235] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
[ 1245.321287] RIP: 0010:rb_next+0x16/0x50
[ 1245.321307] Code: ....
[ 1245.321372] RSP: 0018:ffffa151c4d039b0 EFLAGS: 00010202
[ 1245.321388] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8ae221363c80 RCX: 6b6b6b6b6b6b6b6b
[ 1245.321409] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8ae221363c80
[ 1245.321439] RBP: ffff8ae20fcc4688 R08: 0000000000000002 R09: 0000000000000000
[ 1245.321475] R10: ffff8ae20b120910 R11: 00000000243f8bb1 R12: 0000000000000038
[ 1245.321506] R13: ffff8ae221363c80 R14: 000000000000075f R15: ffff8ae223f762b8
[ 1245.321539] FS: 00007fdee1ec7700(0000) GS:ffff8ae236c80000(0000) knlGS:0000000000000000
[ 1245.321591] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1245.321614] CR2: 00007fded4030c48 CR3: 000000021da16003 CR4: 00000000003606e0
[ 1245.321642] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1245.321668] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1245.321706] Call Trace:
[ 1245.321798] __tree_mod_log_rewind+0xbf/0x280 [btrfs]
[ 1245.321841] btrfs_search_old_slot+0x105/0xd00 [btrfs]
[ 1245.321877] resolve_indirect_refs+0x1eb/0xc60 [btrfs]
[ 1245.321912] find_parent_nodes+0x3dc/0x11b0 [btrfs]
[ 1245.321947] btrfs_check_shared+0x115/0x1c0 [btrfs]
[ 1245.321980] ? extent_fiemap+0x59d/0x6d0 [btrfs]
[ 1245.322029] extent_fiemap+0x59d/0x6d0 [btrfs]
[ 1245.322066] do_vfs_ioctl+0x45a/0x750
[ 1245.322081] ksys_ioctl+0x70/0x80
[ 1245.322092] ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 1245.322113] __x64_sys_ioctl+0x16/0x20
[ 1245.322126] do_syscall_64+0x5c/0x280
[ 1245.322139] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 1245.322155] RIP: 0033:0x7fdee3942dd7
[ 1245.322177] Code: ....
[ 1245.322258] RSP: 002b:00007fdee1ec6c88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1245.322294] RAX: ffffffffffffffda RBX: 00007fded40210d8 RCX: 00007fdee3942dd7
[ 1245.322314] RDX: 00007fded40210d8 RSI: 00000000c020660b RDI: 0000000000000004
[ 1245.322337] RBP: 0000562aa89e7510 R08: 0000000000000000 R09: 00007fdee1ec6d44
[ 1245.322369] R10: 0000000000000073 R11: 0000000000000246 R12: 00007fdee1ec6d48
[ 1245.322390] R13: 00007fdee1ec6d40 R14: 00007fded40210d0 R15: 00007fdee1ec6d50
[ 1245.322423] Modules linked in: ....
[ 1245.323443] ---[ end trace 01de1e9ec5dff3cd ]---
Fix this by ensuring that btrfs_put_tree_mod_seq() computes the minimum
sequence number and iterates the rbtree while holding the lock
'tree_mod_log_lock' in write mode. Also get rid of the 'tree_mod_seq_lock'
lock, since it is now redundant.
Fixes: bd989ba359f2ac ("Btrfs: add tree modification log functions")
Fixes: 097b8a7c9e48e2 ("Btrfs: join tree mod log code with the code holding back delayed refs")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-22 20:23:20 +08:00
|
|
|
/* this protects tree_mod_log and tree_mod_seq_list */
|
2012-05-16 23:55:38 +08:00
|
|
|
rwlock_t tree_mod_log_lock;
|
|
|
|
struct rb_root tree_mod_log;
|
Btrfs: fix race between adding and putting tree mod seq elements and nodes
There is a race between adding and removing elements to the tree mod log
list and rbtree that can lead to use-after-free problems.
Consider the following example that explains how/why the problems happens:
1) Task A has mod log element with sequence number 200. It currently is
the only element in the mod log list;
2) Task A calls btrfs_put_tree_mod_seq() because it no longer needs to
access the tree mod log. When it enters the function, it initializes
'min_seq' to (u64)-1. Then it acquires the lock 'tree_mod_seq_lock'
before checking if there are other elements in the mod seq list.
Since the list it empty, 'min_seq' remains set to (u64)-1. Then it
unlocks the lock 'tree_mod_seq_lock';
3) Before task A acquires the lock 'tree_mod_log_lock', task B adds
itself to the mod seq list through btrfs_get_tree_mod_seq() and gets a
sequence number of 201;
4) Some other task, name it task C, modifies a btree and because there
elements in the mod seq list, it adds a tree mod elem to the tree
mod log rbtree. That node added to the mod log rbtree is assigned
a sequence number of 202;
5) Task B, which is doing fiemap and resolving indirect back references,
calls btrfs get_old_root(), with 'time_seq' == 201, which in turn
calls tree_mod_log_search() - the search returns the mod log node
from the rbtree with sequence number 202, created by task C;
6) Task A now acquires the lock 'tree_mod_log_lock', starts iterating
the mod log rbtree and finds the node with sequence number 202. Since
202 is less than the previously computed 'min_seq', (u64)-1, it
removes the node and frees it;
7) Task B still has a pointer to the node with sequence number 202, and
it dereferences the pointer itself and through the call to
__tree_mod_log_rewind(), resulting in a use-after-free problem.
This issue can be triggered sporadically with the test case generic/561
from fstests, and it happens more frequently with a higher number of
duperemove processes. When it happens to me, it either freezes the VM or
it produces a trace like the following before crashing:
[ 1245.321140] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
[ 1245.321200] CPU: 1 PID: 26997 Comm: pool Not tainted 5.5.0-rc6-btrfs-next-52 #1
[ 1245.321235] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
[ 1245.321287] RIP: 0010:rb_next+0x16/0x50
[ 1245.321307] Code: ....
[ 1245.321372] RSP: 0018:ffffa151c4d039b0 EFLAGS: 00010202
[ 1245.321388] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8ae221363c80 RCX: 6b6b6b6b6b6b6b6b
[ 1245.321409] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8ae221363c80
[ 1245.321439] RBP: ffff8ae20fcc4688 R08: 0000000000000002 R09: 0000000000000000
[ 1245.321475] R10: ffff8ae20b120910 R11: 00000000243f8bb1 R12: 0000000000000038
[ 1245.321506] R13: ffff8ae221363c80 R14: 000000000000075f R15: ffff8ae223f762b8
[ 1245.321539] FS: 00007fdee1ec7700(0000) GS:ffff8ae236c80000(0000) knlGS:0000000000000000
[ 1245.321591] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1245.321614] CR2: 00007fded4030c48 CR3: 000000021da16003 CR4: 00000000003606e0
[ 1245.321642] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1245.321668] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1245.321706] Call Trace:
[ 1245.321798] __tree_mod_log_rewind+0xbf/0x280 [btrfs]
[ 1245.321841] btrfs_search_old_slot+0x105/0xd00 [btrfs]
[ 1245.321877] resolve_indirect_refs+0x1eb/0xc60 [btrfs]
[ 1245.321912] find_parent_nodes+0x3dc/0x11b0 [btrfs]
[ 1245.321947] btrfs_check_shared+0x115/0x1c0 [btrfs]
[ 1245.321980] ? extent_fiemap+0x59d/0x6d0 [btrfs]
[ 1245.322029] extent_fiemap+0x59d/0x6d0 [btrfs]
[ 1245.322066] do_vfs_ioctl+0x45a/0x750
[ 1245.322081] ksys_ioctl+0x70/0x80
[ 1245.322092] ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 1245.322113] __x64_sys_ioctl+0x16/0x20
[ 1245.322126] do_syscall_64+0x5c/0x280
[ 1245.322139] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 1245.322155] RIP: 0033:0x7fdee3942dd7
[ 1245.322177] Code: ....
[ 1245.322258] RSP: 002b:00007fdee1ec6c88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1245.322294] RAX: ffffffffffffffda RBX: 00007fded40210d8 RCX: 00007fdee3942dd7
[ 1245.322314] RDX: 00007fded40210d8 RSI: 00000000c020660b RDI: 0000000000000004
[ 1245.322337] RBP: 0000562aa89e7510 R08: 0000000000000000 R09: 00007fdee1ec6d44
[ 1245.322369] R10: 0000000000000073 R11: 0000000000000246 R12: 00007fdee1ec6d48
[ 1245.322390] R13: 00007fdee1ec6d40 R14: 00007fded40210d0 R15: 00007fdee1ec6d50
[ 1245.322423] Modules linked in: ....
[ 1245.323443] ---[ end trace 01de1e9ec5dff3cd ]---
Fix this by ensuring that btrfs_put_tree_mod_seq() computes the minimum
sequence number and iterates the rbtree while holding the lock
'tree_mod_log_lock' in write mode. Also get rid of the 'tree_mod_seq_lock'
lock, since it is now redundant.
Fixes: bd989ba359f2ac ("Btrfs: add tree modification log functions")
Fixes: 097b8a7c9e48e2 ("Btrfs: join tree mod log code with the code holding back delayed refs")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-22 20:23:20 +08:00
|
|
|
struct list_head tree_mod_seq_list;
|
2012-05-16 23:55:38 +08:00
|
|
|
|
2008-11-07 11:02:51 +08:00
|
|
|
atomic_t async_delalloc_pages;
|
2008-04-10 04:28:12 +08:00
|
|
|
|
2008-07-24 23:57:52 +08:00
|
|
|
/*
|
2013-05-15 15:48:23 +08:00
|
|
|
* this is used to protect the following list -- ordered_roots.
|
2008-07-24 23:57:52 +08:00
|
|
|
*/
|
2013-05-15 15:48:23 +08:00
|
|
|
spinlock_t ordered_root_lock;
|
2009-04-01 01:27:11 +08:00
|
|
|
|
|
|
|
/*
|
2013-05-15 15:48:23 +08:00
|
|
|
* all fs/file tree roots in which there are data=ordered extents
|
|
|
|
* pending writeback are added into this list.
|
|
|
|
*
|
2009-04-01 01:27:11 +08:00
|
|
|
* these can span multiple transactions and basically include
|
|
|
|
* every dirty data page that isn't from nodatacow
|
|
|
|
*/
|
2013-05-15 15:48:23 +08:00
|
|
|
struct list_head ordered_roots;
|
2009-04-01 01:27:11 +08:00
|
|
|
|
2014-03-06 13:55:03 +08:00
|
|
|
struct mutex delalloc_root_mutex;
|
2013-05-15 15:48:22 +08:00
|
|
|
spinlock_t delalloc_root_lock;
|
|
|
|
/* all fs/file tree roots that have delalloc inodes. */
|
|
|
|
struct list_head delalloc_roots;
|
2008-07-24 23:57:52 +08:00
|
|
|
|
2008-06-12 04:50:36 +08:00
|
|
|
/*
|
|
|
|
* there is a pool of worker threads for checksumming during writes
|
|
|
|
* and a pool for checksumming after reads. This is because readers
|
|
|
|
* can run with FS locks held, and the writers may be waiting for
|
|
|
|
* those locks. We don't want ordering in the pending list to cause
|
|
|
|
* deadlocks, and so the two are serviced separately.
|
2008-06-13 02:46:17 +08:00
|
|
|
*
|
|
|
|
* A third pool does submit_bio to avoid deadlocking with the other
|
|
|
|
* two
|
2008-06-12 04:50:36 +08:00
|
|
|
*/
|
2014-02-28 10:46:19 +08:00
|
|
|
struct btrfs_workqueue *workers;
|
2022-04-18 12:43:09 +08:00
|
|
|
struct btrfs_workqueue *hipri_workers;
|
2014-02-28 10:46:19 +08:00
|
|
|
struct btrfs_workqueue *delalloc_workers;
|
|
|
|
struct btrfs_workqueue *flush_workers;
|
2022-05-26 15:36:40 +08:00
|
|
|
struct workqueue_struct *endio_workers;
|
|
|
|
struct workqueue_struct *endio_meta_workers;
|
2022-05-26 15:36:36 +08:00
|
|
|
struct workqueue_struct *endio_raid56_workers;
|
2022-04-18 12:43:11 +08:00
|
|
|
struct workqueue_struct *rmw_workers;
|
2022-05-26 15:36:38 +08:00
|
|
|
struct workqueue_struct *compressed_write_workers;
|
2014-02-28 10:46:19 +08:00
|
|
|
struct btrfs_workqueue *endio_write_workers;
|
|
|
|
struct btrfs_workqueue *endio_freespace_worker;
|
|
|
|
struct btrfs_workqueue *caching_workers;
|
2011-07-01 02:42:28 +08:00
|
|
|
|
2008-07-18 00:53:51 +08:00
|
|
|
/*
|
|
|
|
* fixup workers take dirty pages that didn't properly go through
|
|
|
|
* the cow mechanism and make them safe to write. It happens
|
|
|
|
* for the sys_munmap function call path
|
|
|
|
*/
|
2014-02-28 10:46:19 +08:00
|
|
|
struct btrfs_workqueue *fixup_workers;
|
|
|
|
struct btrfs_workqueue *delayed_workers;
|
2014-05-23 07:18:52 +08:00
|
|
|
|
2008-06-26 04:01:31 +08:00
|
|
|
struct task_struct *transaction_kthread;
|
|
|
|
struct task_struct *cleaner_kthread;
|
2018-02-13 17:50:42 +08:00
|
|
|
u32 thread_pool_size;
|
2008-06-12 04:50:36 +08:00
|
|
|
|
2013-11-02 01:07:04 +08:00
|
|
|
struct kobject *space_info_kobj;
|
2020-06-28 13:07:15 +08:00
|
|
|
struct kobject *qgroups_kobj;
|
2022-07-26 03:15:15 +08:00
|
|
|
struct kobject *discard_kobj;
|
2007-03-21 02:38:32 +08:00
|
|
|
|
2013-01-29 18:09:20 +08:00
|
|
|
/* used to keep from writing metadata until there is a nice batch */
|
|
|
|
struct percpu_counter dirty_metadata_bytes;
|
2013-01-29 18:10:51 +08:00
|
|
|
struct percpu_counter delalloc_bytes;
|
2020-10-09 21:28:20 +08:00
|
|
|
struct percpu_counter ordered_bytes;
|
2013-01-29 18:09:20 +08:00
|
|
|
s32 dirty_metadata_batch;
|
2013-01-29 18:10:51 +08:00
|
|
|
s32 delalloc_batch;
|
|
|
|
|
2008-03-25 03:01:56 +08:00
|
|
|
struct list_head dirty_cowonly_roots;
|
|
|
|
|
2008-03-25 03:02:07 +08:00
|
|
|
struct btrfs_fs_devices *fs_devices;
|
2009-03-11 00:39:20 +08:00
|
|
|
|
|
|
|
/*
|
2018-03-21 03:25:25 +08:00
|
|
|
* The space_info list is effectively read only after initial
|
|
|
|
* setup. It is populated at mount time and cleaned up after
|
|
|
|
* all block groups are removed. RCU is used to protect it.
|
2009-03-11 00:39:20 +08:00
|
|
|
*/
|
2008-03-25 03:01:59 +08:00
|
|
|
struct list_head space_info;
|
2009-03-11 00:39:20 +08:00
|
|
|
|
2012-07-10 10:21:07 +08:00
|
|
|
struct btrfs_space_info *data_sinfo;
|
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
struct reloc_control *reloc_ctl;
|
|
|
|
|
btrfs: Do not use data_alloc_cluster in ssd mode
This patch provides a band aid to improve the 'out of the box'
behaviour of btrfs for disks that are detected as being an ssd. In a
general purpose mixed workload scenario, the current ssd mode causes
overallocation of available raw disk space for data, while leaving
behind increasing amounts of unused fragmented free space. This
situation leads to early ENOSPC problems which are harming user
experience and adoption of btrfs as a general purpose filesystem.
This patch modifies the data extent allocation behaviour of the ssd mode
to make it behave identical to nossd mode. The metadata behaviour and
additional ssd_spread option stay untouched so far.
Recommendations for future development are to reconsider the current
oversimplified nossd / ssd distinction and the broken detection
mechanism based on the rotational attribute in sysfs and provide
experienced users with a more flexible way to choose allocator behaviour
for data and metadata, optimized for certain use cases, while keeping
sane 'out of the box' default settings. The internals of the current
btrfs code have more potential than what currently gets exposed to the
user to choose from.
The SSD story...
In the first year of btrfs development, around early 2008, btrfs
gained a mount option which enables specific functionality for
filesystems on solid state devices. The first occurance of this
functionality is in commit e18e4809, labeled "Add mount -o ssd, which
includes optimizations for seek free storage".
The effect on allocating free space for doing (data) writes is to
'cluster' writes together, writing them out in contiguous space, as
opposed to a 'tetris' way of putting all separate writes into any free
space fragment that fits (which is what the -o nossd behaviour does).
A somewhat simplified explanation of what happens is that, when for
example, the 'cluster' size is set to 2MiB, when we do some writes, the
data allocator will search for a free space block that is 2MiB big, and
put the writes in there. The ssd mode itself might allow a 2MiB cluster
to be composed of multiple free space extents with some existing data in
between, while the additional ssd_spread mount option kills off this
option and requires fully free space.
The idea behind this is (commit 536ac8ae): "The [...] clusters make it
more likely a given IO will completely overwrite the ssd block, so it
doesn't have to do an internal rwm cycle."; ssd block meaning nand erase
block. So, effectively this means applying a "locality based algorithm"
and trying to outsmart the actual ssd.
Since then, various changes have been made to the involved code, but the
basic idea is still present, and gets activated whenever the ssd mount
option is active. This also happens by default, when the rotational flag
as seen at /sys/block/<device>/queue/rotational is set to 0.
However, there's a number of problems with this approach.
First, what the optimization is trying to do is outsmart the ssd by
assuming there is a relation between the physical address space of the
block device as seen by btrfs and the actual physical storage of the
ssd, and then adjusting data placement. However, since the introduction
of the Flash Translation Layer (FTL) which is a part of the internal
controller of an ssd, these attempts are futile. The use of good quality
FTL in consumer ssd products might have been limited in 2008, but this
situation has changed drastically soon after that time. Today, even the
flash memory in your automatic cat feeding machine or your grandma's
wheelchair has a full featured one.
Second, the behaviour as described above results in the filesystem being
filled up with badly fragmented free space extents because of relatively
small pieces of space that are freed up by deletes, but not selected
again as part of a 'cluster'. Since the algorithm prefers allocating a
new chunk over going back to tetris mode, the end result is a filesystem
in which all raw space is allocated, but which is composed of
underutilized chunks with a 'shotgun blast' pattern of fragmented free
space. Usually, the next problematic thing that happens is the
filesystem wanting to allocate new space for metadata, which causes the
filesystem to fail in spectacular ways.
Third, the default mount options you get for an ssd ('ssd' mode enabled,
'discard' not enabled), in combination with spreading out writes over
the full address space and ignoring freed up space leads to worst case
behaviour in providing information to the ssd itself, since it will
never learn that all the free space left behind is actually free. There
are two ways to let an ssd know previously written data does not have to
be preserved, which are sending explicit signals using discard or
fstrim, or by simply overwriting the space with new data. The worst
case behaviour is the btrfs ssd_spread mount option in combination with
not having discard enabled. It has a side effect of minimizing the reuse
of free space previously written in.
Fourth, the rotational flag in /sys/ does not reliably indicate if the
device is a locally attached ssd. For example, iSCSI or NBD displays as
non-rotational, while a loop device on an ssd shows up as rotational.
The combination of the second and third problem effectively means that
despite all the good intentions, the btrfs ssd mode reliably causes the
ssd hardware and the filesystem structures and performance to be choked
to death. The clickbait version of the title of this story would have
been "Btrfs ssd optimizations considered harmful for ssds".
The current nossd 'tetris' mode (even still without discard) allows a
pattern of overwriting much more previously used space, causing many
more implicit discards to happen because of the overwrite information
the ssd gets. The actual location in the physical address space, as seen
from the point of view of btrfs is irrelevant, because the actual writes
to the low level flash are reordered anyway thanks to the FTL.
Changes made in the code
1. Make ssd mode data allocation identical to tetris mode, like nossd.
2. Adjust and clean up filesystem mount messages so that we can easily
identify if a kernel has this patch applied or not, when providing
support to end users. Also, make better use of the *_and_info helpers to
only trigger messages on actual state changes.
Backporting notes
Notes for whoever wants to backport this patch to their 4.9 LTS kernel:
* First apply commit 951e7966 "btrfs: drop the nossd flag when
remounting with -o ssd", or fixup the differences manually.
* The rest of the conflicts are because of the fs_info refactoring. So,
for example, instead of using fs_info, it's root->fs_info in
extent-tree.c
Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-07-28 14:31:28 +08:00
|
|
|
/* data_alloc_cluster is only used in ssd_spread mode */
|
2009-04-03 21:47:43 +08:00
|
|
|
struct btrfs_free_cluster data_alloc_cluster;
|
|
|
|
|
|
|
|
/* all metadata allocations go through this cluster */
|
|
|
|
struct btrfs_free_cluster meta_alloc_cluster;
|
2008-04-05 03:40:00 +08:00
|
|
|
|
2011-05-25 03:35:30 +08:00
|
|
|
/* auto defrag inodes go here */
|
|
|
|
spinlock_t defrag_inodes_lock;
|
|
|
|
struct rb_root defrag_inodes;
|
|
|
|
atomic_t defrag_running;
|
|
|
|
|
2013-01-29 18:13:12 +08:00
|
|
|
/* Used to protect avail_{data, metadata, system}_alloc_bits */
|
|
|
|
seqlock_t profiles_lock;
|
2012-01-17 04:04:47 +08:00
|
|
|
/*
|
|
|
|
* these three are in extended format (availability of single
|
|
|
|
* chunks is denoted by BTRFS_AVAIL_ALLOC_BIT_SINGLE bit, other
|
|
|
|
* types are denoted by corresponding BTRFS_BLOCK_GROUP_* bits)
|
|
|
|
*/
|
2008-04-05 03:40:00 +08:00
|
|
|
u64 avail_data_alloc_bits;
|
|
|
|
u64 avail_metadata_alloc_bits;
|
|
|
|
u64 avail_system_alloc_bits;
|
2008-04-29 03:29:42 +08:00
|
|
|
|
2012-01-17 04:04:47 +08:00
|
|
|
/* restriper state */
|
|
|
|
spinlock_t balance_lock;
|
|
|
|
struct mutex balance_mutex;
|
2012-01-17 04:04:49 +08:00
|
|
|
atomic_t balance_pause_req;
|
2012-01-17 04:04:49 +08:00
|
|
|
atomic_t balance_cancel_req;
|
2012-01-17 04:04:47 +08:00
|
|
|
struct btrfs_balance_control *balance_ctl;
|
2012-01-17 04:04:49 +08:00
|
|
|
wait_queue_head_t balance_wait_q;
|
2012-01-17 04:04:47 +08:00
|
|
|
|
2021-05-18 06:37:36 +08:00
|
|
|
/* Cancellation requests for chunk relocation */
|
|
|
|
atomic_t reloc_cancel_req;
|
|
|
|
|
2018-02-26 16:46:05 +08:00
|
|
|
u32 data_chunk_allocations;
|
|
|
|
u32 metadata_ratio;
|
2009-04-22 05:40:57 +08:00
|
|
|
|
2008-04-29 03:29:42 +08:00
|
|
|
void *bdev_holder;
|
2011-01-06 19:30:25 +08:00
|
|
|
|
2011-03-08 21:14:00 +08:00
|
|
|
/* private scrub information */
|
|
|
|
struct mutex scrub_lock;
|
|
|
|
atomic_t scrubs_running;
|
|
|
|
atomic_t scrub_pause_req;
|
|
|
|
atomic_t scrubs_paused;
|
|
|
|
atomic_t scrub_cancel_req;
|
|
|
|
wait_queue_head_t scrub_pause_wait;
|
2019-02-12 23:51:18 +08:00
|
|
|
/*
|
|
|
|
* The worker pointers are NULL iff the refcount is 0, ie. scrub is not
|
|
|
|
* running.
|
|
|
|
*/
|
2019-01-30 14:45:02 +08:00
|
|
|
refcount_t scrub_workers_refcnt;
|
2022-04-18 12:43:10 +08:00
|
|
|
struct workqueue_struct *scrub_workers;
|
|
|
|
struct workqueue_struct *scrub_wr_completion_workers;
|
|
|
|
struct workqueue_struct *scrub_parity_workers;
|
2021-08-17 17:38:51 +08:00
|
|
|
struct btrfs_subpage_info *subpage_info;
|
2011-03-08 21:14:00 +08:00
|
|
|
|
2019-12-14 08:22:14 +08:00
|
|
|
struct btrfs_discard_ctl discard_ctl;
|
|
|
|
|
2011-11-09 20:44:05 +08:00
|
|
|
#ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
|
|
|
|
u32 check_integrity_print_mask;
|
|
|
|
#endif
|
2011-09-13 18:56:09 +08:00
|
|
|
/* is qgroup tracking in a consistent state? */
|
|
|
|
u64 qgroup_flags;
|
|
|
|
|
|
|
|
/* holds configuration and tracking. Protected by qgroup_lock */
|
|
|
|
struct rb_root qgroup_tree;
|
|
|
|
spinlock_t qgroup_lock;
|
|
|
|
|
2013-05-06 19:03:27 +08:00
|
|
|
/*
|
|
|
|
* used to avoid frequently calling ulist_alloc()/ulist_free()
|
|
|
|
* when doing qgroup accounting, it must be protected by qgroup_lock.
|
|
|
|
*/
|
|
|
|
struct ulist *qgroup_ulist;
|
|
|
|
|
btrfs: fix lockdep splat when enabling and disabling qgroups
When running test case btrfs/017 from fstests, lockdep reported the
following splat:
[ 1297.067385] ======================================================
[ 1297.067708] WARNING: possible circular locking dependency detected
[ 1297.068022] 5.10.0-rc4-btrfs-next-73 #1 Not tainted
[ 1297.068322] ------------------------------------------------------
[ 1297.068629] btrfs/189080 is trying to acquire lock:
[ 1297.068929] ffff9f2725731690 (sb_internal#2){.+.+}-{0:0}, at: btrfs_quota_enable+0xaf/0xa70 [btrfs]
[ 1297.069274]
but task is already holding lock:
[ 1297.069868] ffff9f2702b61a08 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}, at: btrfs_quota_enable+0x3b/0xa70 [btrfs]
[ 1297.070219]
which lock already depends on the new lock.
[ 1297.071131]
the existing dependency chain (in reverse order) is:
[ 1297.071721]
-> #1 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}:
[ 1297.072375] lock_acquire+0xd8/0x490
[ 1297.072710] __mutex_lock+0xa3/0xb30
[ 1297.073061] btrfs_qgroup_inherit+0x59/0x6a0 [btrfs]
[ 1297.073421] create_subvol+0x194/0x990 [btrfs]
[ 1297.073780] btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
[ 1297.074133] __btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
[ 1297.074498] btrfs_ioctl_snap_create+0x58/0x80 [btrfs]
[ 1297.074872] btrfs_ioctl+0x1a90/0x36f0 [btrfs]
[ 1297.075245] __x64_sys_ioctl+0x83/0xb0
[ 1297.075617] do_syscall_64+0x33/0x80
[ 1297.075993] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1297.076380]
-> #0 (sb_internal#2){.+.+}-{0:0}:
[ 1297.077166] check_prev_add+0x91/0xc60
[ 1297.077572] __lock_acquire+0x1740/0x3110
[ 1297.077984] lock_acquire+0xd8/0x490
[ 1297.078411] start_transaction+0x3c5/0x760 [btrfs]
[ 1297.078853] btrfs_quota_enable+0xaf/0xa70 [btrfs]
[ 1297.079323] btrfs_ioctl+0x2c60/0x36f0 [btrfs]
[ 1297.079789] __x64_sys_ioctl+0x83/0xb0
[ 1297.080232] do_syscall_64+0x33/0x80
[ 1297.080680] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1297.081139]
other info that might help us debug this:
[ 1297.082536] Possible unsafe locking scenario:
[ 1297.083510] CPU0 CPU1
[ 1297.084005] ---- ----
[ 1297.084500] lock(&fs_info->qgroup_ioctl_lock);
[ 1297.084994] lock(sb_internal#2);
[ 1297.085485] lock(&fs_info->qgroup_ioctl_lock);
[ 1297.085974] lock(sb_internal#2);
[ 1297.086454]
*** DEADLOCK ***
[ 1297.087880] 3 locks held by btrfs/189080:
[ 1297.088324] #0: ffff9f2725731470 (sb_writers#14){.+.+}-{0:0}, at: btrfs_ioctl+0xa73/0x36f0 [btrfs]
[ 1297.088799] #1: ffff9f2702b60cc0 (&fs_info->subvol_sem){++++}-{3:3}, at: btrfs_ioctl+0x1f4d/0x36f0 [btrfs]
[ 1297.089284] #2: ffff9f2702b61a08 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}, at: btrfs_quota_enable+0x3b/0xa70 [btrfs]
[ 1297.089771]
stack backtrace:
[ 1297.090662] CPU: 5 PID: 189080 Comm: btrfs Not tainted 5.10.0-rc4-btrfs-next-73 #1
[ 1297.091132] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[ 1297.092123] Call Trace:
[ 1297.092629] dump_stack+0x8d/0xb5
[ 1297.093115] check_noncircular+0xff/0x110
[ 1297.093596] check_prev_add+0x91/0xc60
[ 1297.094076] ? kvm_clock_read+0x14/0x30
[ 1297.094553] ? kvm_sched_clock_read+0x5/0x10
[ 1297.095029] __lock_acquire+0x1740/0x3110
[ 1297.095510] lock_acquire+0xd8/0x490
[ 1297.095993] ? btrfs_quota_enable+0xaf/0xa70 [btrfs]
[ 1297.096476] start_transaction+0x3c5/0x760 [btrfs]
[ 1297.096962] ? btrfs_quota_enable+0xaf/0xa70 [btrfs]
[ 1297.097451] btrfs_quota_enable+0xaf/0xa70 [btrfs]
[ 1297.097941] ? btrfs_ioctl+0x1f4d/0x36f0 [btrfs]
[ 1297.098429] btrfs_ioctl+0x2c60/0x36f0 [btrfs]
[ 1297.098904] ? do_user_addr_fault+0x20c/0x430
[ 1297.099382] ? kvm_clock_read+0x14/0x30
[ 1297.099854] ? kvm_sched_clock_read+0x5/0x10
[ 1297.100328] ? sched_clock+0x5/0x10
[ 1297.100801] ? sched_clock_cpu+0x12/0x180
[ 1297.101272] ? __x64_sys_ioctl+0x83/0xb0
[ 1297.101739] __x64_sys_ioctl+0x83/0xb0
[ 1297.102207] do_syscall_64+0x33/0x80
[ 1297.102673] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1297.103148] RIP: 0033:0x7f773ff65d87
This is because during the quota enable ioctl we lock first the mutex
qgroup_ioctl_lock and then start a transaction, and starting a transaction
acquires a fs freeze semaphore (at the VFS level). However, every other
code path, except for the quota disable ioctl path, we do the opposite:
we start a transaction and then lock the mutex.
So fix this by making the quota enable and disable paths to start the
transaction without having the mutex locked, and then, after starting the
transaction, lock the mutex and check if some other task already enabled
or disabled the quotas, bailing with success if that was the case.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-24 02:31:02 +08:00
|
|
|
/*
|
|
|
|
* Protect user change for quota operations. If a transaction is needed,
|
|
|
|
* it must be started before locking this lock.
|
|
|
|
*/
|
2013-04-07 18:50:16 +08:00
|
|
|
struct mutex qgroup_ioctl_lock;
|
|
|
|
|
2011-09-13 18:56:09 +08:00
|
|
|
/* list of dirty qgroups to be written at next commit */
|
|
|
|
struct list_head dirty_qgroups;
|
|
|
|
|
2015-04-17 10:23:16 +08:00
|
|
|
/* used by qgroup for an efficient tree traversal */
|
2011-09-13 18:56:09 +08:00
|
|
|
u64 qgroup_seq;
|
2011-11-09 20:44:05 +08:00
|
|
|
|
2013-04-26 00:04:51 +08:00
|
|
|
/* qgroup rescan items */
|
|
|
|
struct mutex qgroup_rescan_lock; /* protects the progress item */
|
|
|
|
struct btrfs_key qgroup_rescan_progress;
|
2014-02-28 10:46:19 +08:00
|
|
|
struct btrfs_workqueue *qgroup_rescan_workers;
|
2013-05-07 03:14:17 +08:00
|
|
|
struct completion qgroup_rescan_completion;
|
Btrfs: fix qgroup rescan resume on mount
When called during mount, we cannot start the rescan worker thread until
open_ctree is done. This commit restuctures the qgroup rescan internals to
enable a clean deferral of the rescan resume operation.
First of all, the struct qgroup_rescan is removed, saving us a malloc and
some initialization synchronizations problems. Its only element (the worker
struct) now lives within fs_info just as the rest of the rescan code.
Then setting up a rescan worker is split into several reusable stages.
Currently we have three different rescan startup scenarios:
(A) rescan ioctl
(B) rescan resume by mount
(C) rescan by quota enable
Each case needs its own combination of the four following steps:
(1) set the progress [A, C: zero; B: state of umount]
(2) commit the transaction [A]
(3) set the counters [A, C: zero; B: state of umount]
(4) start worker [A, B, C]
qgroup_rescan_init does step (1). There's no extra function added to commit
a transaction, we've got that already. qgroup_rescan_zero_tracking does
step (3). Step (4) is nothing more than a call to the generic
btrfs_queue_worker.
We also get rid of a double check for the rescan progress during
btrfs_qgroup_account_ref, which is no longer required due to having step 2
from the list above.
As a side effect, this commit prepares to move the rescan start code from
btrfs_run_qgroups (which is run during commit) to a less time critical
section.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-28 23:47:24 +08:00
|
|
|
struct btrfs_work qgroup_rescan_work;
|
2016-08-16 00:10:33 +08:00
|
|
|
bool qgroup_rescan_running; /* protected by qgroup_rescan_lock */
|
btrfs: skip subtree scan if it's too high to avoid low stall in btrfs_commit_transaction()
Btrfs qgroup has a long history of bringing performance penalty in
btrfs_commit_transaction().
Although we tried our best to migrate such impact, there is still an
unsolved call site, btrfs_drop_snapshot().
This function will find the highest shared tree block and modify its
extent ownership to do a subvolume/snapshot dropping.
Such change will affect the whole subtree, and cause tons of qgroup
dirty extents and stall btrfs_commit_transaction().
To avoid such problem, here we introduce a new sysfs interface,
/sys/fs/btrfs/<uuid>/qgroups/drop_subptree_threshold, to determine at
whether and at which level we should skip qgroup accounting for subtree
dropping.
The default value is BTRFS_MAX_LEVEL, thus every subtree drop will go
through qgroup accounting, to ensure qgroup numbers are kept as
consistent as possible.
While for performance sensitive cases, add a way to change the values to
more reasonable values like 3, to make any subtree, which is at or higher
than level 3, to mark qgroup inconsistent and skip the accounting.
The cost is obvious, the qgroup number is no longer consistent, but at
least performance is more reasonable, and users have the control.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-24 09:14:09 +08:00
|
|
|
u8 qgroup_drop_subtree_thres;
|
2013-04-26 00:04:51 +08:00
|
|
|
|
2011-01-06 19:30:25 +08:00
|
|
|
/* filesystem state */
|
2013-01-29 18:14:48 +08:00
|
|
|
unsigned long fs_state;
|
btrfs: implement delayed inode items operation
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
root's radix tree, and letting btrfs inodes go.
Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which is spotted by
Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which is reported by
Itaru Kitayama.
Changelog V3 -> V4:
- Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
inode in time.
Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
balance, which is reported by Tsutomu Itoh.
- Modify the patch address David Sterba's comment.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason
Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
which is created for every directory and file, and used to manage the
delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.
Compare with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. the reason is that btrfs must do a lot of b+ tree insertions,
such as inode item, directory name item, directory name index and so on.
If we can do some delayed b+ tree insertion or deletion, we can improve the
performance, so we made this patch which implemented delayed directory name
index insertion/deletion and delayed inode update.
Implementation:
- introduce a delayed root object into the filesystem, that use two lists to
manage the delayed nodes which are created for every file/directory.
One is used to manage all the delayed nodes that have delayed items. And the
other is used to manage the delayed nodes which is waiting to be dealt with
by the work thread.
- Every delayed node has two rb-tree, one is used to manage the directory name
index which is going to be inserted into b+ tree, and the other is used to
manage the directory name index which is going to be deleted from b+ tree.
- introduce a worker to deal with the delayed operation. This worker is used
to deal with the works of the delayed directory name index items insertion
and deletion and the delayed inode update.
When the delayed items is beyond the lower limit, we create works for some
delayed nodes and insert them into the work queue of the worker, and then
go back.
When the delayed items is beyond the upper bound, we create works for all
the delayed nodes that haven't been dealt with, and insert them into the work
queue of the worker, and then wait for that the untreated items is below some
threshold value.
- When we want to insert a directory name index into b+ tree, we just add the
information into the delayed inserting rb-tree.
And then we check the number of the delayed items and do delayed items
balance. (The balance policy is above.)
- When we want to delete a directory name index from the b+ tree, we search it
in the inserting rb-tree at first. If we look it up, just drop it. If not,
add the key of it into the delayed deleting rb-tree.
Similar to the delayed inserting rb-tree, we also check the number of the
delayed items and do delayed items balance.
(The same to inserting manipulation)
- When we want to update the metadata of some inode, we cached the data of the
inode into the delayed node. the worker will flush it into the b+ tree after
dealing with the delayed insertion and deletion.
- We will move the delayed node to the tail of the list after we access the
delayed node, By this way, we can cache more delayed items and merge more
inode updates.
- If we want to commit transaction, we will deal with all the delayed node.
- the delayed node will be freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
and the delayed inode update.
I did a quick test by the benchmark tool[1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.
Before applying this patch:
Create files:
Total files: 50000
Total time: 1.096108
Average time: 0.000022
Delete files:
Total files: 50000
Total time: 1.510403
Average time: 0.000030
After applying this patch:
Create files:
Total files: 50000
Total time: 0.932899
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.215732
Average time: 0.000024
[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
Many thanks for Kitayama-san's help!
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-22 18:12:22 +08:00
|
|
|
|
|
|
|
struct btrfs_delayed_root *delayed_root;
|
2011-11-04 03:17:42 +08:00
|
|
|
|
2022-07-15 19:59:31 +08:00
|
|
|
/* Extent buffer radix tree */
|
2013-12-17 02:24:27 +08:00
|
|
|
spinlock_t buffer_lock;
|
2020-10-21 14:25:05 +08:00
|
|
|
/* Entries are eb->start / sectorsize */
|
2022-07-15 19:59:31 +08:00
|
|
|
struct radix_tree_root buffer_radix;
|
2013-12-17 02:24:27 +08:00
|
|
|
|
2011-11-04 03:17:42 +08:00
|
|
|
/* next backup root to be overwritten */
|
|
|
|
int backup_root_index;
|
2012-08-02 00:56:49 +08:00
|
|
|
|
2012-11-06 00:26:40 +08:00
|
|
|
/* device replace state */
|
|
|
|
struct btrfs_dev_replace dev_replace;
|
2012-11-06 00:54:08 +08:00
|
|
|
|
2013-08-15 23:11:21 +08:00
|
|
|
struct semaphore uuid_tree_rescan_sem;
|
Btrfs: reclaim the reserved metadata space at background
Before applying this patch, the task had to reclaim the metadata space
by itself if the metadata space was not enough. And When the task started
the space reclamation, all the other tasks which wanted to reserve the
metadata space were blocked. At some cases, they would be blocked for
a long time, it made the performance fluctuate wildly.
So we introduce the background metadata space reclamation, when the space
is about to be exhausted, we insert a reclaim work into the workqueue, the
worker of the workqueue helps us to reclaim the reserved space at the
background. By this way, the tasks needn't reclaim the space by themselves at
most cases, and even if the tasks have to reclaim the space or are blocked
for the space reclamation, they will get enough space more quickly.
Here is my test result(Tested by compilebench):
Memory: 2GB
CPU: 2Cores * 1CPU
Partition: 40GB(SSD)
Test command:
# compilebench -D <mnt> -m
Without this patch:
intial create total runs 30 avg 54.36 MB/s (user 0.52s sys 2.44s)
compile total runs 30 avg 123.72 MB/s (user 0.13s sys 1.17s)
read compiled tree total runs 3 avg 81.15 MB/s (user 0.74s sys 4.89s)
delete compiled tree total runs 30 avg 5.32 seconds (user 0.35s sys 4.37s)
With this patch:
intial create total runs 30 avg 59.80 MB/s (user 0.52s sys 2.53s)
compile total runs 30 avg 151.44 MB/s (user 0.13s sys 1.11s)
read compiled tree total runs 3 avg 83.25 MB/s (user 0.76s sys 4.91s)
delete compiled tree total runs 30 avg 5.29 seconds (user 0.34s sys 4.34s)
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-05-14 08:29:04 +08:00
|
|
|
|
|
|
|
/* Used to reclaim the metadata space in the background. */
|
|
|
|
struct work_struct async_reclaim_work;
|
2020-07-21 22:22:33 +08:00
|
|
|
struct work_struct async_data_reclaim_work;
|
btrfs: improve preemptive background space flushing
Currently if we ever have to flush space because we do not have enough
we allocate a ticket and attach it to the space_info, and then
systematically flush things in the filesystem that hold space
reservations until our space is reclaimed.
However this has a latency cost, we must go to sleep and wait for the
flushing to make progress before we are woken up and allowed to continue
doing our work.
In order to address that we used to kick off the async worker to flush
space preemptively, so that we could be reclaiming space hopefully
before any tasks needed to stop and wait for space to reclaim.
When I introduced the ticketed ENOSPC stuff this broke slightly in the
fact that we were using tickets to indicate if we were done flushing.
No tickets, no more flushing. However this meant that we essentially
never preemptively flushed. This caused a write performance regression
that Nikolay noticed in an unrelated patch that removed the committing
of the transaction during btrfs_end_transaction.
The behavior that happened pre that patch was btrfs_end_transaction()
would see that we were low on space, and it would commit the
transaction. This was bad because in this particular case you could end
up with thousands and thousands of transactions being committed during
the 5 minute reproducer. With the patch to remove this behavior we got
much more sane transaction commits, but we ended up slower because we
would write for a while, flush, write for a while, flush again.
To address this we need to reinstate a preemptive flushing mechanism.
However it is distinctly different from our ticketing flushing in that
it doesn't have tickets to base it's decisions on. Instead of bolting
this logic into our existing flushing work, add another worker to handle
this preemptive flushing. Here we will attempt to be slightly
intelligent about the things that we flushing, attempting to balance
between whichever pool is taking up the most space.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-09 21:28:22 +08:00
|
|
|
struct work_struct preempt_reclaim_work;
|
2014-09-18 23:20:02 +08:00
|
|
|
|
2021-04-19 15:41:02 +08:00
|
|
|
/* Reclaim partially filled block groups in the background */
|
|
|
|
struct work_struct reclaim_bgs_work;
|
|
|
|
struct list_head reclaim_bgs;
|
|
|
|
int bg_reclaim_threshold;
|
|
|
|
|
2014-09-18 23:20:02 +08:00
|
|
|
spinlock_t unused_bgs_lock;
|
|
|
|
struct list_head unused_bgs;
|
2015-01-30 03:18:25 +08:00
|
|
|
struct mutex unused_bg_unpin_mutex;
|
2021-04-19 15:41:01 +08:00
|
|
|
/* Protect block groups that are going to be deleted */
|
|
|
|
struct mutex reclaim_bgs_lock;
|
2014-09-23 13:40:08 +08:00
|
|
|
|
2016-06-15 21:22:56 +08:00
|
|
|
/* Cached block sizes */
|
|
|
|
u32 nodesize;
|
|
|
|
u32 sectorsize;
|
2020-07-02 02:45:04 +08:00
|
|
|
/* ilog2 of sectorsize, use to avoid 64bit division */
|
|
|
|
u32 sectorsize_bits;
|
2020-07-02 17:10:18 +08:00
|
|
|
u32 csum_size;
|
2020-07-02 16:54:11 +08:00
|
|
|
u32 csums_per_leaf;
|
2016-06-15 21:22:56 +08:00
|
|
|
u32 stripesize;
|
2017-09-30 03:43:50 +08:00
|
|
|
|
btrfs: replace BTRFS_MAX_EXTENT_SIZE with fs_info->max_extent_size
On zoned filesystem, data write out is limited by max_zone_append_size,
and a large ordered extent is split according the size of a bio. OTOH,
the number of extents to be written is calculated using
BTRFS_MAX_EXTENT_SIZE, and that estimated number is used to reserve the
metadata bytes to update and/or create the metadata items.
The metadata reservation is done at e.g, btrfs_buffered_write() and then
released according to the estimation changes. Thus, if the number of extent
increases massively, the reserved metadata can run out.
The increase of the number of extents easily occurs on zoned filesystem
if BTRFS_MAX_EXTENT_SIZE > max_zone_append_size. And, it causes the
following warning on a small RAM environment with disabling metadata
over-commit (in the following patch).
[75721.498492] ------------[ cut here ]------------
[75721.505624] BTRFS: block rsv 1 returned -28
[75721.512230] WARNING: CPU: 24 PID: 2327559 at fs/btrfs/block-rsv.c:537 btrfs_use_block_rsv+0x560/0x760 [btrfs]
[75721.581854] CPU: 24 PID: 2327559 Comm: kworker/u64:10 Kdump: loaded Tainted: G W 5.18.0-rc2-BTRFS-ZNS+ #109
[75721.597200] Hardware name: Supermicro Super Server/H12SSL-NT, BIOS 2.0 02/22/2021
[75721.607310] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
[75721.616209] RIP: 0010:btrfs_use_block_rsv+0x560/0x760 [btrfs]
[75721.646649] RSP: 0018:ffffc9000fbdf3e0 EFLAGS: 00010286
[75721.654126] RAX: 0000000000000000 RBX: 0000000000004000 RCX: 0000000000000000
[75721.663524] RDX: 0000000000000004 RSI: 0000000000000008 RDI: fffff52001f7be6e
[75721.672921] RBP: ffffc9000fbdf420 R08: 0000000000000001 R09: ffff889f8d1fc6c7
[75721.682493] R10: ffffed13f1a3f8d8 R11: 0000000000000001 R12: ffff88980a3c0e28
[75721.692284] R13: ffff889b66590000 R14: ffff88980a3c0e40 R15: ffff88980a3c0e8a
[75721.701878] FS: 0000000000000000(0000) GS:ffff889f8d000000(0000) knlGS:0000000000000000
[75721.712601] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[75721.720726] CR2: 000055d12e05c018 CR3: 0000800193594000 CR4: 0000000000350ee0
[75721.730499] Call Trace:
[75721.735166] <TASK>
[75721.739886] btrfs_alloc_tree_block+0x1e1/0x1100 [btrfs]
[75721.747545] ? btrfs_alloc_logged_file_extent+0x550/0x550 [btrfs]
[75721.756145] ? btrfs_get_32+0xea/0x2d0 [btrfs]
[75721.762852] ? btrfs_get_32+0xea/0x2d0 [btrfs]
[75721.769520] ? push_leaf_left+0x420/0x620 [btrfs]
[75721.776431] ? memcpy+0x4e/0x60
[75721.781931] split_leaf+0x433/0x12d0 [btrfs]
[75721.788392] ? btrfs_get_token_32+0x580/0x580 [btrfs]
[75721.795636] ? push_for_double_split.isra.0+0x420/0x420 [btrfs]
[75721.803759] ? leaf_space_used+0x15d/0x1a0 [btrfs]
[75721.811156] btrfs_search_slot+0x1bc3/0x2790 [btrfs]
[75721.818300] ? lock_downgrade+0x7c0/0x7c0
[75721.824411] ? free_extent_buffer.part.0+0x107/0x200 [btrfs]
[75721.832456] ? split_leaf+0x12d0/0x12d0 [btrfs]
[75721.839149] ? free_extent_buffer.part.0+0x14f/0x200 [btrfs]
[75721.846945] ? free_extent_buffer+0x13/0x20 [btrfs]
[75721.853960] ? btrfs_release_path+0x4b/0x190 [btrfs]
[75721.861429] btrfs_csum_file_blocks+0x85c/0x1500 [btrfs]
[75721.869313] ? rcu_read_lock_sched_held+0x16/0x80
[75721.876085] ? lock_release+0x552/0xf80
[75721.881957] ? btrfs_del_csums+0x8c0/0x8c0 [btrfs]
[75721.888886] ? __kasan_check_write+0x14/0x20
[75721.895152] ? do_raw_read_unlock+0x44/0x80
[75721.901323] ? _raw_write_lock_irq+0x60/0x80
[75721.907983] ? btrfs_global_root+0xb9/0xe0 [btrfs]
[75721.915166] ? btrfs_csum_root+0x12b/0x180 [btrfs]
[75721.921918] ? btrfs_get_global_root+0x820/0x820 [btrfs]
[75721.929166] ? _raw_write_unlock+0x23/0x40
[75721.935116] ? unpin_extent_cache+0x1e3/0x390 [btrfs]
[75721.942041] btrfs_finish_ordered_io.isra.0+0xa0c/0x1dc0 [btrfs]
[75721.949906] ? try_to_wake_up+0x30/0x14a0
[75721.955700] ? btrfs_unlink_subvol+0xda0/0xda0 [btrfs]
[75721.962661] ? rcu_read_lock_sched_held+0x16/0x80
[75721.969111] ? lock_acquire+0x41b/0x4c0
[75721.974982] finish_ordered_fn+0x15/0x20 [btrfs]
[75721.981639] btrfs_work_helper+0x1af/0xa80 [btrfs]
[75721.988184] ? _raw_spin_unlock_irq+0x28/0x50
[75721.994643] process_one_work+0x815/0x1460
[75722.000444] ? pwq_dec_nr_in_flight+0x250/0x250
[75722.006643] ? do_raw_spin_trylock+0xbb/0x190
[75722.013086] worker_thread+0x59a/0xeb0
[75722.018511] kthread+0x2ac/0x360
[75722.023428] ? process_one_work+0x1460/0x1460
[75722.029431] ? kthread_complete_and_exit+0x30/0x30
[75722.036044] ret_from_fork+0x22/0x30
[75722.041255] </TASK>
[75722.045047] irq event stamp: 0
[75722.049703] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[75722.057610] hardirqs last disabled at (0): [<ffffffff8118a94a>] copy_process+0x1c1a/0x66b0
[75722.067533] softirqs last enabled at (0): [<ffffffff8118a989>] copy_process+0x1c59/0x66b0
[75722.077423] softirqs last disabled at (0): [<0000000000000000>] 0x0
[75722.085335] ---[ end trace 0000000000000000 ]---
To fix the estimation, we need to introduce fs_info->max_extent_size to
replace BTRFS_MAX_EXTENT_SIZE, which allow setting the different size for
regular vs zoned filesystem.
Set fs_info->max_extent_size to BTRFS_MAX_EXTENT_SIZE by default. On zoned
filesystem, it is set to fs_info->max_zone_append_size.
CC: stable@vger.kernel.org # 5.12+
Fixes: d8e3fb106f39 ("btrfs: zoned: use ZONE_APPEND write for zoned mode")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-09 07:18:40 +08:00
|
|
|
/*
|
|
|
|
* Maximum size of an extent. BTRFS_MAX_EXTENT_SIZE on regular
|
|
|
|
* filesystem, on zoned it depends on the device constraints.
|
|
|
|
*/
|
|
|
|
u64 max_extent_size;
|
|
|
|
|
Btrfs: prevent ioctls from interfering with a swap file
A later patch will implement swap file support for Btrfs, but before we
do that, we need to make sure that the various Btrfs ioctls cannot
change a swap file.
When a swap file is active, we must make sure that the extents of the
file are not moved and that they don't become shared. That means that
the following are not safe:
- chattr +c (enable compression)
- reflink
- dedupe
- snapshot
- defrag
Don't allow those to happen on an active swap file.
Additionally, balance, resize, device remove, and device replace are
also unsafe if they affect an active swapfile. Add a red-black tree of
block groups and devices which contain an active swapfile. Relocation
checks each block group against this tree and skips it or errors out for
balance or resize, respectively. Device remove and device replace check
the tree for the device they will operate on.
Note that we don't have to worry about chattr -C (disable nocow), which
we ignore for non-empty files, because an active swapfile must be
non-empty and can't be truncated. We also don't have to worry about
autodefrag because it's only done on COW files. Truncate and fallocate
are already taken care of by the generic code. Device add doesn't do
relocation so it's not an issue, either.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-04 01:28:12 +08:00
|
|
|
/* Block groups and devices containing active swapfiles. */
|
|
|
|
spinlock_t swapfile_pins_lock;
|
|
|
|
struct rb_root swapfile_pins;
|
|
|
|
|
2019-06-03 22:58:56 +08:00
|
|
|
struct crypto_shash *csum_shash;
|
|
|
|
|
2021-05-14 23:42:30 +08:00
|
|
|
/* Type of exclusive operation running, protected by super_lock */
|
|
|
|
enum btrfs_exclusive_operation exclusive_operation;
|
2020-08-25 23:02:32 +08:00
|
|
|
|
2020-11-10 19:26:08 +08:00
|
|
|
/*
|
|
|
|
* Zone size > 0 when in ZONED mode, otherwise it's used for a check
|
|
|
|
* if the mode is enabled
|
|
|
|
*/
|
2022-03-25 00:52:09 +08:00
|
|
|
u64 zone_size;
|
2020-11-10 19:26:08 +08:00
|
|
|
|
2022-07-09 07:18:39 +08:00
|
|
|
/* Max size to emit ZONE_APPEND write command */
|
|
|
|
u64 max_zone_append_size;
|
2021-02-04 18:22:08 +08:00
|
|
|
struct mutex zoned_meta_io_lock;
|
2021-02-04 18:22:18 +08:00
|
|
|
spinlock_t treelog_bg_lock;
|
|
|
|
u64 treelog_bg;
|
2020-11-10 19:26:09 +08:00
|
|
|
|
2021-09-09 00:19:26 +08:00
|
|
|
/*
|
|
|
|
* Start of the dedicated data relocation block group, protected by
|
|
|
|
* relocation_bg_lock.
|
|
|
|
*/
|
|
|
|
spinlock_t relocation_bg_lock;
|
|
|
|
u64 data_reloc_bg;
|
2022-04-18 15:15:03 +08:00
|
|
|
struct mutex zoned_data_reloc_io_lock;
|
2021-09-09 00:19:26 +08:00
|
|
|
|
2021-12-16 04:40:08 +08:00
|
|
|
u64 nr_global_roots;
|
|
|
|
|
2021-08-19 20:19:17 +08:00
|
|
|
spinlock_t zone_active_bgs_lock;
|
|
|
|
struct list_head zone_active_bgs;
|
|
|
|
|
2022-06-15 06:22:32 +08:00
|
|
|
/* Updates are not protected by any lock */
|
|
|
|
struct btrfs_commit_stats commit_stats;
|
|
|
|
|
btrfs: speedup checking for extent sharedness during fiemap
One of the most expensive tasks performed during fiemap is to check if
an extent is shared. This task has two major steps:
1) Check if the data extent is shared. This implies checking the extent
item in the extent tree, checking delayed references, etc. If we
find the data extent is directly shared, we terminate immediately;
2) If the data extent is not directly shared (its extent item has a
refcount of 1), then it may be shared if we have snapshots that share
subtrees of the inode's subvolume b+tree. So we check if the leaf
containing the file extent item is shared, then its parent node, then
the parent node of the parent node, etc, until we reach the root node
or we find one of them is shared - in which case we stop immediately.
During fiemap we process the extents of a file from left to right, from
file offset 0 to EOF. This means that we iterate b+tree leaves from left
to right, and has the implication that we keep repeating that second step
above several times for the same b+tree path of the inode's subvolume
b+tree.
For example, if we have two file extent items in leaf X, and the path to
leaf X is A -> B -> C -> X, then when we try to determine if the data
extent referenced by the first extent item is shared, we check if the data
extent is shared - if it's not, then we check if leaf X is shared, if not,
then we check if node C is shared, if not, then check if node B is shared,
if not than check if node A is shared. When we move to the next file
extent item, after determining the data extent is not shared, we repeat
the checks for X, C, B and A - doing all the expensive searches in the
extent tree, delayed refs, etc. If we have thousands of tile extents, then
we keep repeating the sharedness checks for the same paths over and over.
On a file that has no shared extents or only a small portion, it's easy
to see that this scales terribly with the number of extents in the file
and the sizes of the extent and subvolume b+trees.
This change eliminates the repeated sharedness check on extent buffers
by caching the results of the last path used. The results can be used as
long as no snapshots were created since they were cached (for not shared
extent buffers) or no roots were dropped since they were cached (for
shared extent buffers). This greatly reduces the time spent by fiemap for
files with thousands of extents and/or large extent and subvolume b+trees.
Example performance test:
$ cat fiemap-perf-test.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
mkfs.btrfs -f $DEV
mount -o compress=lzo $DEV $MNT
# 40G gives 327680 128K file extents (due to compression).
xfs_io -f -c "pwrite -S 0xab -b 1M 0 40G" $MNT/foobar
umount $MNT
mount -o compress=lzo $DEV $MNT
start=$(date +%s%N)
filefrag $MNT/foobar
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "fiemap took $dur milliseconds (metadata not cached)"
start=$(date +%s%N)
filefrag $MNT/foobar
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "fiemap took $dur milliseconds (metadata cached)"
umount $MNT
Before this patch:
$ ./fiemap-perf-test.sh
(...)
/mnt/sdi/foobar: 327680 extents found
fiemap took 3597 milliseconds (metadata not cached)
/mnt/sdi/foobar: 327680 extents found
fiemap took 2107 milliseconds (metadata cached)
After this patch:
$ ./fiemap-perf-test.sh
(...)
/mnt/sdi/foobar: 327680 extents found
fiemap took 1646 milliseconds (metadata not cached)
/mnt/sdi/foobar: 327680 extents found
fiemap took 698 milliseconds (metadata cached)
That's about 2.2x faster when no metadata is cached, and about 3x faster
when all metadata is cached. On a real filesystem with many other files,
data, directories, etc, the b+trees will be 2 or 3 levels higher,
therefore this optimization will have a higher impact.
Several reports of a slow fiemap show up often, the two Link tags below
refer to two recent reports of such slowness. This patch, together with
the next ones in the series, is meant to address that.
Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-01 21:18:28 +08:00
|
|
|
/*
|
|
|
|
* Last generation where we dropped a non-relocation root.
|
|
|
|
* Use btrfs_set_last_root_drop_gen() and btrfs_get_last_root_drop_gen()
|
|
|
|
* to change it and to read it, respectively.
|
|
|
|
*/
|
|
|
|
u64 last_root_drop_gen;
|
|
|
|
|
2022-07-26 06:11:48 +08:00
|
|
|
/*
|
|
|
|
* Annotations for transaction events (structures are empty when
|
|
|
|
* compiled without lockdep).
|
|
|
|
*/
|
|
|
|
struct lockdep_map btrfs_trans_num_writers_map;
|
2022-07-26 06:11:50 +08:00
|
|
|
struct lockdep_map btrfs_trans_num_extwriters_map;
|
2022-07-26 06:11:52 +08:00
|
|
|
struct lockdep_map btrfs_state_change_map[4];
|
2022-07-26 06:11:54 +08:00
|
|
|
struct lockdep_map btrfs_trans_pending_ordered_map;
|
2022-07-26 06:11:59 +08:00
|
|
|
struct lockdep_map btrfs_ordered_extent_map;
|
2022-07-26 06:11:48 +08:00
|
|
|
|
2017-09-30 03:43:50 +08:00
|
|
|
#ifdef CONFIG_BTRFS_FS_REF_VERIFY
|
|
|
|
spinlock_t ref_verify_lock;
|
|
|
|
struct rb_root block_tree;
|
|
|
|
#endif
|
2019-12-14 08:22:18 +08:00
|
|
|
|
|
|
|
#ifdef CONFIG_BTRFS_DEBUG
|
|
|
|
struct kobject *debug_kobj;
|
2020-01-24 22:33:00 +08:00
|
|
|
struct list_head allocated_roots;
|
2020-02-15 05:11:40 +08:00
|
|
|
|
|
|
|
spinlock_t eb_leak_lock;
|
|
|
|
struct list_head allocated_ebs;
|
2019-12-14 08:22:18 +08:00
|
|
|
#endif
|
2007-11-17 03:57:08 +08:00
|
|
|
};
|
2008-03-25 03:01:56 +08:00
|
|
|
|
btrfs: speedup checking for extent sharedness during fiemap
One of the most expensive tasks performed during fiemap is to check if
an extent is shared. This task has two major steps:
1) Check if the data extent is shared. This implies checking the extent
item in the extent tree, checking delayed references, etc. If we
find the data extent is directly shared, we terminate immediately;
2) If the data extent is not directly shared (its extent item has a
refcount of 1), then it may be shared if we have snapshots that share
subtrees of the inode's subvolume b+tree. So we check if the leaf
containing the file extent item is shared, then its parent node, then
the parent node of the parent node, etc, until we reach the root node
or we find one of them is shared - in which case we stop immediately.
During fiemap we process the extents of a file from left to right, from
file offset 0 to EOF. This means that we iterate b+tree leaves from left
to right, and has the implication that we keep repeating that second step
above several times for the same b+tree path of the inode's subvolume
b+tree.
For example, if we have two file extent items in leaf X, and the path to
leaf X is A -> B -> C -> X, then when we try to determine if the data
extent referenced by the first extent item is shared, we check if the data
extent is shared - if it's not, then we check if leaf X is shared, if not,
then we check if node C is shared, if not, then check if node B is shared,
if not than check if node A is shared. When we move to the next file
extent item, after determining the data extent is not shared, we repeat
the checks for X, C, B and A - doing all the expensive searches in the
extent tree, delayed refs, etc. If we have thousands of tile extents, then
we keep repeating the sharedness checks for the same paths over and over.
On a file that has no shared extents or only a small portion, it's easy
to see that this scales terribly with the number of extents in the file
and the sizes of the extent and subvolume b+trees.
This change eliminates the repeated sharedness check on extent buffers
by caching the results of the last path used. The results can be used as
long as no snapshots were created since they were cached (for not shared
extent buffers) or no roots were dropped since they were cached (for
shared extent buffers). This greatly reduces the time spent by fiemap for
files with thousands of extents and/or large extent and subvolume b+trees.
Example performance test:
$ cat fiemap-perf-test.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
mkfs.btrfs -f $DEV
mount -o compress=lzo $DEV $MNT
# 40G gives 327680 128K file extents (due to compression).
xfs_io -f -c "pwrite -S 0xab -b 1M 0 40G" $MNT/foobar
umount $MNT
mount -o compress=lzo $DEV $MNT
start=$(date +%s%N)
filefrag $MNT/foobar
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "fiemap took $dur milliseconds (metadata not cached)"
start=$(date +%s%N)
filefrag $MNT/foobar
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "fiemap took $dur milliseconds (metadata cached)"
umount $MNT
Before this patch:
$ ./fiemap-perf-test.sh
(...)
/mnt/sdi/foobar: 327680 extents found
fiemap took 3597 milliseconds (metadata not cached)
/mnt/sdi/foobar: 327680 extents found
fiemap took 2107 milliseconds (metadata cached)
After this patch:
$ ./fiemap-perf-test.sh
(...)
/mnt/sdi/foobar: 327680 extents found
fiemap took 1646 milliseconds (metadata not cached)
/mnt/sdi/foobar: 327680 extents found
fiemap took 698 milliseconds (metadata cached)
That's about 2.2x faster when no metadata is cached, and about 3x faster
when all metadata is cached. On a real filesystem with many other files,
data, directories, etc, the b+trees will be 2 or 3 levels higher,
therefore this optimization will have a higher impact.
Several reports of a slow fiemap show up often, the two Link tags below
refer to two recent reports of such slowness. This patch, together with
the next ones in the series, is meant to address that.
Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-01 21:18:28 +08:00
|
|
|
static inline void btrfs_set_last_root_drop_gen(struct btrfs_fs_info *fs_info,
|
|
|
|
u64 gen)
|
|
|
|
{
|
|
|
|
WRITE_ONCE(fs_info->last_root_drop_gen, gen);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline u64 btrfs_get_last_root_drop_gen(const struct btrfs_fs_info *fs_info)
|
|
|
|
{
|
|
|
|
return READ_ONCE(fs_info->last_root_drop_gen);
|
|
|
|
}
|
|
|
|
|
2016-06-15 21:22:56 +08:00
|
|
|
static inline struct btrfs_fs_info *btrfs_sb(struct super_block *sb)
|
|
|
|
{
|
|
|
|
return sb->s_fs_info;
|
|
|
|
}
|
|
|
|
|
2022-09-15 07:04:46 +08:00
|
|
|
/*
|
|
|
|
* Take the number of bytes to be checksummed and figure out how many leaves
|
|
|
|
* it would require to store the csums for that many bytes.
|
|
|
|
*/
|
|
|
|
static inline u64 btrfs_csum_bytes_to_leaves(
|
|
|
|
const struct btrfs_fs_info *fs_info, u64 csum_bytes)
|
|
|
|
{
|
|
|
|
const u64 num_csums = csum_bytes >> fs_info->sectorsize_bits;
|
|
|
|
|
|
|
|
return DIV_ROUND_UP_ULL(num_csums, fs_info->csums_per_leaf);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Use this if we would be adding new items, as we could split nodes as we cow
|
|
|
|
* down the tree.
|
|
|
|
*/
|
|
|
|
static inline u64 btrfs_calc_insert_metadata_size(struct btrfs_fs_info *fs_info,
|
|
|
|
unsigned num_items)
|
|
|
|
{
|
|
|
|
return (u64)fs_info->nodesize * BTRFS_MAX_LEVEL * 2 * num_items;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Doing a truncate or a modification won't result in new nodes or leaves, just
|
|
|
|
* what we need for COW.
|
|
|
|
*/
|
|
|
|
static inline u64 btrfs_calc_metadata_size(struct btrfs_fs_info *fs_info,
|
|
|
|
unsigned num_items)
|
|
|
|
{
|
|
|
|
return (u64)fs_info->nodesize * BTRFS_MAX_LEVEL * num_items;
|
|
|
|
}
|
|
|
|
|
|
|
|
#define BTRFS_MAX_EXTENT_ITEM_SIZE(r) ((BTRFS_LEAF_DATA_SIZE(r->fs_info) >> 4) - \
|
|
|
|
sizeof(struct btrfs_item))
|
|
|
|
|
|
|
|
static inline bool btrfs_is_zoned(const struct btrfs_fs_info *fs_info)
|
|
|
|
{
|
|
|
|
return fs_info->zone_size > 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Count how many fs_info->max_extent_size cover the @size
|
|
|
|
*/
|
|
|
|
static inline u32 count_max_extents(struct btrfs_fs_info *fs_info, u64 size)
|
|
|
|
{
|
|
|
|
#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
|
|
|
|
if (!fs_info)
|
|
|
|
return div_u64(size + BTRFS_MAX_EXTENT_SIZE - 1, BTRFS_MAX_EXTENT_SIZE);
|
|
|
|
#endif
|
|
|
|
|
|
|
|
return div_u64(size + fs_info->max_extent_size - 1, fs_info->max_extent_size);
|
|
|
|
}
|
|
|
|
|
|
|
|
bool btrfs_exclop_start(struct btrfs_fs_info *fs_info,
|
|
|
|
enum btrfs_exclusive_operation type);
|
|
|
|
bool btrfs_exclop_start_try_lock(struct btrfs_fs_info *fs_info,
|
|
|
|
enum btrfs_exclusive_operation type);
|
|
|
|
void btrfs_exclop_start_unlock(struct btrfs_fs_info *fs_info);
|
|
|
|
void btrfs_exclop_finish(struct btrfs_fs_info *fs_info);
|
|
|
|
void btrfs_exclop_balance(struct btrfs_fs_info *fs_info,
|
|
|
|
enum btrfs_exclusive_operation op);
|
|
|
|
|
2014-04-02 19:51:05 +08:00
|
|
|
/*
|
|
|
|
* The state of btrfs root
|
|
|
|
*/
|
2018-11-27 21:57:19 +08:00
|
|
|
enum {
|
|
|
|
/*
|
|
|
|
* btrfs_record_root_in_trans is a multi-step process, and it can race
|
|
|
|
* with the balancing code. But the race is very small, and only the
|
|
|
|
* first time the root is added to each transaction. So IN_TRANS_SETUP
|
|
|
|
* is used to tell us when more checks are required
|
|
|
|
*/
|
|
|
|
BTRFS_ROOT_IN_TRANS_SETUP,
|
2020-05-15 14:01:40 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Set if tree blocks of this root can be shared by other roots.
|
|
|
|
* Only subvolume trees and their reloc trees have this bit set.
|
|
|
|
* Conflicts with TRACK_DIRTY bit.
|
|
|
|
*
|
|
|
|
* This affects two things:
|
|
|
|
*
|
|
|
|
* - How balance works
|
|
|
|
* For shareable roots, we need to use reloc tree and do path
|
|
|
|
* replacement for balance, and need various pre/post hooks for
|
|
|
|
* snapshot creation to handle them.
|
|
|
|
*
|
|
|
|
* While for non-shareable trees, we just simply do a tree search
|
|
|
|
* with COW.
|
|
|
|
*
|
|
|
|
* - How dirty roots are tracked
|
|
|
|
* For shareable roots, btrfs_record_root_in_trans() is needed to
|
|
|
|
* track them, while non-subvolume roots have TRACK_DIRTY bit, they
|
|
|
|
* don't need to set this manually.
|
|
|
|
*/
|
|
|
|
BTRFS_ROOT_SHAREABLE,
|
2018-11-27 21:57:19 +08:00
|
|
|
BTRFS_ROOT_TRACK_DIRTY,
|
2022-07-15 19:59:21 +08:00
|
|
|
BTRFS_ROOT_IN_RADIX,
|
2018-11-27 21:57:19 +08:00
|
|
|
BTRFS_ROOT_ORPHAN_ITEM_INSERTED,
|
|
|
|
BTRFS_ROOT_DEFRAG_RUNNING,
|
|
|
|
BTRFS_ROOT_FORCE_COW,
|
|
|
|
BTRFS_ROOT_MULTI_LOG_TASKS,
|
|
|
|
BTRFS_ROOT_DIRTY,
|
2018-12-01 00:52:13 +08:00
|
|
|
BTRFS_ROOT_DELETING,
|
2019-01-23 15:15:14 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Reloc tree is orphan, only kept here for qgroup delayed subtree scan
|
|
|
|
*
|
|
|
|
* Set for the subvolume tree owning the reloc tree.
|
|
|
|
*/
|
|
|
|
BTRFS_ROOT_DEAD_RELOC_TREE,
|
2019-02-07 04:46:14 +08:00
|
|
|
/* Mark dead root stored on device whose cleanup needs to be resumed */
|
|
|
|
BTRFS_ROOT_DEAD_TREE,
|
btrfs: do not block inode logging for so long during transaction commit
Early on during a transaction commit we acquire the tree_log_mutex and
hold it until after we write the super blocks. But before writing the
extent buffers dirtied by the transaction and the super blocks we unblock
the transaction by setting its state to TRANS_STATE_UNBLOCKED and setting
fs_info->running_transaction to NULL.
This means that after that and before writing the super blocks, new
transactions can start. However if any transaction wants to log an inode,
it will block waiting for the transaction commit to write its dirty
extent buffers and the super blocks because the tree_log_mutex is only
released after those operations are complete, and starting a new log
transaction blocks on that mutex (at start_log_trans()).
Writing the dirty extent buffers and the super blocks can take a very
significant amount of time to complete, but we could allow the tasks
wanting to log an inode to proceed with most of their steps:
1) create the log trees
2) log metadata in the trees
3) write their dirty extent buffers
They only need to wait for the previous transaction commit to complete
(write its super blocks) before they attempt to write their super blocks,
otherwise we could end up with a corrupt filesystem after a crash.
So change start_log_trans() to use the root tree's log_mutex to serialize
for the creation of the log root tree instead of using the tree_log_mutex,
and make btrfs_sync_log() acquire the tree_log_mutex before writing the
super blocks. This allows for inode logging to wait much less time when
there is a previous transaction that is still committing, often not having
to wait at all, as by the time when we try to sync the log the previous
transaction already wrote its super blocks.
This patch belongs to a patch set that is comprised of the following
patches:
btrfs: fix race causing unnecessary inode logging during link and rename
btrfs: fix race that results in logging old extents during a fast fsync
btrfs: fix race that causes unnecessary logging of ancestor inodes
btrfs: fix race that makes inode logging fallback to transaction commit
btrfs: fix race leading to unnecessary transaction commit when logging inode
btrfs: do not block inode logging for so long during transaction commit
The following script that uses dbench was used to measure the impact of
the whole patchset:
$ cat test-dbench.sh
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/mnt/btrfs
MOUNT_OPTIONS="-o ssd"
echo "performance" | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
mkfs.btrfs -f -m single -d single $DEV
mount $MOUNT_OPTIONS $DEV $MNT
dbench -D $MNT -t 300 64
umount $MNT
The test was run on a machine with 12 cores, 64G of ram, using a NVMe
device and a non-debug kernel configuration (Debian's default).
Before patch set:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 11277211 0.250 85.340
Close 8283172 0.002 6.479
Rename 477515 1.935 86.026
Unlink 2277936 0.770 87.071
Deltree 256 15.732 81.379
Mkdir 128 0.003 0.009
Qpathinfo 10221180 0.056 44.404
Qfileinfo 1789967 0.002 4.066
Qfsinfo 1874399 0.003 9.176
Sfileinfo 918589 0.061 10.247
Find 3951758 0.341 54.040
WriteX 5616547 0.047 85.079
ReadX 17676028 0.005 9.704
LockX 36704 0.003 1.800
UnlockX 36704 0.002 0.687
Flush 790541 14.115 676.236
Throughput 1179.19 MB/sec 64 clients 64 procs max_latency=676.240 ms
After patch set:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 12687926 0.171 86.526
Close 9320780 0.002 8.063
Rename 537253 1.444 78.576
Unlink 2561827 0.559 87.228
Deltree 374 11.499 73.549
Mkdir 187 0.003 0.005
Qpathinfo 11500300 0.061 36.801
Qfileinfo 2017118 0.002 7.189
Qfsinfo 2108641 0.003 4.825
Sfileinfo 1033574 0.008 8.065
Find 4446553 0.408 47.835
WriteX 6335667 0.045 84.388
ReadX 19887312 0.003 9.215
LockX 41312 0.003 1.394
UnlockX 41312 0.002 1.425
Flush 889233 13.014 623.259
Throughput 1339.32 MB/sec 64 clients 64 procs max_latency=623.265 ms
+12.7% throughput, -8.2% max latency
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-25 20:19:28 +08:00
|
|
|
/* The root has a log tree. Used for subvolume roots and the tree root. */
|
2020-06-15 17:38:44 +08:00
|
|
|
BTRFS_ROOT_HAS_LOG_TREE,
|
btrfs: qgroup: try to flush qgroup space when we get -EDQUOT
[PROBLEM]
There are known problem related to how btrfs handles qgroup reserved
space. One of the most obvious case is the the test case btrfs/153,
which do fallocate, then write into the preallocated range.
btrfs/153 1s ... - output mismatch (see xfstests-dev/results//btrfs/153.out.bad)
--- tests/btrfs/153.out 2019-10-22 15:18:14.068965341 +0800
+++ xfstests-dev/results//btrfs/153.out.bad 2020-07-01 20:24:40.730000089 +0800
@@ -1,2 +1,5 @@
QA output created by 153
+pwrite: Disk quota exceeded
+/mnt/scratch/testfile2: Disk quota exceeded
+/mnt/scratch/testfile2: Disk quota exceeded
Silence is golden
...
(Run 'diff -u xfstests-dev/tests/btrfs/153.out xfstests-dev/results//btrfs/153.out.bad' to see the entire diff)
[CAUSE]
Since commit c6887cd11149 ("Btrfs: don't do nocow check unless we have to"),
we always reserve space no matter if it's COW or not.
Such behavior change is mostly for performance, and reverting it is not
a good idea anyway.
For preallcoated extent, we reserve qgroup data space for it already,
and since we also reserve data space for qgroup at buffered write time,
it needs twice the space for us to write into preallocated space.
This leads to the -EDQUOT in buffered write routine.
And we can't follow the same solution, unlike data/meta space check,
qgroup reserved space is shared between data/metadata.
The EDQUOT can happen at the metadata reservation, so doing NODATACOW
check after qgroup reservation failure is not a solution.
[FIX]
To solve the problem, we don't return -EDQUOT directly, but every time
we got a -EDQUOT, we try to flush qgroup space:
- Flush all inodes of the root
NODATACOW writes will free the qgroup reserved at run_dealloc_range().
However we don't have the infrastructure to only flush NODATACOW
inodes, here we flush all inodes anyway.
- Wait for ordered extents
This would convert the preallocated metadata space into per-trans
metadata, which can be freed in later transaction commit.
- Commit transaction
This will free all per-trans metadata space.
Also we don't want to trigger flush multiple times, so here we introduce
a per-root wait list and a new root status, to ensure only one thread
starts the flushing.
Fixes: c6887cd11149 ("Btrfs: don't do nocow check unless we have to")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-13 18:50:48 +08:00
|
|
|
/* Qgroup flushing is in progress */
|
|
|
|
BTRFS_ROOT_QGROUP_FLUSHING,
|
2021-11-09 23:12:06 +08:00
|
|
|
/* We started the orphan cleanup for this root. */
|
|
|
|
BTRFS_ROOT_ORPHAN_CLEANUP,
|
2022-02-19 03:56:10 +08:00
|
|
|
/* This root has a drop operation that was started previously. */
|
|
|
|
BTRFS_ROOT_UNFINISHED_DROP,
|
btrfs: fix lockdep splat with reloc root extent buffers
We have been hitting the following lockdep splat with btrfs/187 recently
WARNING: possible circular locking dependency detected
5.19.0-rc8+ #775 Not tainted
------------------------------------------------------
btrfs/752500 is trying to acquire lock:
ffff97e1875a97b8 (btrfs-treloc-02#2){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110
but task is already holding lock:
ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (btrfs-tree-01/1){+.+.}-{3:3}:
down_write_nested+0x41/0x80
__btrfs_tree_lock+0x24/0x110
btrfs_init_new_buffer+0x7d/0x2c0
btrfs_alloc_tree_block+0x120/0x3b0
__btrfs_cow_block+0x136/0x600
btrfs_cow_block+0x10b/0x230
btrfs_search_slot+0x53b/0xb70
btrfs_lookup_inode+0x2a/0xa0
__btrfs_update_delayed_inode+0x5f/0x280
btrfs_async_run_delayed_root+0x24c/0x290
btrfs_work_helper+0xf2/0x3e0
process_one_work+0x271/0x590
worker_thread+0x52/0x3b0
kthread+0xf0/0x120
ret_from_fork+0x1f/0x30
-> #1 (btrfs-tree-01){++++}-{3:3}:
down_write_nested+0x41/0x80
__btrfs_tree_lock+0x24/0x110
btrfs_search_slot+0x3c3/0xb70
do_relocation+0x10c/0x6b0
relocate_tree_blocks+0x317/0x6d0
relocate_block_group+0x1f1/0x560
btrfs_relocate_block_group+0x23e/0x400
btrfs_relocate_chunk+0x4c/0x140
btrfs_balance+0x755/0xe40
btrfs_ioctl+0x1ea2/0x2c90
__x64_sys_ioctl+0x88/0xc0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
-> #0 (btrfs-treloc-02#2){+.+.}-{3:3}:
__lock_acquire+0x1122/0x1e10
lock_acquire+0xc2/0x2d0
down_write_nested+0x41/0x80
__btrfs_tree_lock+0x24/0x110
btrfs_lock_root_node+0x31/0x50
btrfs_search_slot+0x1cb/0xb70
replace_path+0x541/0x9f0
merge_reloc_root+0x1d6/0x610
merge_reloc_roots+0xe2/0x260
relocate_block_group+0x2c8/0x560
btrfs_relocate_block_group+0x23e/0x400
btrfs_relocate_chunk+0x4c/0x140
btrfs_balance+0x755/0xe40
btrfs_ioctl+0x1ea2/0x2c90
__x64_sys_ioctl+0x88/0xc0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
other info that might help us debug this:
Chain exists of:
btrfs-treloc-02#2 --> btrfs-tree-01 --> btrfs-tree-01/1
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(btrfs-tree-01/1);
lock(btrfs-tree-01);
lock(btrfs-tree-01/1);
lock(btrfs-treloc-02#2);
*** DEADLOCK ***
7 locks held by btrfs/752500:
#0: ffff97e292fdf460 (sb_writers#12){.+.+}-{0:0}, at: btrfs_ioctl+0x208/0x2c90
#1: ffff97e284c02050 (&fs_info->reclaim_bgs_lock){+.+.}-{3:3}, at: btrfs_balance+0x55f/0xe40
#2: ffff97e284c00878 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_relocate_block_group+0x236/0x400
#3: ffff97e292fdf650 (sb_internal#2){.+.+}-{0:0}, at: merge_reloc_root+0xef/0x610
#4: ffff97e284c02378 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0
#5: ffff97e284c023a0 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0
#6: ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110
stack backtrace:
CPU: 1 PID: 752500 Comm: btrfs Not tainted 5.19.0-rc8+ #775
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Call Trace:
dump_stack_lvl+0x56/0x73
check_noncircular+0xd6/0x100
? lock_is_held_type+0xe2/0x140
__lock_acquire+0x1122/0x1e10
lock_acquire+0xc2/0x2d0
? __btrfs_tree_lock+0x24/0x110
down_write_nested+0x41/0x80
? __btrfs_tree_lock+0x24/0x110
__btrfs_tree_lock+0x24/0x110
btrfs_lock_root_node+0x31/0x50
btrfs_search_slot+0x1cb/0xb70
? lock_release+0x137/0x2d0
? _raw_spin_unlock+0x29/0x50
? release_extent_buffer+0x128/0x180
replace_path+0x541/0x9f0
merge_reloc_root+0x1d6/0x610
merge_reloc_roots+0xe2/0x260
relocate_block_group+0x2c8/0x560
btrfs_relocate_block_group+0x23e/0x400
btrfs_relocate_chunk+0x4c/0x140
btrfs_balance+0x755/0xe40
btrfs_ioctl+0x1ea2/0x2c90
? lock_is_held_type+0xe2/0x140
? lock_is_held_type+0xe2/0x140
? __x64_sys_ioctl+0x88/0xc0
__x64_sys_ioctl+0x88/0xc0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
This isn't necessarily new, it's just tricky to hit in practice. There
are two competing things going on here. With relocation we create a
snapshot of every fs tree with a reloc tree. Any extent buffers that
get initialized here are initialized with the reloc root lockdep key.
However since it is a snapshot, any blocks that are currently in cache
that originally belonged to the fs tree will have the normal tree
lockdep key set. This creates the lock dependency of
reloc tree -> normal tree
for the extent buffer locking during the first phase of the relocation
as we walk down the reloc root to relocate blocks.
However this is problematic because the final phase of the relocation is
merging the reloc root into the original fs root. This involves
searching down to any keys that exist in the original fs root and then
swapping the relocated block and the original fs root block. We have to
search down to the fs root first, and then go search the reloc root for
the block we need to replace. This creates the dependency of
normal tree -> reloc tree
which is why lockdep complains.
Additionally even if we were to fix this particular mismatch with a
different nesting for the merge case, we're still slotting in a block
that has a owner of the reloc root objectid into a normal tree, so that
block will have its lockdep key set to the tree reloc root, and create a
lockdep splat later on when we wander into that block from the fs root.
Unfortunately the only solution here is to make sure we do not set the
lockdep key to the reloc tree lockdep key normally, and then reset any
blocks we wander into from the reloc root when we're doing the merged.
This solves the problem of having mixed tree reloc keys intermixed with
normal tree keys, and then allows us to make sure in the merge case we
maintain the lock order of
normal tree -> reloc tree
We handle this by setting a bit on the reloc root when we do the search
for the block we want to relocate, and any block we search into or COW
at that point gets set to the reloc tree key. This works correctly
because we only ever COW down to the parent node, so we aren't resetting
the key for the block we're linking into the fs root.
With this patch we no longer have the lockdep splat in btrfs/187.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-27 04:24:04 +08:00
|
|
|
/* This reloc root needs to have its buffers lockdep class reset. */
|
|
|
|
BTRFS_ROOT_RESET_LOCKDEP_CLASS,
|
2018-11-27 21:57:19 +08:00
|
|
|
};
|
2014-04-02 19:51:05 +08:00
|
|
|
|
2022-07-26 06:11:52 +08:00
|
|
|
enum btrfs_lockdep_trans_states {
|
|
|
|
BTRFS_LOCKDEP_TRANS_COMMIT_START,
|
|
|
|
BTRFS_LOCKDEP_TRANS_UNBLOCKED,
|
|
|
|
BTRFS_LOCKDEP_TRANS_SUPER_COMMITTED,
|
|
|
|
BTRFS_LOCKDEP_TRANS_COMPLETED,
|
|
|
|
};
|
|
|
|
|
btrfs: add macros for annotating wait events with lockdep
Introduce four macros that are used to annotate wait events in btrfs code
with lockdep;
1) the btrfs_lockdep_init_map
2) the btrfs_lockdep_acquire,
3) the btrfs_lockdep_release
4) the btrfs_might_wait_for_event macros.
The btrfs_lockdep_init_map macro is used to initialize a lockdep map.
The btrfs_lockdep_<acquire,release> macros are used by threads to take
the lockdep map as readers (shared lock) and release it, respectively.
The btrfs_might_wait_for_event macro is used by threads to take the
lockdep map as writers (exclusive lock) and release it.
In general, the lockdep annotation for wait events work as follows:
The condition for a wait event can be modified and signaled at the same
time by multiple threads. These threads hold the lockdep map as readers
when they enter a context in which blocking would prevent signaling the
condition. Frequently, this occurs when a thread violates a condition
(lockdep map acquire), before restoring it and signaling it at a later
point (lockdep map release).
The threads that block on the wait event take the lockdep map as writers
(exclusive lock). These threads have to block until all the threads that
hold the lockdep map as readers signal the condition for the wait event
and release the lockdep map.
The lockdep annotation is used to warn about potential deadlock scenarios
that involve the threads that modify and signal the wait event condition
and threads that block on the wait event. A simple example is illustrated
below:
Without lockdep:
TA TB
cond = false
lock(A)
wait_event(w, cond)
unlock(A)
lock(A)
cond = true
signal(w)
unlock(A)
With lockdep:
TA TB
rwsem_acquire_read(lockdep_map)
cond = false
lock(A)
rwsem_acquire(lockdep_map)
rwsem_release(lockdep_map)
wait_event(w, cond)
unlock(A)
lock(A)
cond = true
signal(w)
unlock(A)
rwsem_release(lockdep_map)
In the second case, with the lockdep annotation, lockdep would warn about
an ABBA deadlock, while the first case would just deadlock at some point.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-26 06:11:46 +08:00
|
|
|
/*
|
|
|
|
* Lockdep annotation for wait events.
|
|
|
|
*
|
|
|
|
* @owner: The struct where the lockdep map is defined
|
|
|
|
* @lock: The lockdep map corresponding to a wait event
|
|
|
|
*
|
|
|
|
* This macro is used to annotate a wait event. In this case a thread acquires
|
|
|
|
* the lockdep map as writer (exclusive lock) because it has to block until all
|
|
|
|
* the threads that hold the lock as readers signal the condition for the wait
|
|
|
|
* event and release their locks.
|
|
|
|
*/
|
|
|
|
#define btrfs_might_wait_for_event(owner, lock) \
|
|
|
|
do { \
|
|
|
|
rwsem_acquire(&owner->lock##_map, 0, 0, _THIS_IP_); \
|
|
|
|
rwsem_release(&owner->lock##_map, _THIS_IP_); \
|
|
|
|
} while (0)
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Protection for the resource/condition of a wait event.
|
|
|
|
*
|
|
|
|
* @owner: The struct where the lockdep map is defined
|
|
|
|
* @lock: The lockdep map corresponding to a wait event
|
|
|
|
*
|
|
|
|
* Many threads can modify the condition for the wait event at the same time
|
|
|
|
* and signal the threads that block on the wait event. The threads that modify
|
|
|
|
* the condition and do the signaling acquire the lock as readers (shared
|
|
|
|
* lock).
|
|
|
|
*/
|
|
|
|
#define btrfs_lockdep_acquire(owner, lock) \
|
|
|
|
rwsem_acquire_read(&owner->lock##_map, 0, 0, _THIS_IP_)
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Used after signaling the condition for a wait event to release the lockdep
|
|
|
|
* map held by a reader thread.
|
|
|
|
*/
|
|
|
|
#define btrfs_lockdep_release(owner, lock) \
|
|
|
|
rwsem_release(&owner->lock##_map, _THIS_IP_)
|
|
|
|
|
2022-07-26 06:11:52 +08:00
|
|
|
/*
|
|
|
|
* Macros for the transaction states wait events, similar to the generic wait
|
|
|
|
* event macros.
|
|
|
|
*/
|
|
|
|
#define btrfs_might_wait_for_state(owner, i) \
|
|
|
|
do { \
|
|
|
|
rwsem_acquire(&owner->btrfs_state_change_map[i], 0, 0, _THIS_IP_); \
|
|
|
|
rwsem_release(&owner->btrfs_state_change_map[i], _THIS_IP_); \
|
|
|
|
} while (0)
|
|
|
|
|
|
|
|
#define btrfs_trans_state_lockdep_acquire(owner, i) \
|
|
|
|
rwsem_acquire_read(&owner->btrfs_state_change_map[i], 0, 0, _THIS_IP_)
|
|
|
|
|
|
|
|
#define btrfs_trans_state_lockdep_release(owner, i) \
|
|
|
|
rwsem_release(&owner->btrfs_state_change_map[i], _THIS_IP_)
|
|
|
|
|
btrfs: add macros for annotating wait events with lockdep
Introduce four macros that are used to annotate wait events in btrfs code
with lockdep;
1) the btrfs_lockdep_init_map
2) the btrfs_lockdep_acquire,
3) the btrfs_lockdep_release
4) the btrfs_might_wait_for_event macros.
The btrfs_lockdep_init_map macro is used to initialize a lockdep map.
The btrfs_lockdep_<acquire,release> macros are used by threads to take
the lockdep map as readers (shared lock) and release it, respectively.
The btrfs_might_wait_for_event macro is used by threads to take the
lockdep map as writers (exclusive lock) and release it.
In general, the lockdep annotation for wait events work as follows:
The condition for a wait event can be modified and signaled at the same
time by multiple threads. These threads hold the lockdep map as readers
when they enter a context in which blocking would prevent signaling the
condition. Frequently, this occurs when a thread violates a condition
(lockdep map acquire), before restoring it and signaling it at a later
point (lockdep map release).
The threads that block on the wait event take the lockdep map as writers
(exclusive lock). These threads have to block until all the threads that
hold the lockdep map as readers signal the condition for the wait event
and release the lockdep map.
The lockdep annotation is used to warn about potential deadlock scenarios
that involve the threads that modify and signal the wait event condition
and threads that block on the wait event. A simple example is illustrated
below:
Without lockdep:
TA TB
cond = false
lock(A)
wait_event(w, cond)
unlock(A)
lock(A)
cond = true
signal(w)
unlock(A)
With lockdep:
TA TB
rwsem_acquire_read(lockdep_map)
cond = false
lock(A)
rwsem_acquire(lockdep_map)
rwsem_release(lockdep_map)
wait_event(w, cond)
unlock(A)
lock(A)
cond = true
signal(w)
unlock(A)
rwsem_release(lockdep_map)
In the second case, with the lockdep annotation, lockdep would warn about
an ABBA deadlock, while the first case would just deadlock at some point.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-26 06:11:46 +08:00
|
|
|
/* Initialization of the lockdep map */
|
|
|
|
#define btrfs_lockdep_init_map(owner, lock) \
|
|
|
|
do { \
|
|
|
|
static struct lock_class_key lock##_key; \
|
|
|
|
lockdep_init_map(&owner->lock##_map, #lock, &lock##_key, 0); \
|
|
|
|
} while (0)
|
|
|
|
|
2022-07-26 06:11:52 +08:00
|
|
|
/* Initialization of the transaction states lockdep maps. */
|
|
|
|
#define btrfs_state_lockdep_init_map(owner, lock, state) \
|
|
|
|
do { \
|
|
|
|
static struct lock_class_key lock##_key; \
|
|
|
|
lockdep_init_map(&owner->btrfs_state_change_map[state], #lock, \
|
|
|
|
&lock##_key, 0); \
|
|
|
|
} while (0)
|
|
|
|
|
btrfs: qgroup: Introduce per-root swapped blocks infrastructure
To allow delayed subtree swap rescan, btrfs needs to record per-root
information about which tree blocks get swapped. This patch introduces
the required infrastructure.
The designed workflow will be:
1) Record the subtree root block that gets swapped.
During subtree swap:
O = Old tree blocks
N = New tree blocks
reloc tree subvolume tree X
Root Root
/ \ / \
NA OB OA OB
/ | | \ / | | \
NC ND OE OF OC OD OE OF
In this case, NA and OA are going to be swapped, record (NA, OA) into
subvolume tree X.
2) After subtree swap.
reloc tree subvolume tree X
Root Root
/ \ / \
OA OB NA OB
/ | | \ / | | \
OC OD OE OF NC ND OE OF
3a) COW happens for OB
If we are going to COW tree block OB, we check OB's bytenr against
tree X's swapped_blocks structure.
If it doesn't fit any, nothing will happen.
3b) COW happens for NA
Check NA's bytenr against tree X's swapped_blocks, and get a hit.
Then we do subtree scan on both subtrees OA and NA.
Resulting 6 tree blocks to be scanned (OA, OC, OD, NA, NC, ND).
Then no matter what we do to subvolume tree X, qgroup numbers will
still be correct.
Then NA's record gets removed from X's swapped_blocks.
4) Transaction commit
Any record in X's swapped_blocks gets removed, since there is no
modification to swapped subtrees, no need to trigger heavy qgroup
subtree rescan for them.
This will introduce 128 bytes overhead for each btrfs_root even qgroup
is not enabled. This is to reduce memory allocations and potential
failures.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-01-23 15:15:16 +08:00
|
|
|
/*
|
|
|
|
* Record swapped tree blocks of a subvolume tree for delayed subtree trace
|
|
|
|
* code. For detail check comment in fs/btrfs/qgroup.c.
|
|
|
|
*/
|
|
|
|
struct btrfs_qgroup_swapped_blocks {
|
|
|
|
spinlock_t lock;
|
|
|
|
/* RM_EMPTY_ROOT() of above blocks[] */
|
|
|
|
bool swapped;
|
|
|
|
struct rb_root blocks[BTRFS_MAX_LEVEL];
|
|
|
|
};
|
|
|
|
|
2007-03-21 02:38:32 +08:00
|
|
|
/*
|
|
|
|
* in ram representation of the tree. extent_root is used for all allocations
|
2007-04-26 03:52:25 +08:00
|
|
|
* and for the extent tree extent_root root.
|
2007-03-21 02:38:32 +08:00
|
|
|
*/
|
|
|
|
struct btrfs_root {
|
2021-11-06 04:45:51 +08:00
|
|
|
struct rb_node rb_node;
|
|
|
|
|
2007-10-16 04:14:19 +08:00
|
|
|
struct extent_buffer *node;
|
2008-06-26 04:01:30 +08:00
|
|
|
|
2007-10-16 04:14:19 +08:00
|
|
|
struct extent_buffer *commit_root;
|
2008-09-06 04:13:11 +08:00
|
|
|
struct btrfs_root *log_root;
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
struct btrfs_root *reloc_root;
|
2008-07-29 03:32:19 +08:00
|
|
|
|
2014-04-02 19:51:05 +08:00
|
|
|
unsigned long state;
|
2007-03-16 00:56:47 +08:00
|
|
|
struct btrfs_root_item root_item;
|
|
|
|
struct btrfs_key root_key;
|
2007-03-21 02:38:32 +08:00
|
|
|
struct btrfs_fs_info *fs_info;
|
2008-09-12 04:17:57 +08:00
|
|
|
struct extent_io_tree dirty_log_pages;
|
|
|
|
|
2008-06-26 04:01:30 +08:00
|
|
|
struct mutex objectid_mutex;
|
2009-01-22 01:54:03 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
spinlock_t accounting_lock;
|
|
|
|
struct btrfs_block_rsv *block_rsv;
|
|
|
|
|
2008-09-06 04:13:11 +08:00
|
|
|
struct mutex log_mutex;
|
2009-01-22 01:54:03 +08:00
|
|
|
wait_queue_head_t log_writer_wait;
|
|
|
|
wait_queue_head_t log_commit_wait[2];
|
2014-02-20 18:08:58 +08:00
|
|
|
struct list_head log_ctxs[2];
|
btrfs: remove no longer needed use of log_writers for the log root tree
When syncing the log, we used to update the log root tree without holding
neither the log_mutex of the subvolume root nor the log_mutex of log root
tree.
We used to have two critical sections delimited by the log_mutex of the
log root tree, so in the first one we incremented the log_writers of the
log root tree and on the second one we decremented it and waited for the
log_writers counter to go down to zero. This was because the update of
the log root tree happened between the two critical sections.
The use of two critical sections allowed a little bit more of parallelism
and required the use of the log_writers counter, necessary to make sure
we didn't miss any log root tree update when we have multiple tasks trying
to sync the log in parallel.
However after commit 06989c799f0481 ("Btrfs: fix race updating log root
item during fsync") the log root tree update was moved into a critical
section delimited by the subvolume's log_mutex. Later another commit
moved the log tree update from that critical section into the second
critical section delimited by the log_mutex of the log root tree. Both
commits addressed different bugs.
The end result is that the first critical section delimited by the
log_mutex of the log root tree became pointless, since there's nothing
done between it and the second critical section, we just have an unlock
of the log_mutex followed by a lock operation. This means we can merge
both critical sections, as the first one does almost nothing now, and we
can stop using the log_writers counter of the log root tree, which was
incremented in the first critical section and decremented in the second
criticial section, used to make sure no one in the second critical section
started writeback of the log root tree before some other task updated it.
So just remove the mutex_unlock() followed by mutex_lock() of the log root
tree, as well as the use of the log_writers counter for the log root tree.
This patch is part of a series that has the following patches:
1/4 btrfs: only commit the delayed inode when doing a full fsync
2/4 btrfs: only commit delayed items at fsync if we are logging a directory
3/4 btrfs: stop incremening log_batch for the log root tree when syncing log
4/4 btrfs: remove no longer needed use of log_writers for the log root tree
After the entire patchset applied I saw about 12% decrease on max latency
reported by dbench. The test was done on a qemu vm, with 8 cores, 16Gb of
ram, using kvm and using a raw NVMe device directly (no intermediary fs on
the host). The test was invoked like the following:
mkfs.btrfs -f /dev/sdk
mount -o ssd -o nospace_cache /dev/sdk /mnt/sdk
dbench -D /mnt/sdk -t 300 8
umount /mnt/dsk
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-02 19:32:40 +08:00
|
|
|
/* Used only for log trees of subvolumes, not for the log root tree */
|
2009-01-22 01:54:03 +08:00
|
|
|
atomic_t log_writers;
|
|
|
|
atomic_t log_commit[2];
|
2020-07-02 19:32:31 +08:00
|
|
|
/* Used only for log trees of subvolumes, not for the log root tree */
|
2012-09-06 18:04:27 +08:00
|
|
|
atomic_t log_batch;
|
2014-02-20 18:08:56 +08:00
|
|
|
int log_transid;
|
2014-02-20 18:08:59 +08:00
|
|
|
/* No matter the commit succeeds or not*/
|
|
|
|
int log_transid_committed;
|
|
|
|
/* Just be updated when the commit succeeds. */
|
2014-02-20 18:08:56 +08:00
|
|
|
int last_log_commit;
|
2009-10-09 03:30:04 +08:00
|
|
|
pid_t log_start_pid;
|
2008-08-05 11:17:27 +08:00
|
|
|
|
2007-04-09 22:42:37 +08:00
|
|
|
u64 last_trans;
|
2007-10-16 04:14:19 +08:00
|
|
|
|
2007-03-21 02:38:32 +08:00
|
|
|
u32 type;
|
2009-09-22 03:56:00 +08:00
|
|
|
|
2020-12-07 23:32:35 +08:00
|
|
|
u64 free_objectid;
|
2011-06-14 08:00:16 +08:00
|
|
|
|
2007-08-08 04:15:09 +08:00
|
|
|
struct btrfs_key defrag_progress;
|
2008-05-25 02:04:53 +08:00
|
|
|
struct btrfs_key defrag_max;
|
2008-03-25 03:01:56 +08:00
|
|
|
|
2020-05-15 14:01:40 +08:00
|
|
|
/* The dirty list is only used by non-shareable roots */
|
2008-03-25 03:01:56 +08:00
|
|
|
struct list_head dirty_list;
|
2008-07-25 00:17:14 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
struct list_head root_list;
|
|
|
|
|
2012-10-13 03:27:49 +08:00
|
|
|
spinlock_t log_extents_lock[2];
|
|
|
|
struct list_head logged_list[2];
|
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
spinlock_t inode_lock;
|
|
|
|
/* red-black tree that keeps track of in-memory inodes */
|
|
|
|
struct rb_root inode_tree;
|
|
|
|
|
btrfs: implement delayed inode items operation
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
root's radix tree, and letting btrfs inodes go.
Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which is spotted by
Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which is reported by
Itaru Kitayama.
Changelog V3 -> V4:
- Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
inode in time.
Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
balance, which is reported by Tsutomu Itoh.
- Modify the patch address David Sterba's comment.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason
Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
which is created for every directory and file, and used to manage the
delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.
Compare with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. the reason is that btrfs must do a lot of b+ tree insertions,
such as inode item, directory name item, directory name index and so on.
If we can do some delayed b+ tree insertion or deletion, we can improve the
performance, so we made this patch which implemented delayed directory name
index insertion/deletion and delayed inode update.
Implementation:
- introduce a delayed root object into the filesystem, that use two lists to
manage the delayed nodes which are created for every file/directory.
One is used to manage all the delayed nodes that have delayed items. And the
other is used to manage the delayed nodes which is waiting to be dealt with
by the work thread.
- Every delayed node has two rb-tree, one is used to manage the directory name
index which is going to be inserted into b+ tree, and the other is used to
manage the directory name index which is going to be deleted from b+ tree.
- introduce a worker to deal with the delayed operation. This worker is used
to deal with the works of the delayed directory name index items insertion
and deletion and the delayed inode update.
When the delayed items is beyond the lower limit, we create works for some
delayed nodes and insert them into the work queue of the worker, and then
go back.
When the delayed items is beyond the upper bound, we create works for all
the delayed nodes that haven't been dealt with, and insert them into the work
queue of the worker, and then wait for that the untreated items is below some
threshold value.
- When we want to insert a directory name index into b+ tree, we just add the
information into the delayed inserting rb-tree.
And then we check the number of the delayed items and do delayed items
balance. (The balance policy is above.)
- When we want to delete a directory name index from the b+ tree, we search it
in the inserting rb-tree at first. If we look it up, just drop it. If not,
add the key of it into the delayed deleting rb-tree.
Similar to the delayed inserting rb-tree, we also check the number of the
delayed items and do delayed items balance.
(The same to inserting manipulation)
- When we want to update the metadata of some inode, we cached the data of the
inode into the delayed node. the worker will flush it into the b+ tree after
dealing with the delayed insertion and deletion.
- We will move the delayed node to the tail of the list after we access the
delayed node, By this way, we can cache more delayed items and merge more
inode updates.
- If we want to commit transaction, we will deal with all the delayed node.
- the delayed node will be freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
and the delayed inode update.
I did a quick test by the benchmark tool[1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.
Before applying this patch:
Create files:
Total files: 50000
Total time: 1.096108
Average time: 0.000022
Delete files:
Total files: 50000
Total time: 1.510403
Average time: 0.000030
After applying this patch:
Create files:
Total files: 50000
Total time: 0.932899
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.215732
Average time: 0.000024
[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
Many thanks for Kitayama-san's help!
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-22 18:12:22 +08:00
|
|
|
/*
|
2022-07-15 19:59:45 +08:00
|
|
|
* radix tree that keeps track of delayed nodes of every inode,
|
|
|
|
* protected by inode_lock
|
btrfs: implement delayed inode items operation
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
root's radix tree, and letting btrfs inodes go.
Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which is spotted by
Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which is reported by
Itaru Kitayama.
Changelog V3 -> V4:
- Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
inode in time.
Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
balance, which is reported by Tsutomu Itoh.
- Modify the patch address David Sterba's comment.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason
Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
which is created for every directory and file, and used to manage the
delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.
Compare with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. the reason is that btrfs must do a lot of b+ tree insertions,
such as inode item, directory name item, directory name index and so on.
If we can do some delayed b+ tree insertion or deletion, we can improve the
performance, so we made this patch which implemented delayed directory name
index insertion/deletion and delayed inode update.
Implementation:
- introduce a delayed root object into the filesystem, that use two lists to
manage the delayed nodes which are created for every file/directory.
One is used to manage all the delayed nodes that have delayed items. And the
other is used to manage the delayed nodes which is waiting to be dealt with
by the work thread.
- Every delayed node has two rb-tree, one is used to manage the directory name
index which is going to be inserted into b+ tree, and the other is used to
manage the directory name index which is going to be deleted from b+ tree.
- introduce a worker to deal with the delayed operation. This worker is used
to deal with the works of the delayed directory name index items insertion
and deletion and the delayed inode update.
When the delayed items is beyond the lower limit, we create works for some
delayed nodes and insert them into the work queue of the worker, and then
go back.
When the delayed items is beyond the upper bound, we create works for all
the delayed nodes that haven't been dealt with, and insert them into the work
queue of the worker, and then wait for that the untreated items is below some
threshold value.
- When we want to insert a directory name index into b+ tree, we just add the
information into the delayed inserting rb-tree.
And then we check the number of the delayed items and do delayed items
balance. (The balance policy is above.)
- When we want to delete a directory name index from the b+ tree, we search it
in the inserting rb-tree at first. If we look it up, just drop it. If not,
add the key of it into the delayed deleting rb-tree.
Similar to the delayed inserting rb-tree, we also check the number of the
delayed items and do delayed items balance.
(The same to inserting manipulation)
- When we want to update the metadata of some inode, we cached the data of the
inode into the delayed node. the worker will flush it into the b+ tree after
dealing with the delayed insertion and deletion.
- We will move the delayed node to the tail of the list after we access the
delayed node, By this way, we can cache more delayed items and merge more
inode updates.
- If we want to commit transaction, we will deal with all the delayed node.
- the delayed node will be freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
and the delayed inode update.
I did a quick test by the benchmark tool[1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.
Before applying this patch:
Create files:
Total files: 50000
Total time: 1.096108
Average time: 0.000022
Delete files:
Total files: 50000
Total time: 1.510403
Average time: 0.000030
After applying this patch:
Create files:
Total files: 50000
Total time: 0.932899
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.215732
Average time: 0.000024
[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
Many thanks for Kitayama-san's help!
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-22 18:12:22 +08:00
|
|
|
*/
|
2022-07-15 19:59:45 +08:00
|
|
|
struct radix_tree_root delayed_nodes_tree;
|
2008-11-18 09:42:26 +08:00
|
|
|
/*
|
|
|
|
* right now this just gets used so that a root has its own devid
|
|
|
|
* for stat. It may be used for more later
|
|
|
|
*/
|
2011-07-08 03:44:25 +08:00
|
|
|
dev_t anon_dev;
|
2011-11-15 09:48:06 +08:00
|
|
|
|
2012-12-07 17:28:54 +08:00
|
|
|
spinlock_t root_item_lock;
|
2017-03-03 16:55:18 +08:00
|
|
|
refcount_t refs;
|
2013-05-15 15:48:22 +08:00
|
|
|
|
2014-03-06 13:55:03 +08:00
|
|
|
struct mutex delalloc_mutex;
|
2013-05-15 15:48:22 +08:00
|
|
|
spinlock_t delalloc_lock;
|
|
|
|
/*
|
|
|
|
* all of the inodes that have delalloc bytes. It is possible for
|
|
|
|
* this list to be empty even when there is still dirty data=ordered
|
|
|
|
* extents waiting to finish IO.
|
|
|
|
*/
|
|
|
|
struct list_head delalloc_inodes;
|
|
|
|
struct list_head delalloc_root;
|
|
|
|
u64 nr_delalloc_inodes;
|
2014-03-06 13:55:02 +08:00
|
|
|
|
|
|
|
struct mutex ordered_extent_mutex;
|
2013-05-15 15:48:23 +08:00
|
|
|
/*
|
|
|
|
* this is used by the balancing code to wait for all the pending
|
|
|
|
* ordered extents
|
|
|
|
*/
|
|
|
|
spinlock_t ordered_extent_lock;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* all of the data=ordered extents pending writeback
|
|
|
|
* these can span multiple transactions and basically include
|
|
|
|
* every dirty data page that isn't from nodatacow
|
|
|
|
*/
|
|
|
|
struct list_head ordered_extents;
|
|
|
|
struct list_head ordered_root;
|
|
|
|
u64 nr_ordered_extents;
|
2013-12-17 00:34:17 +08:00
|
|
|
|
2019-01-23 15:15:14 +08:00
|
|
|
/*
|
|
|
|
* Not empty if this subvolume root has gone through tree block swap
|
|
|
|
* (relocation)
|
|
|
|
*
|
|
|
|
* Will be used by reloc_control::dirty_subvol_roots.
|
|
|
|
*/
|
|
|
|
struct list_head reloc_dirty_list;
|
|
|
|
|
2013-12-17 00:34:17 +08:00
|
|
|
/*
|
|
|
|
* Number of currently running SEND ioctls to prevent
|
|
|
|
* manipulation with the read-only status via SUBVOL_SETFLAGS
|
|
|
|
*/
|
|
|
|
int send_in_progress;
|
Btrfs: fix race between send and deduplication that lead to failures and crashes
Send operates on read only trees and expects them to never change while it
is using them. This is part of its initial design, and this expection is
due to two different reasons:
1) When it was introduced, no operations were allowed to modifiy read-only
subvolumes/snapshots (including defrag for example).
2) It keeps send from having an impact on other filesystem operations.
Namely send does not need to keep locks on the trees nor needs to hold on
to transaction handles and delay transaction commits. This ends up being
a consequence of the former reason.
However the deduplication feature was introduced later (on September 2013,
while send was introduced in July 2012) and it allowed for deduplication
with destination files that belong to read-only trees (subvolumes and
snapshots).
That means that having a send operation (either full or incremental) running
in parallel with a deduplication that has the destination inode in one of
the trees used by the send operation, can result in tree nodes and leaves
getting freed and reused while send is using them. This problem is similar
to the problem solved for the root nodes getting freed and reused when a
snapshot is made against one tree that is currenly being used by a send
operation, fixed in commits [1] and [2]. These commits explain in detail
how the problem happens and the explanation is valid for any node or leaf
that is not the root of a tree as well. This problem was also discussed
and explained recently in a thread [3].
The problem is very easy to reproduce when using send with large trees
(snapshots) and just a few concurrent deduplication operations that target
files in the trees used by send. A stress test case is being sent for
fstests that triggers the issue easily. The most common error to hit is
the send ioctl return -EIO with the following messages in dmesg/syslog:
[1631617.204075] BTRFS error (device sdc): did not find backref in send_root. inode=63292, offset=0, disk_byte=5228134400 found extent=5228134400
[1631633.251754] BTRFS error (device sdc): parent transid verify failed on 32243712 wanted 24 found 27
The first one is very easy to hit while the second one happens much less
frequently, except for very large trees (in that test case, snapshots
with 100000 files having large xattrs to get deep and wide trees).
Less frequently, at least one BUG_ON can be hit:
[1631742.130080] ------------[ cut here ]------------
[1631742.130625] kernel BUG at fs/btrfs/ctree.c:1806!
[1631742.131188] invalid opcode: 0000 [#6] SMP DEBUG_PAGEALLOC PTI
[1631742.131726] CPU: 1 PID: 13394 Comm: btrfs Tainted: G B D W 5.0.0-rc8-btrfs-next-45 #1
[1631742.132265] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
[1631742.133399] RIP: 0010:read_node_slot+0x122/0x130 [btrfs]
(...)
[1631742.135061] RSP: 0018:ffffb530021ebaa0 EFLAGS: 00010246
[1631742.135615] RAX: ffff93ac8912e000 RBX: 000000000000009d RCX: 0000000000000002
[1631742.136173] RDX: 000000000000009d RSI: ffff93ac564b0d08 RDI: ffff93ad5b48c000
[1631742.136759] RBP: ffffb530021ebb7d R08: 0000000000000001 R09: ffffb530021ebb7d
[1631742.137324] R10: ffffb530021eba70 R11: 0000000000000000 R12: ffff93ac87d0a708
[1631742.137900] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
[1631742.138455] FS: 00007f4cdb1528c0(0000) GS:ffff93ad76a80000(0000) knlGS:0000000000000000
[1631742.139010] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1631742.139568] CR2: 00007f5acb3d0420 CR3: 000000012be3e006 CR4: 00000000003606e0
[1631742.140131] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1631742.140719] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[1631742.141272] Call Trace:
[1631742.141826] ? do_raw_spin_unlock+0x49/0xc0
[1631742.142390] tree_advance+0x173/0x1d0 [btrfs]
[1631742.142948] btrfs_compare_trees+0x268/0x690 [btrfs]
[1631742.143533] ? process_extent+0x1070/0x1070 [btrfs]
[1631742.144088] btrfs_ioctl_send+0x1037/0x1270 [btrfs]
[1631742.144645] _btrfs_ioctl_send+0x80/0x110 [btrfs]
[1631742.145161] ? trace_sched_stick_numa+0xe0/0xe0
[1631742.145685] btrfs_ioctl+0x13fe/0x3120 [btrfs]
[1631742.146179] ? account_entity_enqueue+0xd3/0x100
[1631742.146662] ? reweight_entity+0x154/0x1a0
[1631742.147135] ? update_curr+0x20/0x2a0
[1631742.147593] ? check_preempt_wakeup+0x103/0x250
[1631742.148053] ? do_vfs_ioctl+0xa2/0x6f0
[1631742.148510] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[1631742.148942] do_vfs_ioctl+0xa2/0x6f0
[1631742.149361] ? __fget+0x113/0x200
[1631742.149767] ksys_ioctl+0x70/0x80
[1631742.150159] __x64_sys_ioctl+0x16/0x20
[1631742.150543] do_syscall_64+0x60/0x1b0
[1631742.150931] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[1631742.151326] RIP: 0033:0x7f4cd9f5add7
(...)
[1631742.152509] RSP: 002b:00007ffe91017708 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[1631742.152892] RAX: ffffffffffffffda RBX: 0000000000000105 RCX: 00007f4cd9f5add7
[1631742.153268] RDX: 00007ffe91017790 RSI: 0000000040489426 RDI: 0000000000000007
[1631742.153633] RBP: 0000000000000007 R08: 00007f4cd9e79700 R09: 00007f4cd9e79700
[1631742.153999] R10: 00007f4cd9e799d0 R11: 0000000000000202 R12: 0000000000000003
[1631742.154365] R13: 0000555dfae53020 R14: 0000000000000000 R15: 0000000000000001
(...)
[1631742.156696] ---[ end trace 5dac9f96dcc3fd6b ]---
That BUG_ON happens because while send is using a node, that node is COWed
by a concurrent deduplication, gets freed and gets reused as a leaf (because
a transaction commit happened in between), so when it attempts to read a
slot from the extent buffer, at ctree.c:read_node_slot(), the extent buffer
contents were wiped out and it now matches a leaf (which can even belong to
some other tree now), hitting the BUG_ON(level == 0).
Fix this concurrency issue by not allowing send and deduplication to run
in parallel if both operate on the same readonly trees, returning EAGAIN
to user space and logging an exlicit warning in dmesg/syslog.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=be6821f82c3cc36e026f5afd10249988852b35ea
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6f2f0b394b54e2b159ef969a0b5274e9bbf82ff2
[3] https://lore.kernel.org/linux-btrfs/CAL3q7H7iqSEEyFaEtpRZw3cp613y+4k2Q8b4W7mweR3tZA05bQ@mail.gmail.com/
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-22 23:43:42 +08:00
|
|
|
/*
|
|
|
|
* Number of currently running deduplication operations that have a
|
|
|
|
* destination inode belonging to this root. Protected by the lock
|
|
|
|
* root_item_lock.
|
|
|
|
*/
|
|
|
|
int dedupe_in_progress;
|
2020-01-30 20:59:45 +08:00
|
|
|
/* For exclusion of snapshot creation and nocow writes */
|
|
|
|
struct btrfs_drew_lock snapshot_lock;
|
|
|
|
|
Btrfs: fix unexpected failure of nocow buffered writes after snapshotting when low on space
Commit e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting") forced
nocow writes to fallback to COW, during writeback, when a snapshot is
created. This resulted in writes made before creating the snapshot to
unexpectedly fail with ENOSPC during writeback when success (0) was
returned to user space through the write system call.
The steps leading to this problem are:
1. When it's not possible to allocate data space for a write, the
buffered write path checks if a NOCOW write is possible. If it is,
it will not reserve space and success (0) is returned to user space.
2. Then when a snapshot is created, the root's will_be_snapshotted
atomic is incremented and writeback is triggered for all inode's that
belong to the root being snapshotted. Incrementing that atomic forces
all previous writes to fallback to COW during writeback (running
delalloc).
3. This results in the writeback for the inodes to fail and therefore
setting the ENOSPC error in their mappings, so that a subsequent
fsync on them will report the error to user space. So it's not a
completely silent data loss (since fsync will report ENOSPC) but it's
a very unexpected and undesirable behaviour, because if a clean
shutdown/unmount of the filesystem happens without previous calls to
fsync, it is expected to have the data present in the files after
mounting the filesystem again.
So fix this by adding a new atomic named snapshot_force_cow to the
root structure which prevents this behaviour and works the following way:
1. It is incremented when we start to create a snapshot after triggering
writeback and before waiting for writeback to finish.
2. This new atomic is now what is used by writeback (running delalloc)
to decide whether we need to fallback to COW or not. Because we
incremented this new atomic after triggering writeback in the
snapshot creation ioctl, we ensure that all buffered writes that
happened before snapshot creation will succeed and not fallback to
COW (which would make them fail with ENOSPC).
3. The existing atomic, will_be_snapshotted, is kept because it is used
to force new buffered writes, that start after we started
snapshotting, to reserve data space even when NOCOW is possible.
This makes these writes fail early with ENOSPC when there's no
available space to allocate, preventing the unexpected behaviour of
writeback later failing with ENOSPC due to a fallback to COW mode.
Fixes: e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting")
Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 10:30:30 +08:00
|
|
|
atomic_t snapshot_force_cow;
|
2017-12-12 15:34:34 +08:00
|
|
|
|
|
|
|
/* For qgroup metadata reserved space */
|
|
|
|
spinlock_t qgroup_meta_rsv_lock;
|
|
|
|
u64 qgroup_meta_rsv_pertrans;
|
|
|
|
u64 qgroup_meta_rsv_prealloc;
|
btrfs: qgroup: try to flush qgroup space when we get -EDQUOT
[PROBLEM]
There are known problem related to how btrfs handles qgroup reserved
space. One of the most obvious case is the the test case btrfs/153,
which do fallocate, then write into the preallocated range.
btrfs/153 1s ... - output mismatch (see xfstests-dev/results//btrfs/153.out.bad)
--- tests/btrfs/153.out 2019-10-22 15:18:14.068965341 +0800
+++ xfstests-dev/results//btrfs/153.out.bad 2020-07-01 20:24:40.730000089 +0800
@@ -1,2 +1,5 @@
QA output created by 153
+pwrite: Disk quota exceeded
+/mnt/scratch/testfile2: Disk quota exceeded
+/mnt/scratch/testfile2: Disk quota exceeded
Silence is golden
...
(Run 'diff -u xfstests-dev/tests/btrfs/153.out xfstests-dev/results//btrfs/153.out.bad' to see the entire diff)
[CAUSE]
Since commit c6887cd11149 ("Btrfs: don't do nocow check unless we have to"),
we always reserve space no matter if it's COW or not.
Such behavior change is mostly for performance, and reverting it is not
a good idea anyway.
For preallcoated extent, we reserve qgroup data space for it already,
and since we also reserve data space for qgroup at buffered write time,
it needs twice the space for us to write into preallocated space.
This leads to the -EDQUOT in buffered write routine.
And we can't follow the same solution, unlike data/meta space check,
qgroup reserved space is shared between data/metadata.
The EDQUOT can happen at the metadata reservation, so doing NODATACOW
check after qgroup reservation failure is not a solution.
[FIX]
To solve the problem, we don't return -EDQUOT directly, but every time
we got a -EDQUOT, we try to flush qgroup space:
- Flush all inodes of the root
NODATACOW writes will free the qgroup reserved at run_dealloc_range().
However we don't have the infrastructure to only flush NODATACOW
inodes, here we flush all inodes anyway.
- Wait for ordered extents
This would convert the preallocated metadata space into per-trans
metadata, which can be freed in later transaction commit.
- Commit transaction
This will free all per-trans metadata space.
Also we don't want to trigger flush multiple times, so here we introduce
a per-root wait list and a new root status, to ensure only one thread
starts the flushing.
Fixes: c6887cd11149 ("Btrfs: don't do nocow check unless we have to")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-13 18:50:48 +08:00
|
|
|
wait_queue_head_t qgroup_flush_wait;
|
2018-08-17 23:38:12 +08:00
|
|
|
|
Btrfs: prevent ioctls from interfering with a swap file
A later patch will implement swap file support for Btrfs, but before we
do that, we need to make sure that the various Btrfs ioctls cannot
change a swap file.
When a swap file is active, we must make sure that the extents of the
file are not moved and that they don't become shared. That means that
the following are not safe:
- chattr +c (enable compression)
- reflink
- dedupe
- snapshot
- defrag
Don't allow those to happen on an active swap file.
Additionally, balance, resize, device remove, and device replace are
also unsafe if they affect an active swapfile. Add a red-black tree of
block groups and devices which contain an active swapfile. Relocation
checks each block group against this tree and skips it or errors out for
balance or resize, respectively. Device remove and device replace check
the tree for the device they will operate on.
Note that we don't have to worry about chattr -C (disable nocow), which
we ignore for non-empty files, because an active swapfile must be
non-empty and can't be truncated. We also don't have to worry about
autodefrag because it's only done on COW files. Truncate and fallocate
are already taken care of by the generic code. Device add doesn't do
relocation so it's not an issue, either.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-04 01:28:12 +08:00
|
|
|
/* Number of active swapfiles */
|
|
|
|
atomic_t nr_swapfiles;
|
|
|
|
|
btrfs: qgroup: Introduce per-root swapped blocks infrastructure
To allow delayed subtree swap rescan, btrfs needs to record per-root
information about which tree blocks get swapped. This patch introduces
the required infrastructure.
The designed workflow will be:
1) Record the subtree root block that gets swapped.
During subtree swap:
O = Old tree blocks
N = New tree blocks
reloc tree subvolume tree X
Root Root
/ \ / \
NA OB OA OB
/ | | \ / | | \
NC ND OE OF OC OD OE OF
In this case, NA and OA are going to be swapped, record (NA, OA) into
subvolume tree X.
2) After subtree swap.
reloc tree subvolume tree X
Root Root
/ \ / \
OA OB NA OB
/ | | \ / | | \
OC OD OE OF NC ND OE OF
3a) COW happens for OB
If we are going to COW tree block OB, we check OB's bytenr against
tree X's swapped_blocks structure.
If it doesn't fit any, nothing will happen.
3b) COW happens for NA
Check NA's bytenr against tree X's swapped_blocks, and get a hit.
Then we do subtree scan on both subtrees OA and NA.
Resulting 6 tree blocks to be scanned (OA, OC, OD, NA, NC, ND).
Then no matter what we do to subvolume tree X, qgroup numbers will
still be correct.
Then NA's record gets removed from X's swapped_blocks.
4) Transaction commit
Any record in X's swapped_blocks gets removed, since there is no
modification to swapped subtrees, no need to trigger heavy qgroup
subtree rescan for them.
This will introduce 128 bytes overhead for each btrfs_root even qgroup
is not enabled. This is to reduce memory allocations and potential
failures.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-01-23 15:15:16 +08:00
|
|
|
/* Record pairs of swapped blocks for qgroup */
|
|
|
|
struct btrfs_qgroup_swapped_blocks swapped_blocks;
|
|
|
|
|
btrfs: fix corrupt log due to concurrent fsync of inodes with shared extents
When we have extents shared amongst different inodes in the same subvolume,
if we fsync them in parallel we can end up with checksum items in the log
tree that represent ranges which overlap.
For example, consider we have inodes A and B, both sharing an extent that
covers the logical range from X to X + 64KiB:
1) Task A starts an fsync on inode A;
2) Task B starts an fsync on inode B;
3) Task A calls btrfs_csum_file_blocks(), and the first search in the
log tree, through btrfs_lookup_csum(), returns -EFBIG because it
finds an existing checksum item that covers the range from X - 64KiB
to X;
4) Task A checks that the checksum item has not reached the maximum
possible size (MAX_CSUM_ITEMS) and then releases the search path
before it does another path search for insertion (through a direct
call to btrfs_search_slot());
5) As soon as task A releases the path and before it does the search
for insertion, task B calls btrfs_csum_file_blocks() and gets -EFBIG
too, because there is an existing checksum item that has an end
offset that matches the start offset (X) of the checksum range we want
to log;
6) Task B releases the path;
7) Task A does the path search for insertion (through btrfs_search_slot())
and then verifies that the checksum item that ends at offset X still
exists and extends its size to insert the checksums for the range from
X to X + 64KiB;
8) Task A releases the path and returns from btrfs_csum_file_blocks(),
having inserted the checksums into an existing checksum item that got
its size extended. At this point we have one checksum item in the log
tree that covers the logical range from X - 64KiB to X + 64KiB;
9) Task B now does a search for insertion using btrfs_search_slot() too,
but it finds that the previous checksum item no longer ends at the
offset X, it now ends at an of offset X + 64KiB, so it leaves that item
untouched.
Then it releases the path and calls btrfs_insert_empty_item()
that inserts a checksum item with a key offset corresponding to X and
a size for inserting a single checksum (4 bytes in case of crc32c).
Subsequent iterations end up extending this new checksum item so that
it contains the checksums for the range from X to X + 64KiB.
So after task B returns from btrfs_csum_file_blocks() we end up with
two checksum items in the log tree that have overlapping ranges, one
for the range from X - 64KiB to X + 64KiB, and another for the range
from X to X + 64KiB.
Having checksum items that represent ranges which overlap, regardless of
being in the log tree or in the chekcsums tree, can lead to problems where
checksums for a file range end up not being found. This type of problem
has happened a few times in the past and the following commits fixed them
and explain in detail why having checksum items with overlapping ranges is
problematic:
27b9a8122ff71a "Btrfs: fix csum tree corruption, duplicate and outdated checksums"
b84b8390d6009c "Btrfs: fix file read corruption after extent cloning and fsync"
40e046acbd2f36 "Btrfs: fix missing data checksums after replaying a log tree"
Since this specific instance of the problem can only happen when logging
inodes, because it is the only case where concurrent attempts to insert
checksums for the same range can happen, fix the issue by using an extent
io tree as a range lock to serialize checksum insertion during inode
logging.
This issue could often be reproduced by the test case generic/457 from
fstests. When it happens it produces the following trace:
BTRFS critical (device dm-0): corrupt leaf: root=18446744073709551610 block=30625792 slot=42, csum end range (15020032) goes beyond the start range (15015936) of the next csum item
BTRFS info (device dm-0): leaf 30625792 gen 7 total ptrs 49 free space 2402 owner 18446744073709551610
BTRFS info (device dm-0): refs 1 lock (w:0 r:0 bw:0 br:0 sw:0 sr:0) lock_owner 0 current 15884
item 0 key (18446744073709551606 128 13979648) itemoff 3991 itemsize 4
item 1 key (18446744073709551606 128 13983744) itemoff 3987 itemsize 4
item 2 key (18446744073709551606 128 13987840) itemoff 3983 itemsize 4
item 3 key (18446744073709551606 128 13991936) itemoff 3979 itemsize 4
item 4 key (18446744073709551606 128 13996032) itemoff 3975 itemsize 4
item 5 key (18446744073709551606 128 14000128) itemoff 3971 itemsize 4
(...)
BTRFS error (device dm-0): block=30625792 write time tree block corruption detected
------------[ cut here ]------------
WARNING: CPU: 1 PID: 15884 at fs/btrfs/disk-io.c:539 btree_csum_one_bio+0x268/0x2d0 [btrfs]
Modules linked in: btrfs dm_thin_pool ...
CPU: 1 PID: 15884 Comm: fsx Tainted: G W 5.6.0-rc7-btrfs-next-58 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
RIP: 0010:btree_csum_one_bio+0x268/0x2d0 [btrfs]
Code: c7 c7 ...
RSP: 0018:ffffbb0109e6f8e0 EFLAGS: 00010296
RAX: 0000000000000000 RBX: ffffe1c0847b6080 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffffaa963988 RDI: 0000000000000001
RBP: ffff956a4f4d2000 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000526 R11: 0000000000000000 R12: ffff956a5cd28bb0
R13: 0000000000000000 R14: ffff956a649c9388 R15: 000000011ed82000
FS: 00007fb419959e80(0000) GS:ffff956a7aa00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000fe6d54 CR3: 0000000138696005 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btree_submit_bio_hook+0x67/0xc0 [btrfs]
submit_one_bio+0x31/0x50 [btrfs]
btree_write_cache_pages+0x2db/0x4b0 [btrfs]
? __filemap_fdatawrite_range+0xb1/0x110
do_writepages+0x23/0x80
__filemap_fdatawrite_range+0xd2/0x110
btrfs_write_marked_extents+0x15e/0x180 [btrfs]
btrfs_sync_log+0x206/0x10a0 [btrfs]
? kmem_cache_free+0x315/0x3b0
? btrfs_log_inode+0x1e8/0xf90 [btrfs]
? __mutex_unlock_slowpath+0x45/0x2a0
? lockref_put_or_lock+0x9/0x30
? dput+0x2d/0x580
? dput+0xb5/0x580
? btrfs_sync_file+0x464/0x4d0 [btrfs]
btrfs_sync_file+0x464/0x4d0 [btrfs]
do_fsync+0x38/0x60
__x64_sys_fsync+0x10/0x20
do_syscall_64+0x5c/0x280
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fb41953a6d0
Code: 48 3d ...
RSP: 002b:00007ffcc86bd218 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007fb41953a6d0
RDX: 0000000000000009 RSI: 0000000000040000 RDI: 0000000000000003
RBP: 0000000000040000 R08: 0000000000000001 R09: 0000000000000009
R10: 0000000000000064 R11: 0000000000000246 R12: 0000556cf4b2c060
R13: 0000000000000100 R14: 0000000000000000 R15: 0000556cf322b420
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffffa96bdedf>] copy_process+0x74f/0x2020
softirqs last enabled at (0): [<ffffffffa96bdedf>] copy_process+0x74f/0x2020
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace d543fc76f5ad7fd8 ]---
In that trace the tree checker detected the overlapping checksum items at
the time when we triggered writeback for the log tree when syncing the
log.
Another trace that can happen is due to BUG_ON() when deleting checksum
items while logging an inode:
BTRFS critical (device dm-0): slot 81 key (18446744073709551606 128 13635584) new key (18446744073709551606 128 13635584)
BTRFS info (device dm-0): leaf 30949376 gen 7 total ptrs 98 free space 8527 owner 18446744073709551610
BTRFS info (device dm-0): refs 4 lock (w:1 r:0 bw:0 br:0 sw:1 sr:0) lock_owner 13473 current 13473
item 0 key (257 1 0) itemoff 16123 itemsize 160
inode generation 7 size 262144 mode 100600
item 1 key (257 12 256) itemoff 16103 itemsize 20
item 2 key (257 108 0) itemoff 16050 itemsize 53
extent data disk bytenr 13631488 nr 4096
extent data offset 0 nr 131072 ram 131072
(...)
------------[ cut here ]------------
kernel BUG at fs/btrfs/ctree.c:3153!
invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
CPU: 1 PID: 13473 Comm: fsx Not tainted 5.6.0-rc7-btrfs-next-58 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_set_item_key_safe+0x1ea/0x270 [btrfs]
Code: 0f b6 ...
RSP: 0018:ffff95e3889179d0 EFLAGS: 00010282
RAX: 0000000000000000 RBX: 0000000000000051 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffffb7763988 RDI: 0000000000000001
RBP: fffffffffffffff6 R08: 0000000000000000 R09: 0000000000000001
R10: 00000000000009ef R11: 0000000000000000 R12: ffff8912a8ba5a08
R13: ffff95e388917a06 R14: ffff89138dcf68c8 R15: ffff95e388917ace
FS: 00007fe587084e80(0000) GS:ffff8913baa00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe587091000 CR3: 0000000126dac005 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_del_csums+0x2f4/0x540 [btrfs]
copy_items+0x4b5/0x560 [btrfs]
btrfs_log_inode+0x910/0xf90 [btrfs]
btrfs_log_inode_parent+0x2a0/0xe40 [btrfs]
? dget_parent+0x5/0x370
btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
btrfs_sync_file+0x42b/0x4d0 [btrfs]
__x64_sys_msync+0x199/0x200
do_syscall_64+0x5c/0x280
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fe586c65760
Code: 00 f7 ...
RSP: 002b:00007ffe250f98b8 EFLAGS: 00000246 ORIG_RAX: 000000000000001a
RAX: ffffffffffffffda RBX: 00000000000040e1 RCX: 00007fe586c65760
RDX: 0000000000000004 RSI: 0000000000006b51 RDI: 00007fe58708b000
RBP: 0000000000006a70 R08: 0000000000000003 R09: 00007fe58700cb61
R10: 0000000000000100 R11: 0000000000000246 R12: 00000000000000e1
R13: 00007fe58708b000 R14: 0000000000006b51 R15: 0000558de021a420
Modules linked in: dm_log_writes ...
---[ end trace c92a7f447a8515f5 ]---
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-18 19:14:50 +08:00
|
|
|
/* Used only by log trees, when logging csum items */
|
|
|
|
struct extent_io_tree log_csum_range;
|
|
|
|
|
2018-08-17 23:38:12 +08:00
|
|
|
#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
|
|
|
|
u64 alloc_bytenr;
|
|
|
|
#endif
|
2020-01-24 22:33:00 +08:00
|
|
|
|
|
|
|
#ifdef CONFIG_BTRFS_DEBUG
|
|
|
|
struct list_head leak_list;
|
|
|
|
#endif
|
2007-03-16 00:56:47 +08:00
|
|
|
};
|
2017-05-22 18:16:11 +08:00
|
|
|
|
2020-09-08 18:27:22 +08:00
|
|
|
/*
|
|
|
|
* Structure that conveys information about an extent that is going to replace
|
|
|
|
* all the extents in a file range.
|
|
|
|
*/
|
|
|
|
struct btrfs_replace_extent_info {
|
Btrfs: fix ENOSPC errors, leading to transaction aborts, when cloning extents
When cloning extents (or deduplicating) we create a transaction with a
space reservation that considers we will drop or update a single file
extent item of the destination inode (that we modify a single leaf). That
is fine for the vast majority of scenarios, however it might happen that
we need to drop many file extent items, and adjust at most two file extent
items, in the destination root, which can span multiple leafs. This will
lead to either the call to btrfs_drop_extents() to fail with ENOSPC or
the subsequent calls to btrfs_insert_empty_item() or btrfs_update_inode()
(called through clone_finish_inode_update()) to fail with ENOSPC. Such
failure results in a transaction abort, leaving the filesystem in a
read-only mode.
In order to fix this we need to follow the same approach as the hole
punching code, where we create a local reservation with 1 unit and keep
ending and starting transactions, after balancing the btree inode,
when __btrfs_drop_extents() returns ENOSPC. So fix this by making the
extent cloning call calls the recently added btrfs_punch_hole_range()
helper, which is what does the mentioned work for hole punching, and
make sure whenever we drop extent items in a transaction, we also add a
replacing file extent item, to avoid corruption (a hole) if after ending
a transaction and before starting a new one, the old transaction gets
committed and a power failure happens before we finish cloning.
A test case for fstests follows soon.
Reported-by: David Goodwin <david@codepoets.co.uk>
Link: https://lore.kernel.org/linux-btrfs/a4a4cf31-9cf4-e52c-1f86-c62d336c9cd1@codepoets.co.uk/
Reported-by: Sam Tygier <sam@tygier.co.uk>
Link: https://lore.kernel.org/linux-btrfs/82aace9f-a1e3-1f0b-055f-3ea75f7a41a0@tygier.co.uk/
Fixes: b6f3409b2197e8f ("Btrfs: reserve sufficient space for ioctl clone")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-05 18:09:50 +08:00
|
|
|
u64 disk_offset;
|
|
|
|
u64 disk_len;
|
|
|
|
u64 data_offset;
|
|
|
|
u64 data_len;
|
|
|
|
u64 file_offset;
|
2020-09-08 18:27:21 +08:00
|
|
|
/* Pointer to a file extent item of type regular or prealloc. */
|
Btrfs: fix ENOSPC errors, leading to transaction aborts, when cloning extents
When cloning extents (or deduplicating) we create a transaction with a
space reservation that considers we will drop or update a single file
extent item of the destination inode (that we modify a single leaf). That
is fine for the vast majority of scenarios, however it might happen that
we need to drop many file extent items, and adjust at most two file extent
items, in the destination root, which can span multiple leafs. This will
lead to either the call to btrfs_drop_extents() to fail with ENOSPC or
the subsequent calls to btrfs_insert_empty_item() or btrfs_update_inode()
(called through clone_finish_inode_update()) to fail with ENOSPC. Such
failure results in a transaction abort, leaving the filesystem in a
read-only mode.
In order to fix this we need to follow the same approach as the hole
punching code, where we create a local reservation with 1 unit and keep
ending and starting transactions, after balancing the btree inode,
when __btrfs_drop_extents() returns ENOSPC. So fix this by making the
extent cloning call calls the recently added btrfs_punch_hole_range()
helper, which is what does the mentioned work for hole punching, and
make sure whenever we drop extent items in a transaction, we also add a
replacing file extent item, to avoid corruption (a hole) if after ending
a transaction and before starting a new one, the old transaction gets
committed and a power failure happens before we finish cloning.
A test case for fstests follows soon.
Reported-by: David Goodwin <david@codepoets.co.uk>
Link: https://lore.kernel.org/linux-btrfs/a4a4cf31-9cf4-e52c-1f86-c62d336c9cd1@codepoets.co.uk/
Reported-by: Sam Tygier <sam@tygier.co.uk>
Link: https://lore.kernel.org/linux-btrfs/82aace9f-a1e3-1f0b-055f-3ea75f7a41a0@tygier.co.uk/
Fixes: b6f3409b2197e8f ("Btrfs: reserve sufficient space for ioctl clone")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-05 18:09:50 +08:00
|
|
|
char *extent_buf;
|
btrfs: fix metadata reservation for fallocate that leads to transaction aborts
When doing an fallocate(), specially a zero range operation, we assume
that reserving 3 units of metadata space is enough, that at most we touch
one leaf in subvolume/fs tree for removing existing file extent items and
inserting a new file extent item. This assumption is generally true for
most common use cases. However when we end up needing to remove file extent
items from multiple leaves, we can end up failing with -ENOSPC and abort
the current transaction, turning the filesystem to RO mode. When this
happens a stack trace like the following is dumped in dmesg/syslog:
[ 1500.620934] ------------[ cut here ]------------
[ 1500.620938] BTRFS: Transaction aborted (error -28)
[ 1500.620973] WARNING: CPU: 2 PID: 30807 at fs/btrfs/inode.c:9724 __btrfs_prealloc_file_range+0x512/0x570 [btrfs]
[ 1500.620974] Modules linked in: btrfs intel_rapl_msr intel_rapl_common kvm_intel (...)
[ 1500.621010] CPU: 2 PID: 30807 Comm: xfs_io Tainted: G W 5.9.0-rc3-btrfs-next-67 #1
[ 1500.621012] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[ 1500.621023] RIP: 0010:__btrfs_prealloc_file_range+0x512/0x570 [btrfs]
[ 1500.621026] Code: 8b 40 50 f0 48 (...)
[ 1500.621028] RSP: 0018:ffffb05fc8803ca0 EFLAGS: 00010286
[ 1500.621030] RAX: 0000000000000000 RBX: ffff9608af276488 RCX: 0000000000000000
[ 1500.621032] RDX: 0000000000000001 RSI: 0000000000000027 RDI: 00000000ffffffff
[ 1500.621033] RBP: ffffb05fc8803d90 R08: 0000000000000001 R09: 0000000000000001
[ 1500.621035] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000003200000
[ 1500.621037] R13: 00000000ffffffe4 R14: ffff9608af275fe8 R15: ffff9608af275f60
[ 1500.621039] FS: 00007fb5b2368ec0(0000) GS:ffff9608b6600000(0000) knlGS:0000000000000000
[ 1500.621041] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1500.621043] CR2: 00007fb5b2366fb8 CR3: 0000000202d38005 CR4: 00000000003706e0
[ 1500.621046] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1500.621047] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1500.621049] Call Trace:
[ 1500.621076] btrfs_prealloc_file_range+0x10/0x20 [btrfs]
[ 1500.621087] btrfs_fallocate+0xccd/0x1280 [btrfs]
[ 1500.621108] vfs_fallocate+0x14d/0x290
[ 1500.621112] ksys_fallocate+0x3a/0x70
[ 1500.621117] __x64_sys_fallocate+0x1a/0x20
[ 1500.621120] do_syscall_64+0x33/0x80
[ 1500.621123] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1500.621126] RIP: 0033:0x7fb5b248c477
[ 1500.621128] Code: 89 7c 24 08 (...)
[ 1500.621130] RSP: 002b:00007ffc7bee9060 EFLAGS: 00000293 ORIG_RAX: 000000000000011d
[ 1500.621132] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fb5b248c477
[ 1500.621134] RDX: 0000000000000000 RSI: 0000000000000010 RDI: 0000000000000003
[ 1500.621136] RBP: 0000557718faafd0 R08: 0000000000000000 R09: 0000000000000000
[ 1500.621137] R10: 0000000003200000 R11: 0000000000000293 R12: 0000000000000010
[ 1500.621139] R13: 0000557718faafb0 R14: 0000557718faa480 R15: 0000000000000003
[ 1500.621151] irq event stamp: 1026217
[ 1500.621154] hardirqs last enabled at (1026223): [<ffffffffba965570>] console_unlock+0x500/0x5c0
[ 1500.621156] hardirqs last disabled at (1026228): [<ffffffffba9654c7>] console_unlock+0x457/0x5c0
[ 1500.621159] softirqs last enabled at (1022486): [<ffffffffbb6003dc>] __do_softirq+0x3dc/0x606
[ 1500.621161] softirqs last disabled at (1022477): [<ffffffffbb4010b2>] asm_call_on_stack+0x12/0x20
[ 1500.621162] ---[ end trace 2955b08408d8b9d4 ]---
[ 1500.621167] BTRFS: error (device sdj) in __btrfs_prealloc_file_range:9724: errno=-28 No space left
When we use fallocate() internally, for reserving an extent for a space
cache, inode cache or relocation, we can't hit this problem since either
there aren't any file extent items to remove from the subvolume tree or
there is at most one.
When using plain fallocate() it's very unlikely, since that would require
having many file extent items representing holes for the target range and
crossing multiple leafs - we attempt to increase the range (merge) of such
file extent items when punching holes, so at most we end up with 2 file
extent items for holes at leaf boundaries.
However when using the zero range operation of fallocate() for a large
range (100+ MiB for example) that's fairly easy to trigger. The following
example reproducer triggers the issue:
$ cat reproducer.sh
#!/bin/bash
umount /dev/sdj &> /dev/null
mkfs.btrfs -f -n 16384 -O ^no-holes /dev/sdj > /dev/null
mount /dev/sdj /mnt/sdj
# Create a 100M file with many file extent items. Punch a hole every 8K
# just to speedup the file creation - we could do 4K sequential writes
# followed by fsync (or O_SYNC) as well, but that takes a lot of time.
file_size=$((100 * 1024 * 1024))
xfs_io -f -c "pwrite -S 0xab -b 10M 0 $file_size" /mnt/sdj/foobar
for ((i = 0; i < $file_size; i += 8192)); do
xfs_io -c "fpunch $i 4096" /mnt/sdj/foobar
done
# Force a transaction commit, so the zero range operation will be forced
# to COW all metadata extents it need to touch.
sync
xfs_io -c "fzero 0 $file_size" /mnt/sdj/foobar
umount /mnt/sdj
$ ./reproducer.sh
wrote 104857600/104857600 bytes at offset 0
100 MiB, 10 ops; 0.0669 sec (1.458 GiB/sec and 149.3117 ops/sec)
fallocate: No space left on device
$ dmesg
<shows the same stack trace pasted before>
To fix this use the existing infrastructure that hole punching and
extent cloning use for replacing a file range with another extent. This
deals with doing the removal of file extent items and inserting the new
one using an incremental approach, reserving more space when needed and
always ensuring we don't leave an implicit hole in the range in case
we need to do multiple iterations and a crash happens between iterations.
A test case for fstests will follow up soon.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-09-08 18:27:20 +08:00
|
|
|
/*
|
|
|
|
* Set to true when attempting to replace a file range with a new extent
|
|
|
|
* described by this structure, set to false when attempting to clone an
|
|
|
|
* existing extent into a file range.
|
|
|
|
*/
|
|
|
|
bool is_new_extent;
|
btrfs: add missing inode updates on each iteration when replacing extents
When replacing file extents, called during fallocate, hole punching,
clone and deduplication, we may not be able to replace/drop all the
target file extent items with a single transaction handle. We may get
-ENOSPC while doing it, in which case we release the transaction handle,
balance the dirty pages of the btree inode, flush delayed items and get
a new transaction handle to operate on what's left of the target range.
By dropping and replacing file extent items we have effectively modified
the inode, so we should bump its iversion and update its mtime/ctime
before we update the inode item. This is because if the transaction
we used for partially modifying the inode gets committed by someone after
we release it and before we finish the rest of the range, a power failure
happens, then after mounting the filesystem our inode has an outdated
iversion and mtime/ctime, corresponding to the values it had before we
changed it.
So add the missing iversion and mtime/ctime updates.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-06-06 17:41:18 +08:00
|
|
|
/* Indicate if we should update the inode's mtime and ctime. */
|
|
|
|
bool update_times;
|
btrfs: fix metadata reservation for fallocate that leads to transaction aborts
When doing an fallocate(), specially a zero range operation, we assume
that reserving 3 units of metadata space is enough, that at most we touch
one leaf in subvolume/fs tree for removing existing file extent items and
inserting a new file extent item. This assumption is generally true for
most common use cases. However when we end up needing to remove file extent
items from multiple leaves, we can end up failing with -ENOSPC and abort
the current transaction, turning the filesystem to RO mode. When this
happens a stack trace like the following is dumped in dmesg/syslog:
[ 1500.620934] ------------[ cut here ]------------
[ 1500.620938] BTRFS: Transaction aborted (error -28)
[ 1500.620973] WARNING: CPU: 2 PID: 30807 at fs/btrfs/inode.c:9724 __btrfs_prealloc_file_range+0x512/0x570 [btrfs]
[ 1500.620974] Modules linked in: btrfs intel_rapl_msr intel_rapl_common kvm_intel (...)
[ 1500.621010] CPU: 2 PID: 30807 Comm: xfs_io Tainted: G W 5.9.0-rc3-btrfs-next-67 #1
[ 1500.621012] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[ 1500.621023] RIP: 0010:__btrfs_prealloc_file_range+0x512/0x570 [btrfs]
[ 1500.621026] Code: 8b 40 50 f0 48 (...)
[ 1500.621028] RSP: 0018:ffffb05fc8803ca0 EFLAGS: 00010286
[ 1500.621030] RAX: 0000000000000000 RBX: ffff9608af276488 RCX: 0000000000000000
[ 1500.621032] RDX: 0000000000000001 RSI: 0000000000000027 RDI: 00000000ffffffff
[ 1500.621033] RBP: ffffb05fc8803d90 R08: 0000000000000001 R09: 0000000000000001
[ 1500.621035] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000003200000
[ 1500.621037] R13: 00000000ffffffe4 R14: ffff9608af275fe8 R15: ffff9608af275f60
[ 1500.621039] FS: 00007fb5b2368ec0(0000) GS:ffff9608b6600000(0000) knlGS:0000000000000000
[ 1500.621041] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1500.621043] CR2: 00007fb5b2366fb8 CR3: 0000000202d38005 CR4: 00000000003706e0
[ 1500.621046] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1500.621047] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1500.621049] Call Trace:
[ 1500.621076] btrfs_prealloc_file_range+0x10/0x20 [btrfs]
[ 1500.621087] btrfs_fallocate+0xccd/0x1280 [btrfs]
[ 1500.621108] vfs_fallocate+0x14d/0x290
[ 1500.621112] ksys_fallocate+0x3a/0x70
[ 1500.621117] __x64_sys_fallocate+0x1a/0x20
[ 1500.621120] do_syscall_64+0x33/0x80
[ 1500.621123] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1500.621126] RIP: 0033:0x7fb5b248c477
[ 1500.621128] Code: 89 7c 24 08 (...)
[ 1500.621130] RSP: 002b:00007ffc7bee9060 EFLAGS: 00000293 ORIG_RAX: 000000000000011d
[ 1500.621132] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fb5b248c477
[ 1500.621134] RDX: 0000000000000000 RSI: 0000000000000010 RDI: 0000000000000003
[ 1500.621136] RBP: 0000557718faafd0 R08: 0000000000000000 R09: 0000000000000000
[ 1500.621137] R10: 0000000003200000 R11: 0000000000000293 R12: 0000000000000010
[ 1500.621139] R13: 0000557718faafb0 R14: 0000557718faa480 R15: 0000000000000003
[ 1500.621151] irq event stamp: 1026217
[ 1500.621154] hardirqs last enabled at (1026223): [<ffffffffba965570>] console_unlock+0x500/0x5c0
[ 1500.621156] hardirqs last disabled at (1026228): [<ffffffffba9654c7>] console_unlock+0x457/0x5c0
[ 1500.621159] softirqs last enabled at (1022486): [<ffffffffbb6003dc>] __do_softirq+0x3dc/0x606
[ 1500.621161] softirqs last disabled at (1022477): [<ffffffffbb4010b2>] asm_call_on_stack+0x12/0x20
[ 1500.621162] ---[ end trace 2955b08408d8b9d4 ]---
[ 1500.621167] BTRFS: error (device sdj) in __btrfs_prealloc_file_range:9724: errno=-28 No space left
When we use fallocate() internally, for reserving an extent for a space
cache, inode cache or relocation, we can't hit this problem since either
there aren't any file extent items to remove from the subvolume tree or
there is at most one.
When using plain fallocate() it's very unlikely, since that would require
having many file extent items representing holes for the target range and
crossing multiple leafs - we attempt to increase the range (merge) of such
file extent items when punching holes, so at most we end up with 2 file
extent items for holes at leaf boundaries.
However when using the zero range operation of fallocate() for a large
range (100+ MiB for example) that's fairly easy to trigger. The following
example reproducer triggers the issue:
$ cat reproducer.sh
#!/bin/bash
umount /dev/sdj &> /dev/null
mkfs.btrfs -f -n 16384 -O ^no-holes /dev/sdj > /dev/null
mount /dev/sdj /mnt/sdj
# Create a 100M file with many file extent items. Punch a hole every 8K
# just to speedup the file creation - we could do 4K sequential writes
# followed by fsync (or O_SYNC) as well, but that takes a lot of time.
file_size=$((100 * 1024 * 1024))
xfs_io -f -c "pwrite -S 0xab -b 10M 0 $file_size" /mnt/sdj/foobar
for ((i = 0; i < $file_size; i += 8192)); do
xfs_io -c "fpunch $i 4096" /mnt/sdj/foobar
done
# Force a transaction commit, so the zero range operation will be forced
# to COW all metadata extents it need to touch.
sync
xfs_io -c "fzero 0 $file_size" /mnt/sdj/foobar
umount /mnt/sdj
$ ./reproducer.sh
wrote 104857600/104857600 bytes at offset 0
100 MiB, 10 ops; 0.0669 sec (1.458 GiB/sec and 149.3117 ops/sec)
fallocate: No space left on device
$ dmesg
<shows the same stack trace pasted before>
To fix this use the existing infrastructure that hole punching and
extent cloning use for replacing a file range with another extent. This
deals with doing the removal of file extent items and inserting the new
one using an incremental approach, reserving more space when needed and
always ensuring we don't leave an implicit hole in the range in case
we need to do multiple iterations and a crash happens between iterations.
A test case for fstests will follow up soon.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-09-08 18:27:20 +08:00
|
|
|
/* Meaningful only if is_new_extent is true. */
|
|
|
|
int qgroup_reserved;
|
|
|
|
/*
|
|
|
|
* Meaningful only if is_new_extent is true.
|
|
|
|
* Used to track how many extent items we have already inserted in a
|
|
|
|
* subvolume tree that refer to the extent described by this structure,
|
|
|
|
* so that we know when to create a new delayed ref or update an existing
|
|
|
|
* one.
|
|
|
|
*/
|
|
|
|
int insertions;
|
Btrfs: fix ENOSPC errors, leading to transaction aborts, when cloning extents
When cloning extents (or deduplicating) we create a transaction with a
space reservation that considers we will drop or update a single file
extent item of the destination inode (that we modify a single leaf). That
is fine for the vast majority of scenarios, however it might happen that
we need to drop many file extent items, and adjust at most two file extent
items, in the destination root, which can span multiple leafs. This will
lead to either the call to btrfs_drop_extents() to fail with ENOSPC or
the subsequent calls to btrfs_insert_empty_item() or btrfs_update_inode()
(called through clone_finish_inode_update()) to fail with ENOSPC. Such
failure results in a transaction abort, leaving the filesystem in a
read-only mode.
In order to fix this we need to follow the same approach as the hole
punching code, where we create a local reservation with 1 unit and keep
ending and starting transactions, after balancing the btree inode,
when __btrfs_drop_extents() returns ENOSPC. So fix this by making the
extent cloning call calls the recently added btrfs_punch_hole_range()
helper, which is what does the mentioned work for hole punching, and
make sure whenever we drop extent items in a transaction, we also add a
replacing file extent item, to avoid corruption (a hole) if after ending
a transaction and before starting a new one, the old transaction gets
committed and a power failure happens before we finish cloning.
A test case for fstests follows soon.
Reported-by: David Goodwin <david@codepoets.co.uk>
Link: https://lore.kernel.org/linux-btrfs/a4a4cf31-9cf4-e52c-1f86-c62d336c9cd1@codepoets.co.uk/
Reported-by: Sam Tygier <sam@tygier.co.uk>
Link: https://lore.kernel.org/linux-btrfs/82aace9f-a1e3-1f0b-055f-3ea75f7a41a0@tygier.co.uk/
Fixes: b6f3409b2197e8f ("Btrfs: reserve sufficient space for ioctl clone")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-05 18:09:50 +08:00
|
|
|
};
|
|
|
|
|
2020-11-04 19:07:32 +08:00
|
|
|
/* Arguments for btrfs_drop_extents() */
|
|
|
|
struct btrfs_drop_extents_args {
|
|
|
|
/* Input parameters */
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If NULL, btrfs_drop_extents() will allocate and free its own path.
|
|
|
|
* If 'replace_extent' is true, this must not be NULL. Also the path
|
|
|
|
* is always released except if 'replace_extent' is true and
|
|
|
|
* btrfs_drop_extents() sets 'extent_inserted' to true, in which case
|
|
|
|
* the path is kept locked.
|
|
|
|
*/
|
|
|
|
struct btrfs_path *path;
|
|
|
|
/* Start offset of the range to drop extents from */
|
|
|
|
u64 start;
|
|
|
|
/* End (exclusive, last byte + 1) of the range to drop extents from */
|
|
|
|
u64 end;
|
|
|
|
/* If true drop all the extent maps in the range */
|
|
|
|
bool drop_cache;
|
|
|
|
/*
|
|
|
|
* If true it means we want to insert a new extent after dropping all
|
|
|
|
* the extents in the range. If this is true, the 'extent_item_size'
|
|
|
|
* parameter must be set as well and the 'extent_inserted' field will
|
|
|
|
* be set to true by btrfs_drop_extents() if it could insert the new
|
|
|
|
* extent.
|
|
|
|
* Note: when this is set to true the path must not be NULL.
|
|
|
|
*/
|
|
|
|
bool replace_extent;
|
|
|
|
/*
|
|
|
|
* Used if 'replace_extent' is true. Size of the file extent item to
|
|
|
|
* insert after dropping all existing extents in the range
|
|
|
|
*/
|
|
|
|
u32 extent_item_size;
|
|
|
|
|
|
|
|
/* Output parameters */
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Set to the minimum between the input parameter 'end' and the end
|
|
|
|
* (exclusive, last byte + 1) of the last dropped extent. This is always
|
|
|
|
* set even if btrfs_drop_extents() returns an error.
|
|
|
|
*/
|
|
|
|
u64 drop_end;
|
btrfs: update the number of bytes used by an inode atomically
There are several occasions where we do not update the inode's number of
used bytes atomically, resulting in a concurrent stat(2) syscall to report
a value of used blocks that does not correspond to a valid value, that is,
a value that does not match neither what we had before the operation nor
what we get after the operation completes.
In extreme cases it can result in stat(2) reporting zero used blocks, which
can cause problems for some userspace tools where they can consider a file
with a non-zero size and zero used blocks as completely sparse and skip
reading data, as reported/discussed a long time ago in some threads like
the following:
https://lists.gnu.org/archive/html/bug-tar/2016-07/msg00001.html
The cases where this can happen are the following:
-> Case 1
If we do a write (buffered or direct IO) against a file region for which
there is already an allocated extent (or multiple extents), then we have a
short time window where we can report a number of used blocks to stat(2)
that does not take into account the file region being overwritten. This
short time window happens when completing the ordered extent(s).
This happens because when we drop the extents in the write range we
decrement the inode's number of bytes and later on when we insert the new
extent(s) we increment the number of bytes in the inode, resulting in a
short time window where a stat(2) syscall can get an incorrect number of
used blocks.
If we do writes that overwrite an entire file, then we have a short time
window where we report 0 used blocks to stat(2).
Example reproducer:
$ cat reproducer-1.sh
#!/bin/bash
MNT=/mnt/sdi
DEV=/dev/sdi
stat_loop()
{
trap "wait; exit" SIGTERM
local filepath=$1
local expected=$2
local got
while :; do
got=$(stat -c %b $filepath)
if [ $got -ne $expected ]; then
echo -n "ERROR: unexpected used blocks"
echo " (got: $got expected: $expected)"
fi
done
}
mkfs.btrfs -f $DEV > /dev/null
# mkfs.xfs -f $DEV > /dev/null
# mkfs.ext4 -F $DEV > /dev/null
# mkfs.f2fs -f $DEV > /dev/null
# mkfs.reiserfs -f $DEV > /dev/null
mount $DEV $MNT
xfs_io -f -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null
expected=$(stat -c %b $MNT/foobar)
# Create a process to keep calling stat(2) on the file and see if the
# reported number of blocks used (disk space used) changes, it should
# not because we are not increasing the file size nor punching holes.
stat_loop $MNT/foobar $expected &
loop_pid=$!
for ((i = 0; i < 50000; i++)); do
xfs_io -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null
done
kill $loop_pid &> /dev/null
wait
umount $DEV
$ ./reproducer-1.sh
ERROR: unexpected used blocks (got: 0 expected: 128)
ERROR: unexpected used blocks (got: 0 expected: 128)
(...)
Note that since this is a short time window where the race can happen, the
reproducer may not be able to always trigger the bug in one run, or it may
trigger it multiple times.
-> Case 2
If we do a buffered write against a file region that does not have any
allocated extents, like a hole or beyond EOF, then during ordered extent
completion we have a short time window where a concurrent stat(2) syscall
can report a number of used blocks that does not correspond to the value
before or after the write operation, a value that is actually larger than
the value after the write completes.
This happens because once we start a buffered write into an unallocated
file range we increment the inode's 'new_delalloc_bytes', to make sure
any stat(2) call gets a correct used blocks value before delalloc is
flushed and completes. However at ordered extent completion, after we
inserted the new extent, we increment the inode's number of bytes used
with the size of the new extent, and only later, when clearing the range
in the inode's iotree, we decrement the inode's 'new_delalloc_bytes'
counter with the size of the extent. So this results in a short time
window where a concurrent stat(2) syscall can report a number of used
blocks that accounts for the new extent twice.
Example reproducer:
$ cat reproducer-2.sh
#!/bin/bash
MNT=/mnt/sdi
DEV=/dev/sdi
stat_loop()
{
trap "wait; exit" SIGTERM
local filepath=$1
local expected=$2
local got
while :; do
got=$(stat -c %b $filepath)
if [ $got -ne $expected ]; then
echo -n "ERROR: unexpected used blocks"
echo " (got: $got expected: $expected)"
fi
done
}
mkfs.btrfs -f $DEV > /dev/null
# mkfs.xfs -f $DEV > /dev/null
# mkfs.ext4 -F $DEV > /dev/null
# mkfs.f2fs -f $DEV > /dev/null
# mkfs.reiserfs -f $DEV > /dev/null
mount $DEV $MNT
touch $MNT/foobar
write_size=$((64 * 1024))
for ((i = 0; i < 16384; i++)); do
offset=$(($i * $write_size))
xfs_io -c "pwrite -S 0xab $offset $write_size" $MNT/foobar >/dev/null
blocks_used=$(stat -c %b $MNT/foobar)
# Fsync the file to trigger writeback and keep calling stat(2) on it
# to see if the number of blocks used changes.
stat_loop $MNT/foobar $blocks_used &
loop_pid=$!
xfs_io -c "fsync" $MNT/foobar
kill $loop_pid &> /dev/null
wait $loop_pid
done
umount $DEV
$ ./reproducer-2.sh
ERROR: unexpected used blocks (got: 265472 expected: 265344)
ERROR: unexpected used blocks (got: 284032 expected: 283904)
(...)
Note that since this is a short time window where the race can happen, the
reproducer may not be able to always trigger the bug in one run, or it may
trigger it multiple times.
-> Case 3
Another case where such problems happen is during other operations that
replace extents in a file range with other extents. Those operations are
extent cloning, deduplication and fallocate's zero range operation.
The cause of the problem is similar to the first case. When we drop the
extents from a range, we decrement the inode's number of bytes, and later
on, after inserting the new extents we increment it. Since this is not
done atomically, a concurrent stat(2) call can see and return a number of
used blocks that is smaller than it should be, does not match the number
of used blocks before or after the clone/deduplication/zero operation.
Like for the first case, when doing a clone, deduplication or zero range
operation against an entire file, we end up having a time window where we
can report 0 used blocks to a stat(2) call.
Example reproducer:
$ cat reproducer-3.sh
#!/bin/bash
MNT=/mnt/sdi
DEV=/dev/sdi
mkfs.btrfs -f $DEV > /dev/null
# mkfs.xfs -f -m reflink=1 $DEV > /dev/null
mount $DEV $MNT
extent_size=$((64 * 1024))
num_extents=16384
file_size=$(($extent_size * $num_extents))
# File foo has many small extents.
xfs_io -f -s -c "pwrite -S 0xab -b $extent_size 0 $file_size" $MNT/foo \
> /dev/null
# File bar has much less extents and has exactly the same data as foo.
xfs_io -f -c "pwrite -S 0xab 0 $file_size" $MNT/bar > /dev/null
expected=$(stat -c %b $MNT/foo)
# Now deduplicate bar into foo. While the deduplication is in progres,
# the number of used blocks/file size reported by stat should not change
xfs_io -c "dedupe $MNT/bar 0 0 $file_size" $MNT/foo > /dev/null &
dedupe_pid=$!
while [ -n "$(ps -p $dedupe_pid -o pid=)" ]; do
used=$(stat -c %b $MNT/foo)
if [ $used -ne $expected ]; then
echo "Unexpected blocks used: $used (expected: $expected)"
fi
done
umount $DEV
$ ./reproducer-3.sh
Unexpected blocks used: 2076800 (expected: 2097152)
Unexpected blocks used: 2097024 (expected: 2097152)
Unexpected blocks used: 2079872 (expected: 2097152)
(...)
Note that since this is a short time window where the race can happen, the
reproducer may not be able to always trigger the bug in one run, or it may
trigger it multiple times.
So fix this by:
1) Making btrfs_drop_extents() not decrement the VFS inode's number of
bytes, and instead return the number of bytes;
2) Making any code that drops extents and adds new extents update the
inode's number of bytes atomically, while holding the btrfs inode's
spinlock, which is also used by the stat(2) callback to get the inode's
number of bytes;
3) For ranges in the inode's iotree that are marked as 'delalloc new',
corresponding to previously unallocated ranges, increment the inode's
number of bytes when clearing the 'delalloc new' bit from the range,
in the same critical section that decrements the inode's
'new_delalloc_bytes' counter, delimited by the btrfs inode's spinlock.
An alternative would be to have btrfs_getattr() wait for any IO (ordered
extents in progress) and locking the whole range (0 to (u64)-1) while it
it computes the number of blocks used. But that would mean blocking
stat(2), which is a very used syscall and expected to be fast, waiting
for writes, clone/dedupe, fallocate, page reads, fiemap, etc.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-04 19:07:34 +08:00
|
|
|
/*
|
|
|
|
* The number of allocated bytes found in the range. This can be smaller
|
|
|
|
* than the range's length when there are holes in the range.
|
|
|
|
*/
|
|
|
|
u64 bytes_found;
|
2020-11-04 19:07:32 +08:00
|
|
|
/*
|
|
|
|
* Only set if 'replace_extent' is true. Set to true if we were able
|
|
|
|
* to insert a replacement extent after dropping all extents in the
|
|
|
|
* range, otherwise set to false by btrfs_drop_extents().
|
|
|
|
* Also, if btrfs_drop_extents() has set this to true it means it
|
|
|
|
* returned with the path locked, otherwise if it has set this to
|
|
|
|
* false it has returned with the path released.
|
|
|
|
*/
|
|
|
|
bool extent_inserted;
|
|
|
|
};
|
|
|
|
|
2017-07-25 03:14:25 +08:00
|
|
|
struct btrfs_file_private {
|
|
|
|
void *filldir_buf;
|
|
|
|
};
|
|
|
|
|
2007-03-16 00:56:47 +08:00
|
|
|
|
2016-06-15 21:22:56 +08:00
|
|
|
static inline u32 BTRFS_LEAF_DATA_SIZE(const struct btrfs_fs_info *info)
|
2016-06-15 22:33:06 +08:00
|
|
|
{
|
2017-05-22 18:16:11 +08:00
|
|
|
|
|
|
|
return info->nodesize - sizeof(struct btrfs_header);
|
2016-06-15 22:33:06 +08:00
|
|
|
}
|
|
|
|
|
2016-06-15 21:22:56 +08:00
|
|
|
static inline u32 BTRFS_MAX_ITEM_SIZE(const struct btrfs_fs_info *info)
|
2016-06-15 22:33:06 +08:00
|
|
|
{
|
2016-06-15 21:22:56 +08:00
|
|
|
return BTRFS_LEAF_DATA_SIZE(info) - sizeof(struct btrfs_item);
|
2016-06-15 22:33:06 +08:00
|
|
|
}
|
|
|
|
|
2016-06-15 21:22:56 +08:00
|
|
|
static inline u32 BTRFS_NODEPTRS_PER_BLOCK(const struct btrfs_fs_info *info)
|
2016-06-15 22:33:06 +08:00
|
|
|
{
|
2016-06-15 21:22:56 +08:00
|
|
|
return BTRFS_LEAF_DATA_SIZE(info) / sizeof(struct btrfs_key_ptr);
|
2016-06-15 22:33:06 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
#define BTRFS_FILE_EXTENT_INLINE_DATA_START \
|
|
|
|
(offsetof(struct btrfs_file_extent_item, disk_bytenr))
|
2016-06-15 21:22:56 +08:00
|
|
|
static inline u32 BTRFS_MAX_INLINE_DATA_SIZE(const struct btrfs_fs_info *info)
|
2016-06-15 22:33:06 +08:00
|
|
|
{
|
2016-06-15 21:22:56 +08:00
|
|
|
return BTRFS_MAX_ITEM_SIZE(info) -
|
2016-06-15 22:33:06 +08:00
|
|
|
BTRFS_FILE_EXTENT_INLINE_DATA_START;
|
|
|
|
}
|
|
|
|
|
2016-06-15 21:22:56 +08:00
|
|
|
static inline u32 BTRFS_MAX_XATTR_SIZE(const struct btrfs_fs_info *info)
|
2016-06-15 22:33:06 +08:00
|
|
|
{
|
2016-06-15 21:22:56 +08:00
|
|
|
return BTRFS_MAX_ITEM_SIZE(info) - sizeof(struct btrfs_dir_item);
|
2016-06-15 22:33:06 +08:00
|
|
|
}
|
|
|
|
|
2016-01-21 18:25:53 +08:00
|
|
|
#define BTRFS_BYTES_TO_BLKS(fs_info, bytes) \
|
2020-07-02 03:19:09 +08:00
|
|
|
((bytes) >> (fs_info)->sectorsize_bits)
|
2016-01-21 18:25:53 +08:00
|
|
|
|
2019-05-22 16:18:59 +08:00
|
|
|
static inline u32 btrfs_crc32c(u32 crc, const void *address, unsigned length)
|
|
|
|
{
|
|
|
|
return crc32c(crc, address, length);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void btrfs_crc32c_final(u32 crc, u8 *result)
|
|
|
|
{
|
|
|
|
put_unaligned_le32(~crc, result);
|
|
|
|
}
|
|
|
|
|
btrfs: Remove custom crc32c init code
The custom crc32 init code was introduced in
14a958e678cd ("Btrfs: fix btrfs boot when compiled as built-in") to
enable using btrfs as a built-in. However, later as pointed out by
60efa5eb2e88 ("Btrfs: use late_initcall instead of module_init") this
wasn't enough and finally btrfs was switched to late_initcall which
comes after the generic crc32c implementation is initiliased. The
latter commit superseeded the former. Now that we don't have to
maintain our own code let's just remove it and switch to using the
generic implementation.
Despite touching a lot of files the patch is really simple. Here is the gist of
the changes:
1. Select LIBCRC32C rather than the low-level modules.
2. s/btrfs_crc32c/crc32c/g
3. replace hash.h with linux/crc32c.h
4. Move the btrfs namehash funcs to ctree.h and change the tree accordingly.
I've tested this with btrfs being both a module and a built-in and xfstest
doesn't complain.
Does seem to fix the longstanding problem of not automatically selectiong
the crc32c module when btrfs is used. Possibly there is a workaround in
dracut.
The modinfo confirms that now all the module dependencies are there:
before:
depends: zstd_compress,zstd_decompress,raid6_pq,xor,zlib_deflate
after:
depends: libcrc32c,zstd_compress,zstd_decompress,raid6_pq,xor,zlib_deflate
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add more info to changelog from mails ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-08 17:45:05 +08:00
|
|
|
static inline u64 btrfs_name_hash(const char *name, int len)
|
|
|
|
{
|
|
|
|
return crc32c((u32)~1, name, len);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Figure the key offset of an extended inode ref
|
|
|
|
*/
|
|
|
|
static inline u64 btrfs_extref_hash(u64 parent_objectid, const char *name,
|
|
|
|
int len)
|
|
|
|
{
|
|
|
|
return (u64) crc32c(parent_objectid, name, len);
|
|
|
|
}
|
|
|
|
|
2011-09-22 03:05:58 +08:00
|
|
|
static inline gfp_t btrfs_alloc_write_mask(struct address_space *mapping)
|
|
|
|
{
|
2015-11-07 08:28:49 +08:00
|
|
|
return mapping_gfp_constraint(mapping, ~__GFP_FS);
|
2011-09-22 03:05:58 +08:00
|
|
|
}
|
|
|
|
|
2007-04-18 01:26:50 +08:00
|
|
|
/* extent-tree.c */
|
2015-02-04 22:59:29 +08:00
|
|
|
|
2017-08-19 05:15:18 +08:00
|
|
|
enum btrfs_inline_ref_type {
|
2018-11-27 22:25:13 +08:00
|
|
|
BTRFS_REF_TYPE_INVALID,
|
|
|
|
BTRFS_REF_TYPE_BLOCK,
|
|
|
|
BTRFS_REF_TYPE_DATA,
|
|
|
|
BTRFS_REF_TYPE_ANY,
|
2017-08-19 05:15:18 +08:00
|
|
|
};
|
|
|
|
|
|
|
|
int btrfs_get_extent_inline_ref_type(const struct extent_buffer *eb,
|
|
|
|
struct btrfs_extent_inline_ref *iref,
|
|
|
|
enum btrfs_inline_ref_type is_data);
|
2019-08-09 09:24:24 +08:00
|
|
|
u64 hash_extent_data_ref(u64 root_objectid, u64 owner, u64 offset);
|
2017-08-19 05:15:18 +08:00
|
|
|
|
btrfs: implement delayed inode items operation
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
root's radix tree, and letting btrfs inodes go.
Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which is spotted by
Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which is reported by
Itaru Kitayama.
Changelog V3 -> V4:
- Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
inode in time.
Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
balance, which is reported by Tsutomu Itoh.
- Modify the patch address David Sterba's comment.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason
Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
which is created for every directory and file, and used to manage the
delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.
Compare with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. the reason is that btrfs must do a lot of b+ tree insertions,
such as inode item, directory name item, directory name index and so on.
If we can do some delayed b+ tree insertion or deletion, we can improve the
performance, so we made this patch which implemented delayed directory name
index insertion/deletion and delayed inode update.
Implementation:
- introduce a delayed root object into the filesystem, that use two lists to
manage the delayed nodes which are created for every file/directory.
One is used to manage all the delayed nodes that have delayed items. And the
other is used to manage the delayed nodes which is waiting to be dealt with
by the work thread.
- Every delayed node has two rb-tree, one is used to manage the directory name
index which is going to be inserted into b+ tree, and the other is used to
manage the directory name index which is going to be deleted from b+ tree.
- introduce a worker to deal with the delayed operation. This worker is used
to deal with the works of the delayed directory name index items insertion
and deletion and the delayed inode update.
When the delayed items is beyond the lower limit, we create works for some
delayed nodes and insert them into the work queue of the worker, and then
go back.
When the delayed items is beyond the upper bound, we create works for all
the delayed nodes that haven't been dealt with, and insert them into the work
queue of the worker, and then wait for that the untreated items is below some
threshold value.
- When we want to insert a directory name index into b+ tree, we just add the
information into the delayed inserting rb-tree.
And then we check the number of the delayed items and do delayed items
balance. (The balance policy is above.)
- When we want to delete a directory name index from the b+ tree, we search it
in the inserting rb-tree at first. If we look it up, just drop it. If not,
add the key of it into the delayed deleting rb-tree.
Similar to the delayed inserting rb-tree, we also check the number of the
delayed items and do delayed items balance.
(The same to inserting manipulation)
- When we want to update the metadata of some inode, we cached the data of the
inode into the delayed node. the worker will flush it into the b+ tree after
dealing with the delayed insertion and deletion.
- We will move the delayed node to the tail of the list after we access the
delayed node, By this way, we can cache more delayed items and merge more
inode updates.
- If we want to commit transaction, we will deal with all the delayed node.
- the delayed node will be freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
and the delayed inode update.
I did a quick test by the benchmark tool[1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.
Before applying this patch:
Create files:
Total files: 50000
Total time: 1.096108
Average time: 0.000022
Delete files:
Total files: 50000
Total time: 1.510403
Average time: 0.000030
After applying this patch:
Create files:
Total files: 50000
Total time: 0.932899
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.215732
Average time: 0.000024
[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
Many thanks for Kitayama-san's help!
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-22 18:12:22 +08:00
|
|
|
|
2019-06-21 03:37:49 +08:00
|
|
|
int btrfs_add_excluded_extent(struct btrfs_fs_info *fs_info,
|
|
|
|
u64 start, u64 num_bytes);
|
2019-10-30 02:20:18 +08:00
|
|
|
void btrfs_free_excluded_extents(struct btrfs_block_group *cache);
|
2009-03-13 22:10:06 +08:00
|
|
|
int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
|
2018-03-15 23:27:37 +08:00
|
|
|
unsigned long count);
|
2018-11-22 03:05:41 +08:00
|
|
|
void btrfs_cleanup_ref_head_accounting(struct btrfs_fs_info *fs_info,
|
|
|
|
struct btrfs_delayed_ref_root *delayed_refs,
|
|
|
|
struct btrfs_delayed_ref_head *head);
|
2016-06-23 06:54:24 +08:00
|
|
|
int btrfs_lookup_data_extent(struct btrfs_fs_info *fs_info, u64 start, u64 len);
|
2010-05-16 22:48:46 +08:00
|
|
|
int btrfs_lookup_extent_info(struct btrfs_trans_handle *trans,
|
2016-06-23 06:54:24 +08:00
|
|
|
struct btrfs_fs_info *fs_info, u64 bytenr,
|
2013-03-08 03:22:04 +08:00
|
|
|
u64 offset, int metadata, u64 *refs, u64 *flags);
|
2020-01-20 22:09:09 +08:00
|
|
|
int btrfs_pin_extent(struct btrfs_trans_handle *trans, u64 bytenr, u64 num,
|
|
|
|
int reserved);
|
2020-01-20 22:09:13 +08:00
|
|
|
int btrfs_pin_extent_for_log_replay(struct btrfs_trans_handle *trans,
|
2011-11-01 08:52:39 +08:00
|
|
|
u64 bytenr, u64 num_bytes);
|
2019-03-20 19:14:33 +08:00
|
|
|
int btrfs_exclude_logged_extents(struct extent_buffer *eb);
|
2017-01-31 04:25:28 +08:00
|
|
|
int btrfs_cross_ref_exist(struct btrfs_root *root,
|
2022-03-24 00:19:26 +08:00
|
|
|
u64 objectid, u64 offset, u64 bytenr, bool strict,
|
|
|
|
struct btrfs_path *path);
|
2014-06-15 07:54:12 +08:00
|
|
|
struct extent_buffer *btrfs_alloc_tree_block(struct btrfs_trans_handle *trans,
|
2017-01-18 15:24:37 +08:00
|
|
|
struct btrfs_root *root,
|
|
|
|
u64 parent, u64 root_objectid,
|
|
|
|
const struct btrfs_disk_key *key,
|
|
|
|
int level, u64 hint,
|
2020-08-20 23:46:03 +08:00
|
|
|
u64 empty_size,
|
|
|
|
enum btrfs_lock_nesting nest);
|
2010-05-16 22:46:25 +08:00
|
|
|
void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
|
2021-12-13 16:45:12 +08:00
|
|
|
u64 root_id,
|
2010-05-16 22:46:25 +08:00
|
|
|
struct extent_buffer *buf,
|
2012-05-16 23:04:52 +08:00
|
|
|
u64 parent, int last_ref);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
int btrfs_alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
|
2017-09-30 03:43:49 +08:00
|
|
|
struct btrfs_root *root, u64 owner,
|
2015-10-26 14:11:18 +08:00
|
|
|
u64 offset, u64 ram_bytes,
|
|
|
|
struct btrfs_key *ins);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
int btrfs_alloc_logged_file_extent(struct btrfs_trans_handle *trans,
|
|
|
|
u64 root_objectid, u64 owner, u64 offset,
|
|
|
|
struct btrfs_key *ins);
|
btrfs: update btrfs_space_info's bytes_may_use timely
This patch can fix some false ENOSPC errors, below test script can
reproduce one false ENOSPC error:
#!/bin/bash
dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=128
dev=$(losetup --show -f fs.img)
mkfs.btrfs -f -M $dev
mkdir /tmp/mntpoint
mount $dev /tmp/mntpoint
cd /tmp/mntpoint
xfs_io -f -c "falloc 0 $((64*1024*1024))" testfile
Above script will fail for ENOSPC reason, but indeed fs still has free
space to satisfy this request. Please see call graph:
btrfs_fallocate()
|-> btrfs_alloc_data_chunk_ondemand()
| bytes_may_use += 64M
|-> btrfs_prealloc_file_range()
|-> btrfs_reserve_extent()
|-> btrfs_add_reserved_bytes()
| alloc_type is RESERVE_ALLOC_NO_ACCOUNT, so it does not
| change bytes_may_use, and bytes_reserved += 64M. Now
| bytes_may_use + bytes_reserved == 128M, which is greater
| than btrfs_space_info's total_bytes, false enospc occurs.
| Note, the bytes_may_use decrease operation will be done in
| end of btrfs_fallocate(), which is too late.
Here is another simple case for buffered write:
CPU 1 | CPU 2
|
|-> cow_file_range() |-> __btrfs_buffered_write()
|-> btrfs_reserve_extent() | |
| | |
| | |
| ..... | |-> btrfs_check_data_free_space()
| |
| |
|-> extent_clear_unlock_delalloc() |
In CPU 1, btrfs_reserve_extent()->find_free_extent()->
btrfs_add_reserved_bytes() do not decrease bytes_may_use, the decrease
operation will be delayed to be done in extent_clear_unlock_delalloc().
Assume in this case, btrfs_reserve_extent() reserved 128MB data, CPU2's
btrfs_check_data_free_space() tries to reserve 100MB data space.
If
100MB > data_sinfo->total_bytes - data_sinfo->bytes_used -
data_sinfo->bytes_reserved - data_sinfo->bytes_pinned -
data_sinfo->bytes_readonly - data_sinfo->bytes_may_use
btrfs_check_data_free_space() will try to allcate new data chunk or call
btrfs_start_delalloc_roots(), or commit current transaction in order to
reserve some free space, obviously a lot of work. But indeed it's not
necessary as long as decreasing bytes_may_use timely, we still have
free space, decreasing 128M from bytes_may_use.
To fix this issue, this patch chooses to update bytes_may_use for both
data and metadata in btrfs_add_reserved_bytes(). For compress path, real
extent length may not be equal to file content length, so introduce a
ram_bytes argument for btrfs_reserve_extent(), find_free_extent() and
btrfs_add_reserved_bytes(), it's becasue bytes_may_use is increased by
file content length. Then compress path can update bytes_may_use
correctly. Also now we can discard RESERVE_ALLOC_NO_ACCOUNT, RESERVE_ALLOC
and RESERVE_FREE.
As we know, usually EXTENT_DO_ACCOUNTING is used for error path. In
run_delalloc_nocow(), for inode marked as NODATACOW or extent marked as
PREALLOC, we also need to update bytes_may_use, but can not pass
EXTENT_DO_ACCOUNTING, because it also clears metadata reservation, so
here we introduce EXTENT_CLEAR_DATA_RESV flag to indicate btrfs_clear_bit_hook()
to update btrfs_space_info's bytes_may_use.
Meanwhile __btrfs_prealloc_file_range() will call
btrfs_free_reserved_data_space() internally for both sucessful and failed
path, btrfs_prealloc_file_range()'s callers does not need to call
btrfs_free_reserved_data_space() any more.
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-07-25 15:51:40 +08:00
|
|
|
int btrfs_reserve_extent(struct btrfs_root *root, u64 ram_bytes, u64 num_bytes,
|
2013-08-15 02:02:47 +08:00
|
|
|
u64 min_alloc_size, u64 empty_size, u64 hint_byte,
|
Btrfs: fix broken free space cache after the system crashed
When we mounted the filesystem after the crash, we got the following
message:
BTRFS error (device xxx): block group xxxx has wrong amount of free space
BTRFS error (device xxx): failed to load free space cache for block group xxx
It is because we didn't update the metadata of the allocated space (in extent
tree) until the file data was written into the disk. During this time, there was
no information about the allocated spaces in either the extent tree nor the
free space cache. when we wrote out the free space cache at this time (commit
transaction), those spaces were lost. In fact, only the free space that is
used to store the file data had this problem, the others didn't because
the metadata of them is updated in the same transaction context.
There are many methods which can fix the above problem
- track the allocated space, and write it out when we write out the free
space cache
- account the size of the allocated space that is used to store the file
data, if the size is not zero, don't write out the free space cache.
The first one is complex and may make the performance drop down.
This patch chose the second method, we use a per-block-group variant to
account the size of that allocated space. Besides that, we also introduce
a per-block-group read-write semaphore to avoid the race between
the allocation and the free space cache write out.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-19 10:42:50 +08:00
|
|
|
struct btrfs_key *ins, int is_data, int delalloc);
|
2007-03-17 04:20:31 +08:00
|
|
|
int btrfs_inc_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
|
2014-07-03 01:54:25 +08:00
|
|
|
struct extent_buffer *buf, int full_backref);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
int btrfs_dec_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
|
2014-07-03 01:54:25 +08:00
|
|
|
struct extent_buffer *buf, int full_backref);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
int btrfs_set_disk_extent_flags(struct btrfs_trans_handle *trans,
|
2019-03-20 18:45:48 +08:00
|
|
|
struct extent_buffer *eb, u64 flags, int level);
|
2019-04-04 14:45:36 +08:00
|
|
|
int btrfs_free_extent(struct btrfs_trans_handle *trans, struct btrfs_ref *ref);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
|
2016-06-23 06:54:24 +08:00
|
|
|
int btrfs_free_reserved_extent(struct btrfs_fs_info *fs_info,
|
|
|
|
u64 start, u64 len, int delalloc);
|
2020-01-20 22:09:12 +08:00
|
|
|
int btrfs_pin_reserved_extent(struct btrfs_trans_handle *trans, u64 start,
|
2019-11-21 20:03:31 +08:00
|
|
|
u64 len);
|
2018-03-15 22:00:26 +08:00
|
|
|
int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans);
|
2007-04-18 01:26:50 +08:00
|
|
|
int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
|
2019-04-04 14:45:35 +08:00
|
|
|
struct btrfs_ref *generic_ref);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
|
2009-03-11 00:39:20 +08:00
|
|
|
void btrfs_clear_space_info_full(struct btrfs_fs_info *info);
|
Btrfs: improve the noflush reservation
In some places(such as: evicting inode), we just can not flush the reserved
space of delalloc, flushing the delayed directory index and delayed inode
is OK, but we don't try to flush those things and just go back when there is
no enough space to be reserved. This patch fixes this problem.
We defined 3 types of the flush operations: NO_FLUSH, FLUSH_LIMIT and FLUSH_ALL.
If we can in the transaction, we should not flush anything, or the deadlock
would happen, so use NO_FLUSH. If we flushing the reserved space of delalloc
would cause deadlock, use FLUSH_LIMIT. In the other cases, FLUSH_ALL is used,
and we will flush all things.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-10-16 19:33:38 +08:00
|
|
|
|
2013-02-28 18:04:33 +08:00
|
|
|
int btrfs_subvolume_reserve_metadata(struct btrfs_root *root,
|
|
|
|
struct btrfs_block_rsv *rsv,
|
2018-05-30 11:00:38 +08:00
|
|
|
int nitems, bool use_global_rsv);
|
btrfs: qgroup: fix qgroup meta rsv leak for subvolume operations
[BUG]
When quota is enabled for TEST_DEV, generic/013 sometimes fails like this:
generic/013 14s ... _check_dmesg: something found in dmesg (see xfstests-dev/results//generic/013.dmesg)
And with the following metadata leak:
BTRFS warning (device dm-3): qgroup 0/1370 has unreleased space, type 2 rsv 49152
------------[ cut here ]------------
WARNING: CPU: 2 PID: 47912 at fs/btrfs/disk-io.c:4078 close_ctree+0x1dc/0x323 [btrfs]
Call Trace:
btrfs_put_super+0x15/0x17 [btrfs]
generic_shutdown_super+0x72/0x110
kill_anon_super+0x18/0x30
btrfs_kill_super+0x17/0x30 [btrfs]
deactivate_locked_super+0x3b/0xa0
deactivate_super+0x40/0x50
cleanup_mnt+0x135/0x190
__cleanup_mnt+0x12/0x20
task_work_run+0x64/0xb0
__prepare_exit_to_usermode+0x1bc/0x1c0
__syscall_return_slowpath+0x47/0x230
do_syscall_64+0x64/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
---[ end trace a6cfd45ba80e4e06 ]---
BTRFS error (device dm-3): qgroup reserved space leaked
BTRFS info (device dm-3): disk space caching is enabled
BTRFS info (device dm-3): has skinny extents
[CAUSE]
The qgroup preallocated meta rsv operations of that offending root are:
btrfs_delayed_inode_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=131072
btrfs_delayed_inode_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=131072
btrfs_subvolume_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=49152
btrfs_delayed_inode_release_metadata: convert_meta_prealloc root=1370 num_bytes=-131072
btrfs_delayed_inode_release_metadata: convert_meta_prealloc root=1370 num_bytes=-131072
It's pretty obvious that, we reserve qgroup meta rsv in
btrfs_subvolume_reserve_metadata(), but doesn't have corresponding
release/convert calls in btrfs_subvolume_release_metadata().
This leads to the leakage.
[FIX]
To fix this bug, we should follow what we're doing in
btrfs_delalloc_reserve_metadata(), where we reserve qgroup space, and
add it to block_rsv->qgroup_rsv_reserved.
And free the qgroup reserved metadata space when releasing the
block_rsv.
To do this, we need to change the btrfs_subvolume_release_metadata() to
accept btrfs_root, and record the qgroup_to_release number, and call
btrfs_qgroup_convert_reserved_meta() for it.
Fixes: 733e03a0b26a ("btrfs: qgroup: Split meta rsv type into meta_prealloc and meta_pertrans")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-24 14:46:10 +08:00
|
|
|
void btrfs_subvolume_release_metadata(struct btrfs_root *root,
|
2017-02-11 02:18:18 +08:00
|
|
|
struct btrfs_block_rsv *rsv);
|
btrfs: qgroup: Always free PREALLOC META reserve in btrfs_delalloc_release_extents()
[Background]
Btrfs qgroup uses two types of reserved space for METADATA space,
PERTRANS and PREALLOC.
PERTRANS is metadata space reserved for each transaction started by
btrfs_start_transaction().
While PREALLOC is for delalloc, where we reserve space before joining a
transaction, and finally it will be converted to PERTRANS after the
writeback is done.
[Inconsistency]
However there is inconsistency in how we handle PREALLOC metadata space.
The most obvious one is:
In btrfs_buffered_write():
btrfs_delalloc_release_extents(BTRFS_I(inode), reserve_bytes, true);
We always free qgroup PREALLOC meta space.
While in btrfs_truncate_block():
btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize, (ret != 0));
We only free qgroup PREALLOC meta space when something went wrong.
[The Correct Behavior]
The correct behavior should be the one in btrfs_buffered_write(), we
should always free PREALLOC metadata space.
The reason is, the btrfs_delalloc_* mechanism works by:
- Reserve metadata first, even it's not necessary
In btrfs_delalloc_reserve_metadata()
- Free the unused metadata space
Normally in:
btrfs_delalloc_release_extents()
|- btrfs_inode_rsv_release()
Here we do calculation on whether we should release or not.
E.g. for 64K buffered write, the metadata rsv works like:
/* The first page */
reserve_meta: num_bytes=calc_inode_reservations()
free_meta: num_bytes=0
total: num_bytes=calc_inode_reservations()
/* The first page caused one outstanding extent, thus needs metadata
rsv */
/* The 2nd page */
reserve_meta: num_bytes=calc_inode_reservations()
free_meta: num_bytes=calc_inode_reservations()
total: not changed
/* The 2nd page doesn't cause new outstanding extent, needs no new meta
rsv, so we free what we have reserved */
/* The 3rd~16th pages */
reserve_meta: num_bytes=calc_inode_reservations()
free_meta: num_bytes=calc_inode_reservations()
total: not changed (still space for one outstanding extent)
This means, if btrfs_delalloc_release_extents() determines to free some
space, then those space should be freed NOW.
So for qgroup, we should call btrfs_qgroup_free_meta_prealloc() other
than btrfs_qgroup_convert_reserved_meta().
The good news is:
- The callers are not that hot
The hottest caller is in btrfs_buffered_write(), which is already
fixed by commit 336a8bb8e36a ("btrfs: Fix wrong
btrfs_delalloc_release_extents parameter"). Thus it's not that
easy to cause false EDQUOT.
- The trans commit in advance for qgroup would hide the bug
Since commit f5fef4593653 ("btrfs: qgroup: Make qgroup async transaction
commit more aggressive"), when btrfs qgroup metadata free space is slow,
it will try to commit transaction and free the wrongly converted
PERTRANS space, so it's not that easy to hit such bug.
[FIX]
So to fix the problem, remove the @qgroup_free parameter for
btrfs_delalloc_release_extents(), and always pass true to
btrfs_inode_rsv_release().
Reported-by: Filipe Manana <fdmanana@suse.com>
Fixes: 43b18595d660 ("btrfs: qgroup: Use separate meta reservation type for delalloc")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-10-14 14:34:51 +08:00
|
|
|
void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes);
|
2017-10-20 02:15:55 +08:00
|
|
|
|
2019-11-19 14:45:55 +08:00
|
|
|
int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
|
btrfs: avoid blocking on space revervation when doing nowait dio writes
When doing a NOWAIT direct IO write, if we can NOCOW then it means we can
proceed with the non-blocking, NOWAIT path. However reserving the metadata
space and qgroup meta space can often result in blocking - flushing
delalloc, wait for ordered extents to complete, trigger transaction
commits, etc, going against the semantics of a NOWAIT write.
So make the NOWAIT write path to try to reserve all the metadata it needs
without resulting in a blocking behaviour - if we get -ENOSPC or -EDQUOT
then return -EAGAIN to make the caller fallback to a blocking direct IO
write.
This is part of a patchset comprised of the following patches:
btrfs: avoid blocking on page locks with nowait dio on compressed range
btrfs: avoid blocking nowait dio when locking file range
btrfs: avoid double nocow check when doing nowait dio writes
btrfs: stop allocating a path when checking if cross reference exists
btrfs: free path at can_nocow_extent() before checking for checksum items
btrfs: release path earlier at can_nocow_extent()
btrfs: avoid blocking when allocating context for nowait dio read/write
btrfs: avoid blocking on space revervation when doing nowait dio writes
The following test was run before and after applying this patchset:
$ cat io-uring-nodatacow-test.sh
#!/bin/bash
DEV=/dev/sdc
MNT=/mnt/sdc
MOUNT_OPTIONS="-o ssd -o nodatacow"
MKFS_OPTIONS="-R free-space-tree -O no-holes"
NUM_JOBS=4
FILE_SIZE=8G
RUN_TIME=300
cat <<EOF > /tmp/fio-job.ini
[io_uring_rw]
rw=randrw
fsync=0
fallocate=posix
group_reporting=1
direct=1
ioengine=io_uring
iodepth=64
bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5
filesize=$FILE_SIZE
runtime=$RUN_TIME
time_based
filename=foobar
directory=$MNT
numjobs=$NUM_JOBS
thread
EOF
echo performance | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
umount $MNT &> /dev/null
mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
fio /tmp/fio-job.ini
umount $MNT
The test was run a 12 cores box with 64G of ram, using a non-debug kernel
config (Debian's default config) and a spinning disk.
Result before the patchset:
READ: bw=407MiB/s (427MB/s), 407MiB/s-407MiB/s (427MB/s-427MB/s), io=119GiB (128GB), run=300175-300175msec
WRITE: bw=407MiB/s (427MB/s), 407MiB/s-407MiB/s (427MB/s-427MB/s), io=119GiB (128GB), run=300175-300175msec
Result after the patchset:
READ: bw=436MiB/s (457MB/s), 436MiB/s-436MiB/s (457MB/s-457MB/s), io=128GiB (137GB), run=300044-300044msec
WRITE: bw=435MiB/s (456MB/s), 435MiB/s-435MiB/s (456MB/s-456MB/s), io=128GiB (137GB), run=300044-300044msec
That's about +7.2% throughput for reads and +6.9% for writes.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-24 00:19:30 +08:00
|
|
|
u64 disk_num_bytes, bool noflush);
|
btrfs: fix wrong free space information of btrfs
When we store data by raid profile in btrfs with two or more different size
disks, df command shows there is some free space in the filesystem, but the
user can not write any data in fact, df command shows the wrong free space
information of btrfs.
# mkfs.btrfs -d raid1 /dev/sda9 /dev/sda10
# btrfs-show
Label: none uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64
Total devices 2 FS bytes used 28.00KB
devid 1 size 5.01GB used 2.03GB path /dev/sda9
devid 2 size 10.00GB used 2.01GB path /dev/sda10
# btrfs device scan /dev/sda9 /dev/sda10
# mount /dev/sda9 /mnt
# dd if=/dev/zero of=tmpfile0 bs=4K count=9999999999
(fill the filesystem)
# sync
# df -TH
Filesystem Type Size Used Avail Use% Mounted on
/dev/sda9 btrfs 17G 8.6G 5.4G 62% /mnt
# btrfs-show
Label: none uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64
Total devices 2 FS bytes used 3.99GB
devid 1 size 5.01GB used 5.01GB path /dev/sda9
devid 2 size 10.00GB used 4.99GB path /dev/sda10
It is because btrfs cannot allocate chunks when one of the pairing disks has
no space, the free space on the other disks can not be used for ever, and should
be subtracted from the total space, but btrfs doesn't subtract this space from
the total. It is strange to the user.
This patch fixes it by calcing the free space that can be used to allocate
chunks.
Implementation:
1. get all the devices free space, and align them by stripe length.
2. sort the devices by the free space.
3. check the free space of the devices,
3.1. if it is not zero, and then check the number of the devices that has
more free space than this device,
if the number of the devices is beyond the min stripe number, the free
space can be used, and add into total free space.
if the number of the devices is below the min stripe number, we can not
use the free space, the check ends.
3.2. if the free space is zero, check the next devices, goto 3.1
This implementation is just likely fake chunk allocation.
After appling this patch, df can show correct space information:
# df -TH
Filesystem Type Size Used Avail Use% Mounted on
/dev/sda9 btrfs 17G 8.6G 0 100% /mnt
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-05 18:07:31 +08:00
|
|
|
u64 btrfs_account_ro_block_groups_free_space(struct btrfs_space_info *sinfo);
|
2016-06-23 06:54:24 +08:00
|
|
|
int btrfs_error_unpin_extent_range(struct btrfs_fs_info *fs_info,
|
2011-01-06 19:30:25 +08:00
|
|
|
u64 start, u64 end);
|
2016-06-23 06:54:24 +08:00
|
|
|
int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
|
2014-12-08 22:01:12 +08:00
|
|
|
u64 num_bytes, u64 *actual_bytes);
|
2016-06-23 06:54:24 +08:00
|
|
|
int btrfs_trim_fs(struct btrfs_fs_info *fs_info, struct fstrim_range *range);
|
2011-01-06 19:30:25 +08:00
|
|
|
|
2011-03-07 10:13:14 +08:00
|
|
|
int btrfs_init_space_info(struct btrfs_fs_info *fs_info);
|
2012-06-29 00:03:02 +08:00
|
|
|
int btrfs_delayed_refs_qgroup_accounting(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_fs_info *fs_info);
|
2017-06-22 08:19:11 +08:00
|
|
|
int btrfs_start_write_no_snapshotting(struct btrfs_root *root);
|
|
|
|
void btrfs_end_write_no_snapshotting(struct btrfs_root *root);
|
2016-01-06 18:56:36 +08:00
|
|
|
void btrfs_wait_for_snapshot_creation(struct btrfs_root *root);
|
2015-09-30 11:50:35 +08:00
|
|
|
|
2007-03-27 04:00:06 +08:00
|
|
|
/* ctree.c */
|
2022-09-14 23:06:38 +08:00
|
|
|
int __init btrfs_ctree_init(void);
|
|
|
|
void __cold btrfs_ctree_exit(void);
|
2017-01-18 15:24:37 +08:00
|
|
|
int btrfs_bin_search(struct extent_buffer *eb, const struct btrfs_key *key,
|
2020-04-17 15:08:21 +08:00
|
|
|
int *slot);
|
2019-10-02 01:57:39 +08:00
|
|
|
int __pure btrfs_comp_cpu_keys(const struct btrfs_key *k1, const struct btrfs_key *k2);
|
2008-03-25 03:01:56 +08:00
|
|
|
int btrfs_previous_item(struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path, u64 min_objectid,
|
|
|
|
int type);
|
2014-01-12 21:38:33 +08:00
|
|
|
int btrfs_previous_extent_item(struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path, u64 min_objectid);
|
2014-11-12 12:43:09 +08:00
|
|
|
void btrfs_set_item_key_safe(struct btrfs_fs_info *fs_info,
|
|
|
|
struct btrfs_path *path,
|
2017-01-18 15:24:37 +08:00
|
|
|
const struct btrfs_key *new_key);
|
2008-06-26 04:01:30 +08:00
|
|
|
struct extent_buffer *btrfs_root_node(struct btrfs_root *root);
|
2008-06-26 04:01:31 +08:00
|
|
|
int btrfs_find_next_key(struct btrfs_root *root, struct btrfs_path *path,
|
2008-06-26 04:01:31 +08:00
|
|
|
struct btrfs_key *key, int lowest_level,
|
2013-02-01 02:21:12 +08:00
|
|
|
u64 min_trans);
|
2008-06-26 04:01:31 +08:00
|
|
|
int btrfs_search_forward(struct btrfs_root *root, struct btrfs_key *min_key,
|
2013-02-01 02:21:12 +08:00
|
|
|
struct btrfs_path *path,
|
2008-06-26 04:01:31 +08:00
|
|
|
u64 min_trans);
|
2019-08-22 01:16:27 +08:00
|
|
|
struct extent_buffer *btrfs_read_node_slot(struct extent_buffer *parent,
|
|
|
|
int slot);
|
|
|
|
|
2007-10-16 04:14:19 +08:00
|
|
|
int btrfs_cow_block(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root, struct extent_buffer *buf,
|
|
|
|
struct extent_buffer *parent, int parent_slot,
|
2020-08-20 23:46:03 +08:00
|
|
|
struct extent_buffer **cow_ret,
|
|
|
|
enum btrfs_lock_nesting nest);
|
2007-12-18 09:14:01 +08:00
|
|
|
int btrfs_copy_root(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct extent_buffer *buf,
|
|
|
|
struct extent_buffer **cow_ret, u64 new_root_objectid);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
int btrfs_block_can_be_shared(struct btrfs_root *root,
|
|
|
|
struct extent_buffer *buf);
|
2019-03-20 21:51:10 +08:00
|
|
|
void btrfs_extend_item(struct btrfs_path *path, u32 data_size);
|
2019-03-20 21:49:12 +08:00
|
|
|
void btrfs_truncate_item(struct btrfs_path *path, u32 new_size, int from_end);
|
2008-12-10 22:10:46 +08:00
|
|
|
int btrfs_split_item(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
2017-01-18 15:24:37 +08:00
|
|
|
const struct btrfs_key *new_key,
|
2008-12-10 22:10:46 +08:00
|
|
|
unsigned long split_offset);
|
2009-11-12 17:33:58 +08:00
|
|
|
int btrfs_duplicate_item(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
2017-01-18 15:24:37 +08:00
|
|
|
const struct btrfs_key *new_key);
|
2013-11-05 11:33:33 +08:00
|
|
|
int btrfs_find_item(struct btrfs_root *fs_root, struct btrfs_path *path,
|
|
|
|
u64 inum, u64 ioff, u8 key_type, struct btrfs_key *found_key);
|
2017-01-18 15:24:37 +08:00
|
|
|
int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
|
|
|
|
const struct btrfs_key *key, struct btrfs_path *p,
|
|
|
|
int ins_len, int cow);
|
|
|
|
int btrfs_search_old_slot(struct btrfs_root *root, const struct btrfs_key *key,
|
2012-05-17 00:25:47 +08:00
|
|
|
struct btrfs_path *p, u64 time_seq);
|
2011-09-13 17:18:10 +08:00
|
|
|
int btrfs_search_slot_for_read(struct btrfs_root *root,
|
2017-01-18 15:24:37 +08:00
|
|
|
const struct btrfs_key *key,
|
|
|
|
struct btrfs_path *p, int find_higher,
|
|
|
|
int return_any);
|
2007-08-08 04:15:09 +08:00
|
|
|
int btrfs_realloc_node(struct btrfs_trans_handle *trans,
|
2007-10-16 04:14:19 +08:00
|
|
|
struct btrfs_root *root, struct extent_buffer *parent,
|
2013-02-01 02:21:12 +08:00
|
|
|
int start_slot, u64 *last_ret,
|
2007-10-16 04:22:39 +08:00
|
|
|
struct btrfs_key *progress);
|
2011-04-21 07:20:15 +08:00
|
|
|
void btrfs_release_path(struct btrfs_path *p);
|
2007-04-02 22:50:19 +08:00
|
|
|
struct btrfs_path *btrfs_alloc_path(void);
|
|
|
|
void btrfs_free_path(struct btrfs_path *p);
|
Btrfs: Change btree locking to use explicit blocking points
Most of the btrfs metadata operations can be protected by a spinlock,
but some operations still need to schedule.
So far, btrfs has been using a mutex along with a trylock loop,
most of the time it is able to avoid going for the full mutex, so
the trylock loop is a big performance gain.
This commit is step one for getting rid of the blocking locks entirely.
btrfs_tree_lock takes a spinlock, and the code explicitly switches
to a blocking lock when it starts an operation that can schedule.
We'll be able get rid of the blocking locks in smaller pieces over time.
Tracing allows us to find the most common cause of blocking, so we
can start with the hot spots first.
The basic idea is:
btrfs_tree_lock() returns with the spin lock held
btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
the extent buffer flags, and then drops the spin lock. The buffer is
still considered locked by all of the btrfs code.
If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
the spin lock and waits on a wait queue for the blocking bit to go away.
Much of the code that needs to set the blocking bit finishes without actually
blocking a good percentage of the time. So, an adaptive spin is still
used against the blocking bit to avoid very high context switch rates.
btrfs_clear_lock_blocking() clears the blocking bit and returns
with the spinlock held again.
btrfs_tree_unlock() can be called on either blocking or spinning locks,
it does the right thing based on the blocking bit.
ctree.c has a helper function to set/clear all the locked buffers in a
path as blocking.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 22:25:08 +08:00
|
|
|
|
2008-01-30 04:11:36 +08:00
|
|
|
int btrfs_del_items(struct btrfs_trans_handle *trans, struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path, int slot, int nr);
|
|
|
|
static inline int btrfs_del_item(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path)
|
|
|
|
{
|
|
|
|
return btrfs_del_items(trans, root, path, path->slots[0], 1);
|
|
|
|
}
|
|
|
|
|
btrfs: loop only once over data sizes array when inserting an item batch
When inserting a batch of items into a btree, we end up looping over the
data sizes array 3 times:
1) Once in the caller of btrfs_insert_empty_items(), when it populates the
array with the data sizes for each item;
2) Once at btrfs_insert_empty_items() to sum the elements of the data
sizes array and compute the total data size;
3) And then once again at setup_items_for_insert(), where we do exactly
the same as what we do at btrfs_insert_empty_items(), to compute the
total data size.
That is not bad for small arrays, but when the arrays have hundreds of
elements, the time spent on looping is not negligible. For example when
doing batch inserts of delayed items for dir index items or when logging
a directory, it's common to have 200 to 260 dir index items in a single
batch when using a leaf size of 16K and using file names between 8 and 12
characters. For a 64K leaf size, multiply that by 4. Taking into account
that during directory logging or when flushing delayed dir index items we
can have many of those large batches, the time spent on the looping adds
up quickly.
It's also more important to avoid it at setup_items_for_insert(), since
we are holding a write lock on a leaf and, in some cases, on upper nodes
of the btree, which causes us to block other tasks that want to access
the leaf and nodes for longer than necessary.
So change the code so that setup_items_for_insert() and
btrfs_insert_empty_items() no longer compute the total data size, and
instead rely on the caller to supply it. This makes us loop over the
array only once, where we can both populate the data size array and
compute the total data size, taking advantage of spatial and temporal
locality. To make this more manageable, use a structure to contain
all the relevant details for a batch of items (keys array, data sizes
array, total data size, number of items), and use it as an argument
for btrfs_insert_empty_items() and setup_items_for_insert().
This patch is part of a small patchset that is comprised of the following
patches:
btrfs: loop only once over data sizes array when inserting an item batch
btrfs: unexport setup_items_for_insert()
btrfs: use single bulk copy operations when logging directories
This is patch 1/3 and performance results, and the specific tests, are
included in the changelog of patch 3/3.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-09-24 19:28:13 +08:00
|
|
|
/*
|
|
|
|
* Describes a batch of items to insert in a btree. This is used by
|
2021-09-24 19:28:14 +08:00
|
|
|
* btrfs_insert_empty_items().
|
btrfs: loop only once over data sizes array when inserting an item batch
When inserting a batch of items into a btree, we end up looping over the
data sizes array 3 times:
1) Once in the caller of btrfs_insert_empty_items(), when it populates the
array with the data sizes for each item;
2) Once at btrfs_insert_empty_items() to sum the elements of the data
sizes array and compute the total data size;
3) And then once again at setup_items_for_insert(), where we do exactly
the same as what we do at btrfs_insert_empty_items(), to compute the
total data size.
That is not bad for small arrays, but when the arrays have hundreds of
elements, the time spent on looping is not negligible. For example when
doing batch inserts of delayed items for dir index items or when logging
a directory, it's common to have 200 to 260 dir index items in a single
batch when using a leaf size of 16K and using file names between 8 and 12
characters. For a 64K leaf size, multiply that by 4. Taking into account
that during directory logging or when flushing delayed dir index items we
can have many of those large batches, the time spent on the looping adds
up quickly.
It's also more important to avoid it at setup_items_for_insert(), since
we are holding a write lock on a leaf and, in some cases, on upper nodes
of the btree, which causes us to block other tasks that want to access
the leaf and nodes for longer than necessary.
So change the code so that setup_items_for_insert() and
btrfs_insert_empty_items() no longer compute the total data size, and
instead rely on the caller to supply it. This makes us loop over the
array only once, where we can both populate the data size array and
compute the total data size, taking advantage of spatial and temporal
locality. To make this more manageable, use a structure to contain
all the relevant details for a batch of items (keys array, data sizes
array, total data size, number of items), and use it as an argument
for btrfs_insert_empty_items() and setup_items_for_insert().
This patch is part of a small patchset that is comprised of the following
patches:
btrfs: loop only once over data sizes array when inserting an item batch
btrfs: unexport setup_items_for_insert()
btrfs: use single bulk copy operations when logging directories
This is patch 1/3 and performance results, and the specific tests, are
included in the changelog of patch 3/3.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-09-24 19:28:13 +08:00
|
|
|
*/
|
|
|
|
struct btrfs_item_batch {
|
|
|
|
/*
|
|
|
|
* Pointer to an array containing the keys of the items to insert (in
|
|
|
|
* sorted order).
|
|
|
|
*/
|
|
|
|
const struct btrfs_key *keys;
|
|
|
|
/* Pointer to an array containing the data size for each item to insert. */
|
|
|
|
const u32 *data_sizes;
|
|
|
|
/*
|
|
|
|
* The sum of data sizes for all items. The caller can compute this while
|
|
|
|
* setting up the data_sizes array, so it ends up being more efficient
|
|
|
|
* than having btrfs_insert_empty_items() or setup_item_for_insert()
|
|
|
|
* doing it, as it would avoid an extra loop over a potentially large
|
|
|
|
* array, and in the case of setup_item_for_insert(), we would be doing
|
|
|
|
* it while holding a write lock on a leaf and often on upper level nodes
|
|
|
|
* too, unnecessarily increasing the size of a critical section.
|
|
|
|
*/
|
|
|
|
u32 total_data_size;
|
|
|
|
/* Size of the keys and data_sizes arrays (number of items in the batch). */
|
|
|
|
int nr;
|
|
|
|
};
|
|
|
|
|
2021-09-24 19:28:14 +08:00
|
|
|
void btrfs_setup_item_for_insert(struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
const struct btrfs_key *key,
|
|
|
|
u32 data_size);
|
2017-01-18 15:24:37 +08:00
|
|
|
int btrfs_insert_item(struct btrfs_trans_handle *trans, struct btrfs_root *root,
|
|
|
|
const struct btrfs_key *key, void *data, u32 data_size);
|
2008-01-30 04:15:18 +08:00
|
|
|
int btrfs_insert_empty_items(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
btrfs: loop only once over data sizes array when inserting an item batch
When inserting a batch of items into a btree, we end up looping over the
data sizes array 3 times:
1) Once in the caller of btrfs_insert_empty_items(), when it populates the
array with the data sizes for each item;
2) Once at btrfs_insert_empty_items() to sum the elements of the data
sizes array and compute the total data size;
3) And then once again at setup_items_for_insert(), where we do exactly
the same as what we do at btrfs_insert_empty_items(), to compute the
total data size.
That is not bad for small arrays, but when the arrays have hundreds of
elements, the time spent on looping is not negligible. For example when
doing batch inserts of delayed items for dir index items or when logging
a directory, it's common to have 200 to 260 dir index items in a single
batch when using a leaf size of 16K and using file names between 8 and 12
characters. For a 64K leaf size, multiply that by 4. Taking into account
that during directory logging or when flushing delayed dir index items we
can have many of those large batches, the time spent on the looping adds
up quickly.
It's also more important to avoid it at setup_items_for_insert(), since
we are holding a write lock on a leaf and, in some cases, on upper nodes
of the btree, which causes us to block other tasks that want to access
the leaf and nodes for longer than necessary.
So change the code so that setup_items_for_insert() and
btrfs_insert_empty_items() no longer compute the total data size, and
instead rely on the caller to supply it. This makes us loop over the
array only once, where we can both populate the data size array and
compute the total data size, taking advantage of spatial and temporal
locality. To make this more manageable, use a structure to contain
all the relevant details for a batch of items (keys array, data sizes
array, total data size, number of items), and use it as an argument
for btrfs_insert_empty_items() and setup_items_for_insert().
This patch is part of a small patchset that is comprised of the following
patches:
btrfs: loop only once over data sizes array when inserting an item batch
btrfs: unexport setup_items_for_insert()
btrfs: use single bulk copy operations when logging directories
This is patch 1/3 and performance results, and the specific tests, are
included in the changelog of patch 3/3.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-09-24 19:28:13 +08:00
|
|
|
const struct btrfs_item_batch *batch);
|
2008-01-30 04:15:18 +08:00
|
|
|
|
|
|
|
static inline int btrfs_insert_empty_item(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
2017-01-18 15:24:37 +08:00
|
|
|
const struct btrfs_key *key,
|
2008-01-30 04:15:18 +08:00
|
|
|
u32 data_size)
|
|
|
|
{
|
btrfs: loop only once over data sizes array when inserting an item batch
When inserting a batch of items into a btree, we end up looping over the
data sizes array 3 times:
1) Once in the caller of btrfs_insert_empty_items(), when it populates the
array with the data sizes for each item;
2) Once at btrfs_insert_empty_items() to sum the elements of the data
sizes array and compute the total data size;
3) And then once again at setup_items_for_insert(), where we do exactly
the same as what we do at btrfs_insert_empty_items(), to compute the
total data size.
That is not bad for small arrays, but when the arrays have hundreds of
elements, the time spent on looping is not negligible. For example when
doing batch inserts of delayed items for dir index items or when logging
a directory, it's common to have 200 to 260 dir index items in a single
batch when using a leaf size of 16K and using file names between 8 and 12
characters. For a 64K leaf size, multiply that by 4. Taking into account
that during directory logging or when flushing delayed dir index items we
can have many of those large batches, the time spent on the looping adds
up quickly.
It's also more important to avoid it at setup_items_for_insert(), since
we are holding a write lock on a leaf and, in some cases, on upper nodes
of the btree, which causes us to block other tasks that want to access
the leaf and nodes for longer than necessary.
So change the code so that setup_items_for_insert() and
btrfs_insert_empty_items() no longer compute the total data size, and
instead rely on the caller to supply it. This makes us loop over the
array only once, where we can both populate the data size array and
compute the total data size, taking advantage of spatial and temporal
locality. To make this more manageable, use a structure to contain
all the relevant details for a batch of items (keys array, data sizes
array, total data size, number of items), and use it as an argument
for btrfs_insert_empty_items() and setup_items_for_insert().
This patch is part of a small patchset that is comprised of the following
patches:
btrfs: loop only once over data sizes array when inserting an item batch
btrfs: unexport setup_items_for_insert()
btrfs: use single bulk copy operations when logging directories
This is patch 1/3 and performance results, and the specific tests, are
included in the changelog of patch 3/3.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-09-24 19:28:13 +08:00
|
|
|
struct btrfs_item_batch batch;
|
|
|
|
|
|
|
|
batch.keys = key;
|
|
|
|
batch.data_sizes = &data_size;
|
|
|
|
batch.total_data_size = data_size;
|
|
|
|
batch.nr = 1;
|
|
|
|
|
|
|
|
return btrfs_insert_empty_items(trans, root, path, &batch);
|
2008-01-30 04:15:18 +08:00
|
|
|
}
|
|
|
|
|
2013-10-23 00:18:51 +08:00
|
|
|
int btrfs_prev_leaf(struct btrfs_root *root, struct btrfs_path *path);
|
2012-06-11 14:29:29 +08:00
|
|
|
int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path,
|
|
|
|
u64 time_seq);
|
2021-07-29 16:22:16 +08:00
|
|
|
|
|
|
|
int btrfs_search_backwards(struct btrfs_root *root, struct btrfs_key *key,
|
|
|
|
struct btrfs_path *path);
|
|
|
|
|
2022-03-09 21:50:38 +08:00
|
|
|
int btrfs_get_next_valid_item(struct btrfs_root *root, struct btrfs_key *key,
|
|
|
|
struct btrfs_path *path);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Search in @root for a given @key, and store the slot found in @found_key.
|
|
|
|
*
|
|
|
|
* @root: The root node of the tree.
|
|
|
|
* @key: The key we are looking for.
|
|
|
|
* @found_key: Will hold the found item.
|
|
|
|
* @path: Holds the current slot/leaf.
|
|
|
|
* @iter_ret: Contains the value returned from btrfs_search_slot or
|
|
|
|
* btrfs_get_next_valid_item, whichever was executed last.
|
|
|
|
*
|
|
|
|
* The @iter_ret is an output variable that will contain the return value of
|
|
|
|
* btrfs_search_slot, if it encountered an error, or the value returned from
|
|
|
|
* btrfs_get_next_valid_item otherwise. That return value can be 0, if a valid
|
|
|
|
* slot was found, 1 if there were no more leaves, and <0 if there was an error.
|
|
|
|
*
|
|
|
|
* It's recommended to use a separate variable for iter_ret and then use it to
|
|
|
|
* set the function return value so there's no confusion of the 0/1/errno
|
|
|
|
* values stemming from btrfs_search_slot.
|
|
|
|
*/
|
|
|
|
#define btrfs_for_each_slot(root, key, found_key, path, iter_ret) \
|
|
|
|
for (iter_ret = btrfs_search_slot(NULL, (root), (key), (path), 0, 0); \
|
|
|
|
(iter_ret) >= 0 && \
|
|
|
|
(iter_ret = btrfs_get_next_valid_item((root), (found_key), (path))) == 0; \
|
|
|
|
(path)->slots[0]++ \
|
|
|
|
)
|
|
|
|
|
2022-09-14 23:06:40 +08:00
|
|
|
int btrfs_next_old_item(struct btrfs_root *root, struct btrfs_path *path, u64 time_seq);
|
2021-07-26 20:15:12 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Search the tree again to find a leaf with greater keys.
|
|
|
|
*
|
|
|
|
* Returns 0 if it found something or 1 if there are no greater leaves.
|
|
|
|
* Returns < 0 on error.
|
|
|
|
*/
|
|
|
|
static inline int btrfs_next_leaf(struct btrfs_root *root, struct btrfs_path *path)
|
|
|
|
{
|
|
|
|
return btrfs_next_old_leaf(root, path, 0);
|
|
|
|
}
|
|
|
|
|
2012-06-19 21:42:25 +08:00
|
|
|
static inline int btrfs_next_item(struct btrfs_root *root, struct btrfs_path *p)
|
|
|
|
{
|
|
|
|
return btrfs_next_old_item(root, p, 0);
|
|
|
|
}
|
2019-03-20 21:36:46 +08:00
|
|
|
int btrfs_leaf_free_space(struct extent_buffer *leaf);
|
2020-03-10 17:43:51 +08:00
|
|
|
int __must_check btrfs_drop_snapshot(struct btrfs_root *root, int update_ref,
|
|
|
|
int for_reloc);
|
2008-10-30 02:49:05 +08:00
|
|
|
int btrfs_drop_subtree(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct extent_buffer *node,
|
|
|
|
struct extent_buffer *parent);
|
2013-05-14 18:20:43 +08:00
|
|
|
|
2007-03-27 04:00:06 +08:00
|
|
|
/* root-item.c */
|
2018-08-01 11:32:29 +08:00
|
|
|
int btrfs_add_root_ref(struct btrfs_trans_handle *trans, u64 root_id,
|
2022-10-21 00:58:25 +08:00
|
|
|
u64 ref_id, u64 dirid, u64 sequence,
|
2022-10-21 00:58:27 +08:00
|
|
|
const struct fscrypt_str *name);
|
2018-08-01 11:32:28 +08:00
|
|
|
int btrfs_del_root_ref(struct btrfs_trans_handle *trans, u64 root_id,
|
2022-10-21 00:58:25 +08:00
|
|
|
u64 ref_id, u64 dirid, u64 *sequence,
|
2022-10-21 00:58:27 +08:00
|
|
|
const struct fscrypt_str *name);
|
2017-08-17 22:25:11 +08:00
|
|
|
int btrfs_del_root(struct btrfs_trans_handle *trans,
|
2018-08-01 11:32:27 +08:00
|
|
|
const struct btrfs_key *key);
|
2017-01-18 15:24:37 +08:00
|
|
|
int btrfs_insert_root(struct btrfs_trans_handle *trans, struct btrfs_root *root,
|
|
|
|
const struct btrfs_key *key,
|
|
|
|
struct btrfs_root_item *item);
|
2011-10-04 11:22:44 +08:00
|
|
|
int __must_check btrfs_update_root(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_key *key,
|
|
|
|
struct btrfs_root_item *item);
|
2017-01-18 15:24:37 +08:00
|
|
|
int btrfs_find_root(struct btrfs_root *root, const struct btrfs_key *search_key,
|
2013-05-15 15:48:19 +08:00
|
|
|
struct btrfs_path *path, struct btrfs_root_item *root_item,
|
|
|
|
struct btrfs_key *root_key);
|
2016-06-22 09:16:51 +08:00
|
|
|
int btrfs_find_orphan_roots(struct btrfs_fs_info *fs_info);
|
2011-07-15 05:23:06 +08:00
|
|
|
void btrfs_set_root_node(struct btrfs_root_item *item,
|
|
|
|
struct extent_buffer *node);
|
2011-03-28 10:01:25 +08:00
|
|
|
void btrfs_check_and_init_root_item(struct btrfs_root_item *item);
|
2012-07-25 23:35:53 +08:00
|
|
|
void btrfs_update_root_times(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root);
|
2011-03-28 10:01:25 +08:00
|
|
|
|
Btrfs: introduce a tree for items that map UUIDs to something
Mapping UUIDs to subvolume IDs is an operation with a high effort
today. Today, the algorithm even has quadratic effort (based on the
number of existing subvolumes), which means, that it takes minutes
to send/receive a single subvolume if 10,000 subvolumes exist. But
even linear effort would be too much since it is a waste. And these
data structures to allow mapping UUIDs to subvolume IDs are created
every time a btrfs send/receive instance is started.
It is much more efficient to maintain a searchable persistent data
structure in the filesystem, one that is updated whenever a
subvolume/snapshot is created and deleted, and when the received
subvolume UUID is set by the btrfs-receive tool.
Therefore kernel code is added with this commit that is able to
maintain data structures in the filesystem that allow to quickly
search for a given UUID and to retrieve data that is assigned to
this UUID, like which subvolume ID is related to this UUID.
This commit adds a new tree to hold UUID-to-data mapping items. The
key of the items is the full UUID plus the key type BTRFS_UUID_KEY.
Multiple data blocks can be stored for a given UUID, a type/length/
value scheme is used.
Now follows the lengthy justification, why a new tree was added
instead of using the existing root tree:
The first approach was to not create another tree that holds UUID
items. Instead, the items should just go into the top root tree.
Unfortunately this confused the algorithm to assign the objectid
of subvolumes and snapshots. The reason is that
btrfs_find_free_objectid() calls btrfs_find_highest_objectid() for
the first created subvol or snapshot after mounting a filesystem,
and this function simply searches for the largest used objectid in
the root tree keys to pick the next objectid to assign. Of course,
the UUID keys have always been the ones with the highest offset
value, and the next assigned subvol ID was wastefully huge.
To use any other existing tree did not look proper. To apply a
workaround such as setting the objectid to zero in the UUID item
key and to implement collision handling would either add
limitations (in case of a btrfs_extend_item() approach to handle
the collisions) or a lot of complexity and source code (in case a
key would be looked up that is free of collisions). Adding new code
that introduces limitations is not good, and adding code that is
complex and lengthy for no good reason is also not good. That's the
justification why a completely new tree was introduced.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-15 23:11:17 +08:00
|
|
|
/* uuid-tree.c */
|
2018-05-29 15:01:53 +08:00
|
|
|
int btrfs_uuid_tree_add(struct btrfs_trans_handle *trans, u8 *uuid, u8 type,
|
Btrfs: introduce a tree for items that map UUIDs to something
Mapping UUIDs to subvolume IDs is an operation with a high effort
today. Today, the algorithm even has quadratic effort (based on the
number of existing subvolumes), which means, that it takes minutes
to send/receive a single subvolume if 10,000 subvolumes exist. But
even linear effort would be too much since it is a waste. And these
data structures to allow mapping UUIDs to subvolume IDs are created
every time a btrfs send/receive instance is started.
It is much more efficient to maintain a searchable persistent data
structure in the filesystem, one that is updated whenever a
subvolume/snapshot is created and deleted, and when the received
subvolume UUID is set by the btrfs-receive tool.
Therefore kernel code is added with this commit that is able to
maintain data structures in the filesystem that allow to quickly
search for a given UUID and to retrieve data that is assigned to
this UUID, like which subvolume ID is related to this UUID.
This commit adds a new tree to hold UUID-to-data mapping items. The
key of the items is the full UUID plus the key type BTRFS_UUID_KEY.
Multiple data blocks can be stored for a given UUID, a type/length/
value scheme is used.
Now follows the lengthy justification, why a new tree was added
instead of using the existing root tree:
The first approach was to not create another tree that holds UUID
items. Instead, the items should just go into the top root tree.
Unfortunately this confused the algorithm to assign the objectid
of subvolumes and snapshots. The reason is that
btrfs_find_free_objectid() calls btrfs_find_highest_objectid() for
the first created subvol or snapshot after mounting a filesystem,
and this function simply searches for the largest used objectid in
the root tree keys to pick the next objectid to assign. Of course,
the UUID keys have always been the ones with the highest offset
value, and the next assigned subvol ID was wastefully huge.
To use any other existing tree did not look proper. To apply a
workaround such as setting the objectid to zero in the UUID item
key and to implement collision handling would either add
limitations (in case of a btrfs_extend_item() approach to handle
the collisions) or a lot of complexity and source code (in case a
key would be looked up that is free of collisions). Adding new code
that introduces limitations is not good, and adding code that is
complex and lengthy for no good reason is also not good. That's the
justification why a completely new tree was introduced.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-15 23:11:17 +08:00
|
|
|
u64 subid);
|
2018-05-29 15:01:54 +08:00
|
|
|
int btrfs_uuid_tree_remove(struct btrfs_trans_handle *trans, u8 *uuid, u8 type,
|
Btrfs: introduce a tree for items that map UUIDs to something
Mapping UUIDs to subvolume IDs is an operation with a high effort
today. Today, the algorithm even has quadratic effort (based on the
number of existing subvolumes), which means, that it takes minutes
to send/receive a single subvolume if 10,000 subvolumes exist. But
even linear effort would be too much since it is a waste. And these
data structures to allow mapping UUIDs to subvolume IDs are created
every time a btrfs send/receive instance is started.
It is much more efficient to maintain a searchable persistent data
structure in the filesystem, one that is updated whenever a
subvolume/snapshot is created and deleted, and when the received
subvolume UUID is set by the btrfs-receive tool.
Therefore kernel code is added with this commit that is able to
maintain data structures in the filesystem that allow to quickly
search for a given UUID and to retrieve data that is assigned to
this UUID, like which subvolume ID is related to this UUID.
This commit adds a new tree to hold UUID-to-data mapping items. The
key of the items is the full UUID plus the key type BTRFS_UUID_KEY.
Multiple data blocks can be stored for a given UUID, a type/length/
value scheme is used.
Now follows the lengthy justification, why a new tree was added
instead of using the existing root tree:
The first approach was to not create another tree that holds UUID
items. Instead, the items should just go into the top root tree.
Unfortunately this confused the algorithm to assign the objectid
of subvolumes and snapshots. The reason is that
btrfs_find_free_objectid() calls btrfs_find_highest_objectid() for
the first created subvol or snapshot after mounting a filesystem,
and this function simply searches for the largest used objectid in
the root tree keys to pick the next objectid to assign. Of course,
the UUID keys have always been the ones with the highest offset
value, and the next assigned subvol ID was wastefully huge.
To use any other existing tree did not look proper. To apply a
workaround such as setting the objectid to zero in the UUID item
key and to implement collision handling would either add
limitations (in case of a btrfs_extend_item() approach to handle
the collisions) or a lot of complexity and source code (in case a
key would be looked up that is free of collisions). Adding new code
that introduces limitations is not good, and adding code that is
complex and lengthy for no good reason is also not good. That's the
justification why a completely new tree was introduced.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-15 23:11:17 +08:00
|
|
|
u64 subid);
|
2020-02-18 22:56:07 +08:00
|
|
|
int btrfs_uuid_tree_iterate(struct btrfs_fs_info *fs_info);
|
Btrfs: introduce a tree for items that map UUIDs to something
Mapping UUIDs to subvolume IDs is an operation with a high effort
today. Today, the algorithm even has quadratic effort (based on the
number of existing subvolumes), which means, that it takes minutes
to send/receive a single subvolume if 10,000 subvolumes exist. But
even linear effort would be too much since it is a waste. And these
data structures to allow mapping UUIDs to subvolume IDs are created
every time a btrfs send/receive instance is started.
It is much more efficient to maintain a searchable persistent data
structure in the filesystem, one that is updated whenever a
subvolume/snapshot is created and deleted, and when the received
subvolume UUID is set by the btrfs-receive tool.
Therefore kernel code is added with this commit that is able to
maintain data structures in the filesystem that allow to quickly
search for a given UUID and to retrieve data that is assigned to
this UUID, like which subvolume ID is related to this UUID.
This commit adds a new tree to hold UUID-to-data mapping items. The
key of the items is the full UUID plus the key type BTRFS_UUID_KEY.
Multiple data blocks can be stored for a given UUID, a type/length/
value scheme is used.
Now follows the lengthy justification, why a new tree was added
instead of using the existing root tree:
The first approach was to not create another tree that holds UUID
items. Instead, the items should just go into the top root tree.
Unfortunately this confused the algorithm to assign the objectid
of subvolumes and snapshots. The reason is that
btrfs_find_free_objectid() calls btrfs_find_highest_objectid() for
the first created subvol or snapshot after mounting a filesystem,
and this function simply searches for the largest used objectid in
the root tree keys to pick the next objectid to assign. Of course,
the UUID keys have always been the ones with the highest offset
value, and the next assigned subvol ID was wastefully huge.
To use any other existing tree did not look proper. To apply a
workaround such as setting the objectid to zero in the UUID item
key and to implement collision handling would either add
limitations (in case of a btrfs_extend_item() approach to handle
the collisions) or a lot of complexity and source code (in case a
key would be looked up that is free of collisions). Adding new code
that introduces limitations is not good, and adding code that is
complex and lengthy for no good reason is also not good. That's the
justification why a completely new tree was introduced.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-15 23:11:17 +08:00
|
|
|
|
2007-03-27 04:00:06 +08:00
|
|
|
/* dir-item.c */
|
2012-12-18 03:26:57 +08:00
|
|
|
int btrfs_check_dir_item_collision(struct btrfs_root *root, u64 dir,
|
2022-10-21 00:58:27 +08:00
|
|
|
const struct fscrypt_str *name);
|
2022-10-21 00:58:25 +08:00
|
|
|
int btrfs_insert_dir_item(struct btrfs_trans_handle *trans,
|
2022-10-21 00:58:27 +08:00
|
|
|
const struct fscrypt_str *name, struct btrfs_inode *dir,
|
2008-07-25 00:12:38 +08:00
|
|
|
struct btrfs_key *location, u8 type, u64 index);
|
2007-04-20 03:36:27 +08:00
|
|
|
struct btrfs_dir_item *btrfs_lookup_dir_item(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path, u64 dir,
|
2022-10-21 00:58:27 +08:00
|
|
|
const struct fscrypt_str *name, int mod);
|
2007-04-20 03:36:27 +08:00
|
|
|
struct btrfs_dir_item *
|
|
|
|
btrfs_lookup_dir_index_item(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path, u64 dir,
|
2022-10-21 00:58:27 +08:00
|
|
|
u64 index, const struct fscrypt_str *name, int mod);
|
2009-09-22 03:56:00 +08:00
|
|
|
struct btrfs_dir_item *
|
|
|
|
btrfs_search_dir_index_item(struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path, u64 dirid,
|
2022-10-21 00:58:27 +08:00
|
|
|
const struct fscrypt_str *name);
|
2007-04-20 03:36:27 +08:00
|
|
|
int btrfs_delete_one_dir_name(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
struct btrfs_dir_item *di);
|
2007-11-17 00:45:54 +08:00
|
|
|
int btrfs_insert_xattr_item(struct btrfs_trans_handle *trans,
|
2009-11-12 17:35:27 +08:00
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path, u64 objectid,
|
|
|
|
const char *name, u16 name_len,
|
|
|
|
const void *data, u16 data_len);
|
2007-11-17 00:45:54 +08:00
|
|
|
struct btrfs_dir_item *btrfs_lookup_xattr(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path, u64 dir,
|
|
|
|
const char *name, u16 name_len,
|
|
|
|
int mod);
|
2016-06-23 06:54:24 +08:00
|
|
|
struct btrfs_dir_item *btrfs_match_dir_item_name(struct btrfs_fs_info *fs_info,
|
2014-11-09 16:38:39 +08:00
|
|
|
struct btrfs_path *path,
|
|
|
|
const char *name,
|
|
|
|
int name_len);
|
2008-07-25 00:17:14 +08:00
|
|
|
|
|
|
|
/* orphan.c */
|
|
|
|
int btrfs_insert_orphan_item(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root, u64 offset);
|
|
|
|
int btrfs_del_orphan_item(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root, u64 offset);
|
2009-09-22 03:56:00 +08:00
|
|
|
int btrfs_find_orphan_item(struct btrfs_root *root, u64 offset);
|
2008-07-25 00:17:14 +08:00
|
|
|
|
2007-03-27 04:00:06 +08:00
|
|
|
/* file-item.c */
|
2008-12-10 22:10:46 +08:00
|
|
|
int btrfs_del_csums(struct btrfs_trans_handle *trans,
|
Btrfs: fix missing data checksums after replaying a log tree
When logging a file that has shared extents (reflinked with other files or
with itself), we can end up logging multiple checksum items that cover
overlapping ranges. This confuses the search for checksums at log replay
time causing some checksums to never be added to the fs/subvolume tree.
Consider the following example of a file that shares the same extent at
offsets 0 and 256Kb:
[ bytenr 13893632, offset 64Kb, len 64Kb ]
0 64Kb
[ bytenr 13631488, offset 64Kb, len 192Kb ]
64Kb 256Kb
[ bytenr 13893632, offset 0, len 256Kb ]
256Kb 512Kb
When logging the inode, at tree-log.c:copy_items(), when processing the
file extent item at offset 0, we log a checksum item covering the range
13959168 to 14024704, which corresponds to 13893632 + 64Kb and 13893632 +
64Kb + 64Kb, respectively.
Later when processing the extent item at offset 256K, we log the checksums
for the range from 13893632 to 14155776 (which corresponds to 13893632 +
256Kb). These checksums get merged with the checksum item for the range
from 13631488 to 13893632 (13631488 + 256Kb), logged by a previous fsync.
So after this we get the two following checksum items in the log tree:
(...)
item 6 key (EXTENT_CSUM EXTENT_CSUM 13631488) itemoff 3095 itemsize 512
range start 13631488 end 14155776 length 524288
item 7 key (EXTENT_CSUM EXTENT_CSUM 13959168) itemoff 3031 itemsize 64
range start 13959168 end 14024704 length 65536
The first one covers the range from the second one, they overlap.
So far this does not cause a problem after replaying the log, because
when replaying the file extent item for offset 256K, we copy all the
checksums for the extent 13893632 from the log tree to the fs/subvolume
tree, since searching for an checksum item for bytenr 13893632 leaves us
at the first checksum item, which covers the whole range of the extent.
However if we write 64Kb to file offset 256Kb for example, we will
not be able to find and copy the checksums for the last 128Kb of the
extent at bytenr 13893632, referenced by the file range 384Kb to 512Kb.
After writing 64Kb into file offset 256Kb we get the following extent
layout for our file:
[ bytenr 13893632, offset 64K, len 64Kb ]
0 64Kb
[ bytenr 13631488, offset 64Kb, len 192Kb ]
64Kb 256Kb
[ bytenr 14155776, offset 0, len 64Kb ]
256Kb 320Kb
[ bytenr 13893632, offset 64Kb, len 192Kb ]
320Kb 512Kb
After fsync'ing the file, if we have a power failure and then mount
the filesystem to replay the log, the following happens:
1) When replaying the file extent item for file offset 320Kb, we
lookup for the checksums for the extent range from 13959168
(13893632 + 64Kb) to 14155776 (13893632 + 256Kb), through a call
to btrfs_lookup_csums_range();
2) btrfs_lookup_csums_range() finds the checksum item that starts
precisely at offset 13959168 (item 7 in the log tree, shown before);
3) However that checksum item only covers 64Kb of data, and not 192Kb
of data;
4) As a result only the checksums for the first 64Kb of data referenced
by the file extent item are found and copied to the fs/subvolume tree.
The remaining 128Kb of data, file range 384Kb to 512Kb, doesn't get
the corresponding data checksums found and copied to the fs/subvolume
tree.
5) After replaying the log userspace will not be able to read the file
range from 384Kb to 512Kb, because the checksums are missing and
resulting in an -EIO error.
The following steps reproduce this scenario:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt/sdc
$ xfs_io -f -c "pwrite -S 0xa3 0 256K" /mnt/sdc/foobar
$ xfs_io -c "fsync" /mnt/sdc/foobar
$ xfs_io -c "pwrite -S 0xc7 256K 256K" /mnt/sdc/foobar
$ xfs_io -c "reflink /mnt/sdc/foobar 320K 0 64K" /mnt/sdc/foobar
$ xfs_io -c "fsync" /mnt/sdc/foobar
$ xfs_io -c "pwrite -S 0xe5 256K 64K" /mnt/sdc/foobar
$ xfs_io -c "fsync" /mnt/sdc/foobar
<power failure>
$ mount /dev/sdc /mnt/sdc
$ md5sum /mnt/sdc/foobar
md5sum: /mnt/sdc/foobar: Input/output error
$ dmesg | tail
[165305.003464] BTRFS info (device sdc): no csum found for inode 257 start 401408
[165305.004014] BTRFS info (device sdc): no csum found for inode 257 start 405504
[165305.004559] BTRFS info (device sdc): no csum found for inode 257 start 409600
[165305.005101] BTRFS info (device sdc): no csum found for inode 257 start 413696
[165305.005627] BTRFS info (device sdc): no csum found for inode 257 start 417792
[165305.006134] BTRFS info (device sdc): no csum found for inode 257 start 421888
[165305.006625] BTRFS info (device sdc): no csum found for inode 257 start 425984
[165305.007278] BTRFS info (device sdc): no csum found for inode 257 start 430080
[165305.008248] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1
[165305.009550] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1
Fix this simply by deleting first any checksums, from the log tree, for the
range of the extent we are logging at copy_items(). This ensures we do not
get checksum items in the log tree that have overlapping ranges.
This is a long time issue that has been present since we have the clone
(and deduplication) ioctl, and can happen both when an extent is shared
between different files and within the same file.
A test case for fstests follows soon.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-06 00:58:30 +08:00
|
|
|
struct btrfs_root *root, u64 bytenr, u64 len);
|
btrfs: refactor btrfs_lookup_bio_sums to handle out-of-order bvecs
Refactor btrfs_lookup_bio_sums() by:
- Remove the @file_offset parameter
There are two factors making the @file_offset parameter useless:
* For csum lookup in csum tree, file offset makes no sense
We only need disk_bytenr, which is unrelated to file_offset
* page_offset (file offset) of each bvec is not contiguous.
Pages can be added to the same bio as long as their on-disk bytenr
is contiguous, meaning we could have pages at different file offsets
in the same bio.
Thus passing file_offset makes no sense any more.
The only user of file_offset is for data reloc inode, we will use
a new function, search_file_offset_in_bio(), to handle it.
- Extract the csum tree lookup into search_csum_tree()
The new function will handle the csum search in csum tree.
The return value is the same as btrfs_find_ordered_sum(), returning
the number of found sectors which have checksum.
- Change how we do the main loop
The only needed info from bio is:
* the on-disk bytenr
* the length
After extracting the above info, we can do the search without bio
at all, which makes the main loop much simpler:
for (cur_disk_bytenr = orig_disk_bytenr;
cur_disk_bytenr < orig_disk_bytenr + orig_len;
cur_disk_bytenr += count * sectorsize) {
/* Lookup csum tree */
count = search_csum_tree(fs_info, path, cur_disk_bytenr,
search_len, csum_dst);
if (!count) {
/* Csum hole handling */
}
}
- Use single variable as the source to calculate all other offsets
Instead of all different type of variables, we use only one main
variable, cur_disk_bytenr, which represents the current disk bytenr.
All involved values can be calculated from that variable, and
all those variable will only be visible in the inner loop.
The above refactoring makes btrfs_lookup_bio_sums() way more robust than
it used to be, especially related to the file offset lookup. Now
file_offset lookup is only related to data reloc inode, otherwise we
don't need to bother file_offset at all.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-02 14:48:06 +08:00
|
|
|
blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, u8 *dst);
|
2022-07-24 06:25:29 +08:00
|
|
|
int btrfs_insert_hole_extent(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root, u64 objectid, u64 pos,
|
|
|
|
u64 num_bytes);
|
2007-03-27 04:00:06 +08:00
|
|
|
int btrfs_lookup_file_extent(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path, u64 objectid,
|
2007-10-16 04:15:53 +08:00
|
|
|
u64 bytenr, int mod);
|
2008-02-21 01:07:25 +08:00
|
|
|
int btrfs_csum_file_blocks(struct btrfs_trans_handle *trans,
|
Btrfs: move data checksumming into a dedicated tree
Btrfs stores checksums for each data block. Until now, they have
been stored in the subvolume trees, indexed by the inode that is
referencing the data block. This means that when we read the inode,
we've probably read in at least some checksums as well.
But, this has a few problems:
* The checksums are indexed by logical offset in the file. When
compression is on, this means we have to do the expensive checksumming
on the uncompressed data. It would be faster if we could checksum
the compressed data instead.
* If we implement encryption, we'll be checksumming the plain text and
storing that on disk. This is significantly less secure.
* For either compression or encryption, we have to get the plain text
back before we can verify the checksum as correct. This makes the raid
layer balancing and extent moving much more expensive.
* It makes the front end caching code more complex, as we have touch
the subvolume and inodes as we cache extents.
* There is potentitally one copy of the checksum in each subvolume
referencing an extent.
The solution used here is to store the extent checksums in a dedicated
tree. This allows us to index the checksums by phyiscal extent
start and length. It means:
* The checksum is against the data stored on disk, after any compression
or encryption is done.
* The checksum is stored in a central location, and can be verified without
following back references, or reading inodes.
This makes compression significantly faster by reducing the amount of
data that needs to be checksummed. It will also allow much faster
raid management code in general.
The checksums are indexed by a key with a fixed objectid (a magic value
in ctree.h) and offset set to the starting byte of the extent. This
allows us to copy the checksum items into the fsync log tree directly (or
any other tree), without having to invent a second format for them.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-12-09 05:58:54 +08:00
|
|
|
struct btrfs_root *root,
|
2008-07-18 00:53:50 +08:00
|
|
|
struct btrfs_ordered_sum *sums);
|
2020-06-03 13:55:07 +08:00
|
|
|
blk_status_t btrfs_csum_one_bio(struct btrfs_inode *inode, struct bio *bio,
|
2019-11-07 07:38:43 +08:00
|
|
|
u64 offset, bool one_ordered);
|
2011-03-08 21:14:00 +08:00
|
|
|
int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
|
2022-09-13 03:27:43 +08:00
|
|
|
struct list_head *list, int search_commit,
|
|
|
|
bool nowait);
|
2017-02-20 19:51:02 +08:00
|
|
|
void btrfs_extent_item_to_extent_map(struct btrfs_inode *inode,
|
2014-06-09 10:48:05 +08:00
|
|
|
const struct btrfs_path *path,
|
|
|
|
struct btrfs_file_extent_item *fi,
|
|
|
|
const bool new_inline,
|
|
|
|
struct extent_map *em);
|
2020-01-17 22:02:21 +08:00
|
|
|
int btrfs_inode_clear_file_extent_range(struct btrfs_inode *inode, u64 start,
|
|
|
|
u64 len);
|
|
|
|
int btrfs_inode_set_file_extent_range(struct btrfs_inode *inode, u64 start,
|
|
|
|
u64 len);
|
2020-11-02 22:48:53 +08:00
|
|
|
void btrfs_inode_safe_disk_i_size_write(struct btrfs_inode *inode, u64 new_i_size);
|
2020-03-09 20:41:06 +08:00
|
|
|
u64 btrfs_file_extent_end(const struct btrfs_path *path);
|
2014-06-09 10:48:05 +08:00
|
|
|
|
2007-06-12 18:35:45 +08:00
|
|
|
/* inode.c */
|
2022-05-26 15:36:35 +08:00
|
|
|
void btrfs_submit_data_write_bio(struct inode *inode, struct bio *bio, int mirror_num);
|
|
|
|
void btrfs_submit_data_read_bio(struct inode *inode, struct bio *bio,
|
|
|
|
int mirror_num, enum btrfs_compression_type compress_type);
|
btrfs: introduce a data checksum checking helper
Although we have several data csum verification code, we never have a
function really just to verify checksum for one sector.
Function check_data_csum() do extra work for error reporting, thus it
requires a lot of extra things like file offset, bio_offset etc.
Function btrfs_verify_data_csum() is even worse, it will utilize page
checked flag, which means it can not be utilized for direct IO pages.
Here we introduce a new helper, btrfs_check_sector_csum(), which really
only accept a sector in page, and expected checksum pointer.
We use this function to implement check_data_csum(), and export it for
incoming patch.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
[hch: keep passing the csum array as an arguments, as the callers want
to print it, rename per request]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-22 19:47:48 +08:00
|
|
|
int btrfs_check_sector_csum(struct btrfs_fs_info *fs_info, struct page *page,
|
|
|
|
u32 pgoff, u8 *csum, const u8 * const csum_expected);
|
2022-07-07 13:33:30 +08:00
|
|
|
int btrfs_check_data_csum(struct inode *inode, struct btrfs_bio *bbio,
|
|
|
|
u32 bio_offset, struct page *page, u32 pgoff);
|
2021-09-15 15:17:18 +08:00
|
|
|
unsigned int btrfs_verify_data_csum(struct btrfs_bio *bbio,
|
|
|
|
u32 bio_offset, struct page *page,
|
|
|
|
u64 start, u64 end);
|
2022-07-07 13:33:29 +08:00
|
|
|
int btrfs_check_data_csum(struct inode *inode, struct btrfs_bio *bbio,
|
|
|
|
u32 bio_offset, struct page *page, u32 pgoff);
|
2013-08-15 02:02:47 +08:00
|
|
|
noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
|
2013-06-22 04:37:03 +08:00
|
|
|
u64 *orig_start, u64 *orig_block_len,
|
2022-09-13 03:27:43 +08:00
|
|
|
u64 *ram_bytes, bool nowait, bool strict);
|
2008-07-24 21:51:08 +08:00
|
|
|
|
2018-04-27 17:21:51 +08:00
|
|
|
void __btrfs_del_delalloc_inode(struct btrfs_root *root,
|
|
|
|
struct btrfs_inode *inode);
|
2008-11-18 10:02:50 +08:00
|
|
|
struct inode *btrfs_lookup_dentry(struct inode *dir, struct dentry *dentry);
|
2017-02-20 19:50:35 +08:00
|
|
|
int btrfs_set_inode_index(struct btrfs_inode *dir, u64 *index);
|
2008-09-06 04:13:11 +08:00
|
|
|
int btrfs_unlink_inode(struct btrfs_trans_handle *trans,
|
2017-01-18 06:31:44 +08:00
|
|
|
struct btrfs_inode *dir, struct btrfs_inode *inode,
|
2022-10-21 00:58:27 +08:00
|
|
|
const struct fscrypt_str *name);
|
2008-09-06 04:13:11 +08:00
|
|
|
int btrfs_add_link(struct btrfs_trans_handle *trans,
|
2017-02-20 19:51:08 +08:00
|
|
|
struct btrfs_inode *parent_inode, struct btrfs_inode *inode,
|
2022-10-21 00:58:27 +08:00
|
|
|
const struct fscrypt_str *name, int add_backref, u64 index);
|
2018-04-18 10:34:52 +08:00
|
|
|
int btrfs_delete_subvolume(struct inode *dir, struct dentry *dentry);
|
2020-11-02 22:49:03 +08:00
|
|
|
int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t len,
|
|
|
|
int front);
|
2008-09-06 04:13:11 +08:00
|
|
|
|
btrfs: fix deadlock when cloning inline extents and using qgroups
There are a few exceptional cases where cloning an inline extent needs to
copy the inline extent data into a page of the destination inode.
When this happens, we end up starting a transaction while having a dirty
page for the destination inode and while having the range locked in the
destination's inode iotree too. Because when reserving metadata space
for a transaction we may need to flush existing delalloc in case there is
not enough free space, we have a mechanism in place to prevent a deadlock,
which was introduced in commit 3d45f221ce627d ("btrfs: fix deadlock when
cloning inline extent and low on free metadata space").
However when using qgroups, a transaction also reserves metadata qgroup
space, which can also result in flushing delalloc in case there is not
enough available space at the moment. When this happens we deadlock, since
flushing delalloc requires locking the file range in the inode's iotree
and the range was already locked at the very beginning of the clone
operation, before attempting to start the transaction.
When this issue happens, stack traces like the following are reported:
[72747.556262] task:kworker/u81:9 state:D stack: 0 pid: 225 ppid: 2 flags:0x00004000
[72747.556268] Workqueue: writeback wb_workfn (flush-btrfs-1142)
[72747.556271] Call Trace:
[72747.556273] __schedule+0x296/0x760
[72747.556277] schedule+0x3c/0xa0
[72747.556279] io_schedule+0x12/0x40
[72747.556284] __lock_page+0x13c/0x280
[72747.556287] ? generic_file_readonly_mmap+0x70/0x70
[72747.556325] extent_write_cache_pages+0x22a/0x440 [btrfs]
[72747.556331] ? __set_page_dirty_nobuffers+0xe7/0x160
[72747.556358] ? set_extent_buffer_dirty+0x5e/0x80 [btrfs]
[72747.556362] ? update_group_capacity+0x25/0x210
[72747.556366] ? cpumask_next_and+0x1a/0x20
[72747.556391] extent_writepages+0x44/0xa0 [btrfs]
[72747.556394] do_writepages+0x41/0xd0
[72747.556398] __writeback_single_inode+0x39/0x2a0
[72747.556403] writeback_sb_inodes+0x1ea/0x440
[72747.556407] __writeback_inodes_wb+0x5f/0xc0
[72747.556410] wb_writeback+0x235/0x2b0
[72747.556414] ? get_nr_inodes+0x35/0x50
[72747.556417] wb_workfn+0x354/0x490
[72747.556420] ? newidle_balance+0x2c5/0x3e0
[72747.556424] process_one_work+0x1aa/0x340
[72747.556426] worker_thread+0x30/0x390
[72747.556429] ? create_worker+0x1a0/0x1a0
[72747.556432] kthread+0x116/0x130
[72747.556435] ? kthread_park+0x80/0x80
[72747.556438] ret_from_fork+0x1f/0x30
[72747.566958] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
[72747.566961] Call Trace:
[72747.566964] __schedule+0x296/0x760
[72747.566968] ? finish_wait+0x80/0x80
[72747.566970] schedule+0x3c/0xa0
[72747.566995] wait_extent_bit.constprop.68+0x13b/0x1c0 [btrfs]
[72747.566999] ? finish_wait+0x80/0x80
[72747.567024] lock_extent_bits+0x37/0x90 [btrfs]
[72747.567047] btrfs_invalidatepage+0x299/0x2c0 [btrfs]
[72747.567051] ? find_get_pages_range_tag+0x2cd/0x380
[72747.567076] __extent_writepage+0x203/0x320 [btrfs]
[72747.567102] extent_write_cache_pages+0x2bb/0x440 [btrfs]
[72747.567106] ? update_load_avg+0x7e/0x5f0
[72747.567109] ? enqueue_entity+0xf4/0x6f0
[72747.567134] extent_writepages+0x44/0xa0 [btrfs]
[72747.567137] ? enqueue_task_fair+0x93/0x6f0
[72747.567140] do_writepages+0x41/0xd0
[72747.567144] __filemap_fdatawrite_range+0xc7/0x100
[72747.567167] btrfs_run_delalloc_work+0x17/0x40 [btrfs]
[72747.567195] btrfs_work_helper+0xc2/0x300 [btrfs]
[72747.567200] process_one_work+0x1aa/0x340
[72747.567202] worker_thread+0x30/0x390
[72747.567205] ? create_worker+0x1a0/0x1a0
[72747.567208] kthread+0x116/0x130
[72747.567211] ? kthread_park+0x80/0x80
[72747.567214] ret_from_fork+0x1f/0x30
[72747.569686] task:fsstress state:D stack: 0 pid:841421 ppid:841417 flags:0x00000000
[72747.569689] Call Trace:
[72747.569691] __schedule+0x296/0x760
[72747.569694] schedule+0x3c/0xa0
[72747.569721] try_flush_qgroup+0x95/0x140 [btrfs]
[72747.569725] ? finish_wait+0x80/0x80
[72747.569753] btrfs_qgroup_reserve_data+0x34/0x50 [btrfs]
[72747.569781] btrfs_check_data_free_space+0x5f/0xa0 [btrfs]
[72747.569804] btrfs_buffered_write+0x1f7/0x7f0 [btrfs]
[72747.569810] ? path_lookupat.isra.48+0x97/0x140
[72747.569833] btrfs_file_write_iter+0x81/0x410 [btrfs]
[72747.569836] ? __kmalloc+0x16a/0x2c0
[72747.569839] do_iter_readv_writev+0x160/0x1c0
[72747.569843] do_iter_write+0x80/0x1b0
[72747.569847] vfs_writev+0x84/0x140
[72747.569869] ? btrfs_file_llseek+0x38/0x270 [btrfs]
[72747.569873] do_writev+0x65/0x100
[72747.569876] do_syscall_64+0x33/0x40
[72747.569879] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[72747.569899] task:fsstress state:D stack: 0 pid:841424 ppid:841417 flags:0x00004000
[72747.569903] Call Trace:
[72747.569906] __schedule+0x296/0x760
[72747.569909] schedule+0x3c/0xa0
[72747.569936] try_flush_qgroup+0x95/0x140 [btrfs]
[72747.569940] ? finish_wait+0x80/0x80
[72747.569967] __btrfs_qgroup_reserve_meta+0x36/0x50 [btrfs]
[72747.569989] start_transaction+0x279/0x580 [btrfs]
[72747.570014] clone_copy_inline_extent+0x332/0x490 [btrfs]
[72747.570041] btrfs_clone+0x5b7/0x7a0 [btrfs]
[72747.570068] ? lock_extent_bits+0x64/0x90 [btrfs]
[72747.570095] btrfs_clone_files+0xfc/0x150 [btrfs]
[72747.570122] btrfs_remap_file_range+0x3d8/0x4a0 [btrfs]
[72747.570126] do_clone_file_range+0xed/0x200
[72747.570131] vfs_clone_file_range+0x37/0x110
[72747.570134] ioctl_file_clone+0x7d/0xb0
[72747.570137] do_vfs_ioctl+0x138/0x630
[72747.570140] __x64_sys_ioctl+0x62/0xc0
[72747.570143] do_syscall_64+0x33/0x40
[72747.570146] entry_SYSCALL_64_after_hwframe+0x44/0xa9
So fix this by skipping the flush of delalloc for an inode that is
flagged with BTRFS_INODE_NO_DELALLOC_FLUSH, meaning it is currently under
such a special case of cloning an inline extent, when flushing delalloc
during qgroup metadata reservation.
The special cases for cloning inline extents were added in kernel 5.7 by
by commit 05a5a7621ce66c ("Btrfs: implement full reflink support for
inline extents"), while having qgroup metadata space reservation flushing
delalloc when low on space was added in kernel 5.9 by commit
c53e9653605dbf ("btrfs: qgroup: try to flush qgroup space when we get
-EDQUOT"). So use a "Fixes:" tag for the later commit to ease stable
kernel backports.
Reported-by: Wang Yugui <wangyugui@e16-tech.com>
Link: https://lore.kernel.org/linux-btrfs/20210421083137.31E3.409509F4@e16-tech.com/
Fixes: c53e9653605dbf ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT")
CC: stable@vger.kernel.org # 5.9+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-22 19:08:05 +08:00
|
|
|
int btrfs_start_delalloc_snapshot(struct btrfs_root *root, bool in_reclaim_context);
|
2021-01-11 18:58:11 +08:00
|
|
|
int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, long nr,
|
btrfs: fix deadlock when cloning inline extent and low on free metadata space
When cloning an inline extent there are cases where we can not just copy
the inline extent from the source range to the target range (e.g. when the
target range starts at an offset greater than zero). In such cases we copy
the inline extent's data into a page of the destination inode and then
dirty that page. However, after that we will need to start a transaction
for each processed extent and, if we are ever low on available metadata
space, we may need to flush existing delalloc for all dirty inodes in an
attempt to release metadata space - if that happens we may deadlock:
* the async reclaim task queued a delalloc work to flush delalloc for
the destination inode of the clone operation;
* the task executing that delalloc work gets blocked waiting for the
range with the dirty page to be unlocked, which is currently locked
by the task doing the clone operation;
* the async reclaim task blocks waiting for the delalloc work to complete;
* the cloning task is waiting on the waitqueue of its reservation ticket
while holding the range with the dirty page locked in the inode's
io_tree;
* if metadata space is not released by some other task (like delalloc for
some other inode completing for example), the clone task waits forever
and as a consequence the delalloc work and async reclaim tasks will hang
forever as well. Releasing more space on the other hand may require
starting a transaction, which will hang as well when trying to reserve
metadata space, resulting in a deadlock between all these tasks.
When this happens, traces like the following show up in dmesg/syslog:
[87452.323003] INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds.
[87452.323644] Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
[87452.324248] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[87452.324852] task:kworker/u16:11 state:D stack: 0 pid:1810830 ppid: 2 flags:0x00004000
[87452.325520] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
[87452.326136] Call Trace:
[87452.326737] __schedule+0x5d1/0xcf0
[87452.327390] schedule+0x45/0xe0
[87452.328174] lock_extent_bits+0x1e6/0x2d0 [btrfs]
[87452.328894] ? finish_wait+0x90/0x90
[87452.329474] btrfs_invalidatepage+0x32c/0x390 [btrfs]
[87452.330133] ? __mod_memcg_state+0x8e/0x160
[87452.330738] __extent_writepage+0x2d4/0x400 [btrfs]
[87452.331405] extent_write_cache_pages+0x2b2/0x500 [btrfs]
[87452.332007] ? lock_release+0x20e/0x4c0
[87452.332557] ? trace_hardirqs_on+0x1b/0xf0
[87452.333127] extent_writepages+0x43/0x90 [btrfs]
[87452.333653] ? lock_acquire+0x1a3/0x490
[87452.334177] do_writepages+0x43/0xe0
[87452.334699] ? __filemap_fdatawrite_range+0xa4/0x100
[87452.335720] __filemap_fdatawrite_range+0xc5/0x100
[87452.336500] btrfs_run_delalloc_work+0x17/0x40 [btrfs]
[87452.337216] btrfs_work_helper+0xf1/0x600 [btrfs]
[87452.337838] process_one_work+0x24e/0x5e0
[87452.338437] worker_thread+0x50/0x3b0
[87452.339137] ? process_one_work+0x5e0/0x5e0
[87452.339884] kthread+0x153/0x170
[87452.340507] ? kthread_mod_delayed_work+0xc0/0xc0
[87452.341153] ret_from_fork+0x22/0x30
[87452.341806] INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds.
[87452.342487] Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
[87452.343274] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[87452.344049] task:kworker/u16:1 state:D stack: 0 pid:2426217 ppid: 2 flags:0x00004000
[87452.344974] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
[87452.345655] Call Trace:
[87452.346305] __schedule+0x5d1/0xcf0
[87452.346947] ? kvm_clock_read+0x14/0x30
[87452.347676] ? wait_for_completion+0x81/0x110
[87452.348389] schedule+0x45/0xe0
[87452.349077] schedule_timeout+0x30c/0x580
[87452.349718] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[87452.350340] ? lock_acquire+0x1a3/0x490
[87452.351006] ? try_to_wake_up+0x7a/0xa20
[87452.351541] ? lock_release+0x20e/0x4c0
[87452.352040] ? lock_acquired+0x199/0x490
[87452.352517] ? wait_for_completion+0x81/0x110
[87452.353000] wait_for_completion+0xab/0x110
[87452.353490] start_delalloc_inodes+0x2af/0x390 [btrfs]
[87452.353973] btrfs_start_delalloc_roots+0x12d/0x250 [btrfs]
[87452.354455] flush_space+0x24f/0x660 [btrfs]
[87452.355063] btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs]
[87452.355565] process_one_work+0x24e/0x5e0
[87452.356024] worker_thread+0x20f/0x3b0
[87452.356487] ? process_one_work+0x5e0/0x5e0
[87452.356973] kthread+0x153/0x170
[87452.357434] ? kthread_mod_delayed_work+0xc0/0xc0
[87452.357880] ret_from_fork+0x22/0x30
(...)
< stack traces of several tasks waiting for the locks of the inodes of the
clone operation >
(...)
[92867.444138] RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000246 ORIG_RAX: 0000000000000052
[92867.444624] RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73f97
[92867.445116] RDX: 0000000000000000 RSI: 0000560fbd5d7a40 RDI: 0000560fbd5d8960
[92867.445595] RBP: 00007ffc3371beb0 R08: 0000000000000001 R09: 0000000000000003
[92867.446070] R10: 00007ffc3371b996 R11: 0000000000000246 R12: 0000000000000000
[92867.446820] R13: 000000000000001f R14: 00007ffc3371bea0 R15: 00007ffc3371beb0
[92867.447361] task:fsstress state:D stack: 0 pid:2508238 ppid:2508153 flags:0x00004000
[92867.447920] Call Trace:
[92867.448435] __schedule+0x5d1/0xcf0
[92867.448934] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[92867.449423] schedule+0x45/0xe0
[92867.449916] __reserve_bytes+0x4a4/0xb10 [btrfs]
[92867.450576] ? finish_wait+0x90/0x90
[92867.451202] btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs]
[92867.451815] btrfs_block_rsv_add+0x1f/0x50 [btrfs]
[92867.452412] start_transaction+0x2d1/0x760 [btrfs]
[92867.453216] clone_copy_inline_extent+0x333/0x490 [btrfs]
[92867.453848] ? lock_release+0x20e/0x4c0
[92867.454539] ? btrfs_search_slot+0x9a7/0xc30 [btrfs]
[92867.455218] btrfs_clone+0x569/0x7e0 [btrfs]
[92867.455952] btrfs_clone_files+0xf6/0x150 [btrfs]
[92867.456588] btrfs_remap_file_range+0x324/0x3d0 [btrfs]
[92867.457213] do_clone_file_range+0xd4/0x1f0
[92867.457828] vfs_clone_file_range+0x4d/0x230
[92867.458355] ? lock_release+0x20e/0x4c0
[92867.458890] ioctl_file_clone+0x8f/0xc0
[92867.459377] do_vfs_ioctl+0x342/0x750
[92867.459913] __x64_sys_ioctl+0x62/0xb0
[92867.460377] do_syscall_64+0x33/0x80
[92867.460842] entry_SYSCALL_64_after_hwframe+0x44/0xa9
(...)
< stack traces of more tasks blocked on metadata reservation like the clone
task above, because the async reclaim task has deadlocked >
(...)
Another thing to notice is that the worker task that is deadlocked when
trying to flush the destination inode of the clone operation is at
btrfs_invalidatepage(). This is simply because the clone operation has a
destination offset greater than the i_size and we only update the i_size
of the destination file after cloning an extent (just like we do in the
buffered write path).
Since the async reclaim path uses btrfs_start_delalloc_roots() to trigger
the flushing of delalloc for all inodes that have delalloc, add a runtime
flag to an inode to signal it should not be flushed, and for inodes with
that flag set, start_delalloc_inodes() will simply skip them. When the
cloning code needs to dirty a page to copy an inline extent, set that flag
on the inode and then clear it when the clone operation finishes.
This could be sporadically triggered with test case generic/269 from
fstests, which exercises many fsstress processes running in parallel with
several dd processes filling up the entire filesystem.
CC: stable@vger.kernel.org # 5.9+
Fixes: 05a5a7621ce6 ("Btrfs: implement full reflink support for inline extents")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-02 19:55:58 +08:00
|
|
|
bool in_reclaim_context);
|
2020-06-03 13:55:35 +08:00
|
|
|
int btrfs_set_extent_delalloc(struct btrfs_inode *inode, u64 start, u64 end,
|
2017-11-04 08:16:59 +08:00
|
|
|
unsigned int extra_bits,
|
2019-07-17 21:18:17 +08:00
|
|
|
struct extent_state **cached_state);
|
2022-03-15 09:12:34 +08:00
|
|
|
struct btrfs_new_inode_args {
|
|
|
|
/* Input */
|
|
|
|
struct inode *dir;
|
|
|
|
struct dentry *dentry;
|
|
|
|
struct inode *inode;
|
|
|
|
bool orphan;
|
|
|
|
bool subvol;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Output from btrfs_new_inode_prepare(), input to
|
|
|
|
* btrfs_create_new_inode().
|
|
|
|
*/
|
|
|
|
struct posix_acl *default_acl;
|
|
|
|
struct posix_acl *acl;
|
2022-10-21 00:58:26 +08:00
|
|
|
struct fscrypt_name fname;
|
2022-03-15 09:12:34 +08:00
|
|
|
};
|
|
|
|
int btrfs_new_inode_prepare(struct btrfs_new_inode_args *args,
|
|
|
|
unsigned int *trans_num_items);
|
|
|
|
int btrfs_create_new_inode(struct btrfs_trans_handle *trans,
|
btrfs: move common inode creation code into btrfs_create_new_inode()
All of our inode creation code paths duplicate the calls to
btrfs_init_inode_security() and btrfs_add_link(). Subvolume creation
additionally duplicates property inheritance and the call to
btrfs_set_inode_index(). Fix this by moving the common code into
btrfs_create_new_inode(). This accomplishes a few things at once:
1. It reduces code duplication.
2. It allows us to set up the inode completely before inserting the
inode item, removing calls to btrfs_update_inode().
3. It fixes a leak of an inode on disk in some error cases. For example,
in btrfs_create(), if btrfs_new_inode() succeeds, then we have
inserted an inode item and its inode ref. However, if something after
that fails (e.g., btrfs_init_inode_security()), then we end the
transaction and then decrement the link count on the inode. If the
transaction is committed and the system crashes before the failed
inode is deleted, then we leak that inode on disk. Instead, this
refactoring aborts the transaction when we can't recover more
gracefully.
4. It exposes various ways that subvolume creation diverges from mkdir
in terms of inheriting flags, properties, permissions, and POSIX
ACLs, a lot of which appears to be accidental. This patch explicitly
does _not_ change the existing non-standard behavior, but it makes
those differences more clear in the code and documents them so that
we can discuss whether they should be changed.
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-15 09:12:35 +08:00
|
|
|
struct btrfs_new_inode_args *args);
|
2022-03-15 09:12:34 +08:00
|
|
|
void btrfs_new_inode_args_destroy(struct btrfs_new_inode_args *args);
|
2022-03-15 09:12:32 +08:00
|
|
|
struct inode *btrfs_new_subvol_inode(struct user_namespace *mnt_userns,
|
|
|
|
struct inode *dir);
|
2018-11-08 16:18:08 +08:00
|
|
|
void btrfs_set_delalloc_extent(struct inode *inode, struct extent_state *state,
|
2020-06-25 23:54:54 +08:00
|
|
|
u32 bits);
|
2018-11-01 20:09:51 +08:00
|
|
|
void btrfs_clear_delalloc_extent(struct inode *inode,
|
2020-06-25 23:54:54 +08:00
|
|
|
struct extent_state *state, u32 bits);
|
2018-11-01 20:09:52 +08:00
|
|
|
void btrfs_merge_delalloc_extent(struct inode *inode, struct extent_state *new,
|
|
|
|
struct extent_state *other);
|
2018-11-01 20:09:53 +08:00
|
|
|
void btrfs_split_delalloc_extent(struct inode *inode,
|
|
|
|
struct extent_state *orig, u64 split);
|
2021-05-31 16:50:49 +08:00
|
|
|
void btrfs_set_range_writeback(struct btrfs_inode *inode, u64 start, u64 end);
|
2018-06-06 22:24:44 +08:00
|
|
|
vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf);
|
2010-06-07 23:35:40 +08:00
|
|
|
void btrfs_evict_inode(struct inode *inode);
|
2007-06-12 18:35:45 +08:00
|
|
|
struct inode *btrfs_alloc_inode(struct super_block *sb);
|
|
|
|
void btrfs_destroy_inode(struct inode *inode);
|
2019-04-11 03:14:41 +08:00
|
|
|
void btrfs_free_inode(struct inode *inode);
|
2010-06-08 01:43:19 +08:00
|
|
|
int btrfs_drop_inode(struct inode *inode);
|
2017-11-03 07:21:50 +08:00
|
|
|
int __init btrfs_init_cachep(void);
|
2018-02-20 00:24:18 +08:00
|
|
|
void __cold btrfs_destroy_cachep(void);
|
2020-05-16 01:35:59 +08:00
|
|
|
struct inode *btrfs_iget_path(struct super_block *s, u64 ino,
|
2019-10-04 01:09:35 +08:00
|
|
|
struct btrfs_root *root, struct btrfs_path *path);
|
2020-05-16 01:35:59 +08:00
|
|
|
struct inode *btrfs_iget(struct super_block *s, u64 ino, struct btrfs_root *root);
|
2017-02-20 19:51:06 +08:00
|
|
|
struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
|
2018-08-25 13:47:59 +08:00
|
|
|
struct page *page, size_t pg_offset,
|
2019-12-03 09:34:23 +08:00
|
|
|
u64 start, u64 end);
|
2007-08-28 04:49:44 +08:00
|
|
|
int btrfs_update_inode(struct btrfs_trans_handle *trans,
|
2020-11-02 22:48:59 +08:00
|
|
|
struct btrfs_root *root, struct btrfs_inode *inode);
|
2012-10-23 03:43:12 +08:00
|
|
|
int btrfs_update_inode_fallback(struct btrfs_trans_handle *trans,
|
2020-11-02 22:49:06 +08:00
|
|
|
struct btrfs_root *root, struct btrfs_inode *inode);
|
2017-02-20 19:50:59 +08:00
|
|
|
int btrfs_orphan_add(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_inode *inode);
|
2011-02-01 05:22:42 +08:00
|
|
|
int btrfs_orphan_cleanup(struct btrfs_root *root);
|
2020-11-02 22:49:04 +08:00
|
|
|
int btrfs_cont_expand(struct btrfs_inode *inode, loff_t oldsize, loff_t size);
|
2009-11-12 17:36:34 +08:00
|
|
|
void btrfs_add_delayed_iput(struct inode *inode);
|
2016-06-23 06:54:24 +08:00
|
|
|
void btrfs_run_delayed_iputs(struct btrfs_fs_info *fs_info);
|
2018-12-04 00:06:52 +08:00
|
|
|
int btrfs_wait_on_delayed_iputs(struct btrfs_fs_info *fs_info);
|
2010-05-16 22:49:59 +08:00
|
|
|
int btrfs_prealloc_file_range(struct inode *inode, int mode,
|
|
|
|
u64 start, u64 num_bytes, u64 min_size,
|
|
|
|
loff_t actual_len, u64 *alloc_hint);
|
2010-06-22 02:48:16 +08:00
|
|
|
int btrfs_prealloc_file_range_trans(struct inode *inode,
|
|
|
|
struct btrfs_trans_handle *trans, int mode,
|
|
|
|
u64 start, u64 num_bytes, u64 min_size,
|
|
|
|
loff_t actual_len, u64 *alloc_hint);
|
2020-06-03 13:55:29 +08:00
|
|
|
int btrfs_run_delalloc_range(struct btrfs_inode *inode, struct page *locked_page,
|
2018-11-01 20:09:46 +08:00
|
|
|
u64 start, u64 end, int *page_started, unsigned long *nr_written,
|
|
|
|
struct writeback_control *wbc);
|
2021-07-27 13:41:32 +08:00
|
|
|
int btrfs_writepage_cow_fixup(struct page *page);
|
2021-04-08 20:32:27 +08:00
|
|
|
void btrfs_writepage_endio_finish_ordered(struct btrfs_inode *inode,
|
|
|
|
struct page *page, u64 start,
|
2021-07-26 20:15:08 +08:00
|
|
|
u64 end, bool uptodate);
|
2022-03-18 01:25:42 +08:00
|
|
|
int btrfs_encoded_io_compression_from_extent(struct btrfs_fs_info *fs_info,
|
|
|
|
int compress_type);
|
|
|
|
int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
|
|
|
|
u64 file_offset, u64 disk_bytenr,
|
|
|
|
u64 disk_io_size,
|
|
|
|
struct page **pages);
|
btrfs: add BTRFS_IOC_ENCODED_READ ioctl
There are 4 main cases:
1. Inline extents: we copy the data straight out of the extent buffer.
2. Hole/preallocated extents: we fill in zeroes.
3. Regular, uncompressed extents: we read the sectors we need directly
from disk.
4. Regular, compressed extents: we read the entire compressed extent
from disk and indicate what subset of the decompressed extent is in
the file.
This initial implementation simplifies a few things that can be improved
in the future:
- Cases 1, 3, and 4 allocate temporary memory to read into before
copying out to userspace.
- We don't do read repair, because it turns out that read repair is
currently broken for compressed data.
- We hold the inode lock during the operation.
Note that we don't need to hold the mmap lock. We may race with
btrfs_page_mkwrite() and read the old data from before the page was
dirtied:
btrfs_page_mkwrite btrfs_encoded_read
---------------------------------------------------
(enter) (enter)
btrfs_wait_ordered_range
lock_extent_bits
btrfs_page_set_dirty
unlock_extent_cached
(exit)
lock_extent_bits
read extent (dirty page hasn't been flushed,
so this is the old data)
unlock_extent_cached
(exit)
we read the old data from before the page was dirtied. But, that's true
even if we were to hold the mmap lock:
btrfs_page_mkwrite btrfs_encoded_read
-------------------------------------------------------------------
(enter) (enter)
btrfs_inode_lock(BTRFS_ILOCK_MMAP)
down_read(i_mmap_lock) (blocked)
btrfs_wait_ordered_range
lock_extent_bits
read extent (page hasn't been dirtied,
so this is the old data)
unlock_extent_cached
btrfs_inode_unlock(BTRFS_ILOCK_MMAP)
down_read(i_mmap_lock) returns
lock_extent_bits
btrfs_page_set_dirty
unlock_extent_cached
In other words, this is inherently racy, so it's fine that we return the
old data in this tiny window.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-10-10 08:59:07 +08:00
|
|
|
ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter,
|
|
|
|
struct btrfs_ioctl_encoded_io_args *encoded);
|
2019-08-14 07:00:02 +08:00
|
|
|
ssize_t btrfs_do_encoded_write(struct kiocb *iocb, struct iov_iter *from,
|
|
|
|
const struct btrfs_ioctl_encoded_io_args *encoded);
|
btrfs: add BTRFS_IOC_ENCODED_READ ioctl
There are 4 main cases:
1. Inline extents: we copy the data straight out of the extent buffer.
2. Hole/preallocated extents: we fill in zeroes.
3. Regular, uncompressed extents: we read the sectors we need directly
from disk.
4. Regular, compressed extents: we read the entire compressed extent
from disk and indicate what subset of the decompressed extent is in
the file.
This initial implementation simplifies a few things that can be improved
in the future:
- Cases 1, 3, and 4 allocate temporary memory to read into before
copying out to userspace.
- We don't do read repair, because it turns out that read repair is
currently broken for compressed data.
- We hold the inode lock during the operation.
Note that we don't need to hold the mmap lock. We may race with
btrfs_page_mkwrite() and read the old data from before the page was
dirtied:
btrfs_page_mkwrite btrfs_encoded_read
---------------------------------------------------
(enter) (enter)
btrfs_wait_ordered_range
lock_extent_bits
btrfs_page_set_dirty
unlock_extent_cached
(exit)
lock_extent_bits
read extent (dirty page hasn't been flushed,
so this is the old data)
unlock_extent_cached
(exit)
we read the old data from before the page was dirtied. But, that's true
even if we were to hold the mmap lock:
btrfs_page_mkwrite btrfs_encoded_read
-------------------------------------------------------------------
(enter) (enter)
btrfs_inode_lock(BTRFS_ILOCK_MMAP)
down_read(i_mmap_lock) (blocked)
btrfs_wait_ordered_range
lock_extent_bits
read extent (page hasn't been dirtied,
so this is the old data)
unlock_extent_cached
btrfs_inode_unlock(BTRFS_ILOCK_MMAP)
down_read(i_mmap_lock) returns
lock_extent_bits
btrfs_page_set_dirty
unlock_extent_cached
In other words, this is inherently racy, so it's fine that we return the
old data in this tiny window.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-10-10 08:59:07 +08:00
|
|
|
|
btrfs: fix lost file sync on direct IO write with nowait and dsync iocb
When doing a direct IO write using a iocb with nowait and dsync set, we
end up not syncing the file once the write completes.
This is because we tell iomap to not call generic_write_sync(), which
would result in calling btrfs_sync_file(), in order to avoid a deadlock
since iomap can call it while we are holding the inode's lock and
btrfs_sync_file() needs to acquire the inode's lock. The deadlock happens
only if the write happens synchronously, when iomap_dio_rw() calls
iomap_dio_complete() before it returns. Instead we do the sync ourselves
at btrfs_do_write_iter().
For a nowait write however we can end up not doing the sync ourselves at
at btrfs_do_write_iter() because the write could have been queued, and
therefore we get -EIOCBQUEUED returned from iomap in such case. That makes
us skip the sync call at btrfs_do_write_iter(), as we don't do it for
any error returned from btrfs_direct_write(). We can't simply do the call
even if -EIOCBQUEUED is returned, since that would block the task waiting
for IO, both for the data since there are bios still in progress as well
as potentially blocking when joining a log transaction and when syncing
the log (writing log trees, super blocks, etc).
So let iomap do the sync call itself and in order to avoid deadlocks for
the case of synchronous writes (without nowait), use __iomap_dio_rw() and
have ourselves call iomap_dio_complete() after unlocking the inode.
A test case will later be sent for fstests, after this is fixed in Linus'
tree.
Fixes: 51bd9563b678 ("btrfs: fix deadlock due to page faults during direct IO reads and writes")
Reported-by: Марк Коренберг <socketpair@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAEmTpZGRKbzc16fWPvxbr6AfFsQoLmz-Lcg-7OgJOZDboJ+SGQ@mail.gmail.com/
CC: stable@vger.kernel.org # 6.0+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-10-28 20:15:35 +08:00
|
|
|
ssize_t btrfs_dio_read(struct kiocb *iocb, struct iov_iter *iter,
|
|
|
|
size_t done_before);
|
|
|
|
struct iomap_dio *btrfs_dio_write(struct kiocb *iocb, struct iov_iter *iter,
|
|
|
|
size_t done_before);
|
2022-05-06 04:11:09 +08:00
|
|
|
|
2009-10-09 21:54:36 +08:00
|
|
|
extern const struct dentry_operations btrfs_dentry_operations;
|
2008-06-12 09:53:53 +08:00
|
|
|
|
2020-09-25 00:39:16 +08:00
|
|
|
/* Inode locking type flags, by default the exclusive lock is taken */
|
2022-09-09 23:27:45 +08:00
|
|
|
enum btrfs_ilock_type {
|
|
|
|
ENUM_BIT(BTRFS_ILOCK_SHARED),
|
|
|
|
ENUM_BIT(BTRFS_ILOCK_TRY),
|
|
|
|
ENUM_BIT(BTRFS_ILOCK_MMAP),
|
|
|
|
};
|
2020-09-25 00:39:16 +08:00
|
|
|
|
|
|
|
int btrfs_inode_lock(struct inode *inode, unsigned int ilock_flags);
|
|
|
|
void btrfs_inode_unlock(struct inode *inode, unsigned int ilock_flags);
|
btrfs: update the number of bytes used by an inode atomically
There are several occasions where we do not update the inode's number of
used bytes atomically, resulting in a concurrent stat(2) syscall to report
a value of used blocks that does not correspond to a valid value, that is,
a value that does not match neither what we had before the operation nor
what we get after the operation completes.
In extreme cases it can result in stat(2) reporting zero used blocks, which
can cause problems for some userspace tools where they can consider a file
with a non-zero size and zero used blocks as completely sparse and skip
reading data, as reported/discussed a long time ago in some threads like
the following:
https://lists.gnu.org/archive/html/bug-tar/2016-07/msg00001.html
The cases where this can happen are the following:
-> Case 1
If we do a write (buffered or direct IO) against a file region for which
there is already an allocated extent (or multiple extents), then we have a
short time window where we can report a number of used blocks to stat(2)
that does not take into account the file region being overwritten. This
short time window happens when completing the ordered extent(s).
This happens because when we drop the extents in the write range we
decrement the inode's number of bytes and later on when we insert the new
extent(s) we increment the number of bytes in the inode, resulting in a
short time window where a stat(2) syscall can get an incorrect number of
used blocks.
If we do writes that overwrite an entire file, then we have a short time
window where we report 0 used blocks to stat(2).
Example reproducer:
$ cat reproducer-1.sh
#!/bin/bash
MNT=/mnt/sdi
DEV=/dev/sdi
stat_loop()
{
trap "wait; exit" SIGTERM
local filepath=$1
local expected=$2
local got
while :; do
got=$(stat -c %b $filepath)
if [ $got -ne $expected ]; then
echo -n "ERROR: unexpected used blocks"
echo " (got: $got expected: $expected)"
fi
done
}
mkfs.btrfs -f $DEV > /dev/null
# mkfs.xfs -f $DEV > /dev/null
# mkfs.ext4 -F $DEV > /dev/null
# mkfs.f2fs -f $DEV > /dev/null
# mkfs.reiserfs -f $DEV > /dev/null
mount $DEV $MNT
xfs_io -f -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null
expected=$(stat -c %b $MNT/foobar)
# Create a process to keep calling stat(2) on the file and see if the
# reported number of blocks used (disk space used) changes, it should
# not because we are not increasing the file size nor punching holes.
stat_loop $MNT/foobar $expected &
loop_pid=$!
for ((i = 0; i < 50000; i++)); do
xfs_io -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null
done
kill $loop_pid &> /dev/null
wait
umount $DEV
$ ./reproducer-1.sh
ERROR: unexpected used blocks (got: 0 expected: 128)
ERROR: unexpected used blocks (got: 0 expected: 128)
(...)
Note that since this is a short time window where the race can happen, the
reproducer may not be able to always trigger the bug in one run, or it may
trigger it multiple times.
-> Case 2
If we do a buffered write against a file region that does not have any
allocated extents, like a hole or beyond EOF, then during ordered extent
completion we have a short time window where a concurrent stat(2) syscall
can report a number of used blocks that does not correspond to the value
before or after the write operation, a value that is actually larger than
the value after the write completes.
This happens because once we start a buffered write into an unallocated
file range we increment the inode's 'new_delalloc_bytes', to make sure
any stat(2) call gets a correct used blocks value before delalloc is
flushed and completes. However at ordered extent completion, after we
inserted the new extent, we increment the inode's number of bytes used
with the size of the new extent, and only later, when clearing the range
in the inode's iotree, we decrement the inode's 'new_delalloc_bytes'
counter with the size of the extent. So this results in a short time
window where a concurrent stat(2) syscall can report a number of used
blocks that accounts for the new extent twice.
Example reproducer:
$ cat reproducer-2.sh
#!/bin/bash
MNT=/mnt/sdi
DEV=/dev/sdi
stat_loop()
{
trap "wait; exit" SIGTERM
local filepath=$1
local expected=$2
local got
while :; do
got=$(stat -c %b $filepath)
if [ $got -ne $expected ]; then
echo -n "ERROR: unexpected used blocks"
echo " (got: $got expected: $expected)"
fi
done
}
mkfs.btrfs -f $DEV > /dev/null
# mkfs.xfs -f $DEV > /dev/null
# mkfs.ext4 -F $DEV > /dev/null
# mkfs.f2fs -f $DEV > /dev/null
# mkfs.reiserfs -f $DEV > /dev/null
mount $DEV $MNT
touch $MNT/foobar
write_size=$((64 * 1024))
for ((i = 0; i < 16384; i++)); do
offset=$(($i * $write_size))
xfs_io -c "pwrite -S 0xab $offset $write_size" $MNT/foobar >/dev/null
blocks_used=$(stat -c %b $MNT/foobar)
# Fsync the file to trigger writeback and keep calling stat(2) on it
# to see if the number of blocks used changes.
stat_loop $MNT/foobar $blocks_used &
loop_pid=$!
xfs_io -c "fsync" $MNT/foobar
kill $loop_pid &> /dev/null
wait $loop_pid
done
umount $DEV
$ ./reproducer-2.sh
ERROR: unexpected used blocks (got: 265472 expected: 265344)
ERROR: unexpected used blocks (got: 284032 expected: 283904)
(...)
Note that since this is a short time window where the race can happen, the
reproducer may not be able to always trigger the bug in one run, or it may
trigger it multiple times.
-> Case 3
Another case where such problems happen is during other operations that
replace extents in a file range with other extents. Those operations are
extent cloning, deduplication and fallocate's zero range operation.
The cause of the problem is similar to the first case. When we drop the
extents from a range, we decrement the inode's number of bytes, and later
on, after inserting the new extents we increment it. Since this is not
done atomically, a concurrent stat(2) call can see and return a number of
used blocks that is smaller than it should be, does not match the number
of used blocks before or after the clone/deduplication/zero operation.
Like for the first case, when doing a clone, deduplication or zero range
operation against an entire file, we end up having a time window where we
can report 0 used blocks to a stat(2) call.
Example reproducer:
$ cat reproducer-3.sh
#!/bin/bash
MNT=/mnt/sdi
DEV=/dev/sdi
mkfs.btrfs -f $DEV > /dev/null
# mkfs.xfs -f -m reflink=1 $DEV > /dev/null
mount $DEV $MNT
extent_size=$((64 * 1024))
num_extents=16384
file_size=$(($extent_size * $num_extents))
# File foo has many small extents.
xfs_io -f -s -c "pwrite -S 0xab -b $extent_size 0 $file_size" $MNT/foo \
> /dev/null
# File bar has much less extents and has exactly the same data as foo.
xfs_io -f -c "pwrite -S 0xab 0 $file_size" $MNT/bar > /dev/null
expected=$(stat -c %b $MNT/foo)
# Now deduplicate bar into foo. While the deduplication is in progres,
# the number of used blocks/file size reported by stat should not change
xfs_io -c "dedupe $MNT/bar 0 0 $file_size" $MNT/foo > /dev/null &
dedupe_pid=$!
while [ -n "$(ps -p $dedupe_pid -o pid=)" ]; do
used=$(stat -c %b $MNT/foo)
if [ $used -ne $expected ]; then
echo "Unexpected blocks used: $used (expected: $expected)"
fi
done
umount $DEV
$ ./reproducer-3.sh
Unexpected blocks used: 2076800 (expected: 2097152)
Unexpected blocks used: 2097024 (expected: 2097152)
Unexpected blocks used: 2079872 (expected: 2097152)
(...)
Note that since this is a short time window where the race can happen, the
reproducer may not be able to always trigger the bug in one run, or it may
trigger it multiple times.
So fix this by:
1) Making btrfs_drop_extents() not decrement the VFS inode's number of
bytes, and instead return the number of bytes;
2) Making any code that drops extents and adds new extents update the
inode's number of bytes atomically, while holding the btrfs inode's
spinlock, which is also used by the stat(2) callback to get the inode's
number of bytes;
3) For ranges in the inode's iotree that are marked as 'delalloc new',
corresponding to previously unallocated ranges, increment the inode's
number of bytes when clearing the 'delalloc new' bit from the range,
in the same critical section that decrements the inode's
'new_delalloc_bytes' counter, delimited by the btrfs inode's spinlock.
An alternative would be to have btrfs_getattr() wait for any IO (ordered
extents in progress) and locking the whole range (0 to (u64)-1) while it
it computes the number of blocks used. But that would mean blocking
stat(2), which is a very used syscall and expected to be fast, waiting
for writes, clone/dedupe, fallocate, page reads, fiemap, etc.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-04 19:07:34 +08:00
|
|
|
void btrfs_update_inode_bytes(struct btrfs_inode *inode,
|
|
|
|
const u64 add_bytes,
|
|
|
|
const u64 del_bytes);
|
2022-03-15 23:22:41 +08:00
|
|
|
void btrfs_assert_inode_range_clean(struct btrfs_inode *inode, u64 start, u64 end);
|
2008-06-12 09:53:53 +08:00
|
|
|
|
|
|
|
/* ioctl.c */
|
|
|
|
long btrfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
|
2015-10-29 16:22:21 +08:00
|
|
|
long btrfs_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
|
2021-04-07 20:36:43 +08:00
|
|
|
int btrfs_fileattr_get(struct dentry *dentry, struct fileattr *fa);
|
|
|
|
int btrfs_fileattr_set(struct user_namespace *mnt_userns,
|
|
|
|
struct dentry *dentry, struct fileattr *fa);
|
2016-02-17 22:26:27 +08:00
|
|
|
int btrfs_ioctl_get_supported_features(void __user *arg);
|
2018-03-27 00:40:21 +08:00
|
|
|
void btrfs_sync_inode_flags_to_i_flags(struct inode *inode);
|
2019-10-02 01:57:39 +08:00
|
|
|
int __pure btrfs_is_empty_uuid(u8 *uuid);
|
2021-08-06 16:12:32 +08:00
|
|
|
int btrfs_defrag_file(struct inode *inode, struct file_ra_state *ra,
|
2011-05-25 03:35:30 +08:00
|
|
|
struct btrfs_ioctl_defrag_range_args *range,
|
2021-08-06 16:12:32 +08:00
|
|
|
u64 newer_than, unsigned long max_to_defrag);
|
2018-03-21 09:05:27 +08:00
|
|
|
void btrfs_get_block_group_info(struct list_head *groups_list,
|
|
|
|
struct btrfs_ioctl_space_info *space);
|
|
|
|
void btrfs_update_ioctl_balance_args(struct btrfs_fs_info *fs_info,
|
2013-08-15 00:12:25 +08:00
|
|
|
struct btrfs_ioctl_balance_args *bargs);
|
|
|
|
|
2007-06-12 18:35:45 +08:00
|
|
|
/* file.c */
|
2017-11-03 07:21:50 +08:00
|
|
|
int __init btrfs_auto_defrag_init(void);
|
2018-02-20 00:24:18 +08:00
|
|
|
void __cold btrfs_auto_defrag_exit(void);
|
2011-05-25 03:35:30 +08:00
|
|
|
int btrfs_add_inode_defrag(struct btrfs_trans_handle *trans,
|
2022-02-13 15:42:33 +08:00
|
|
|
struct btrfs_inode *inode, u32 extent_thresh);
|
2011-05-25 03:35:30 +08:00
|
|
|
int btrfs_run_defrag_inodes(struct btrfs_fs_info *fs_info);
|
2012-11-26 17:26:20 +08:00
|
|
|
void btrfs_cleanup_defrag_inodes(struct btrfs_fs_info *fs_info);
|
2011-07-17 08:44:56 +08:00
|
|
|
int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync);
|
2009-10-02 06:43:56 +08:00
|
|
|
extern const struct file_operations btrfs_file_operations;
|
Btrfs: turbo charge fsync
At least for the vm workload. Currently on fsync we will
1) Truncate all items in the log tree for the given inode if they exist
and
2) Copy all items for a given inode into the log
The problem with this is that for things like VMs you can have lots of
extents from the fragmented writing behavior, and worst yet you may have
only modified a few extents, not the entire thing. This patch fixes this
problem by tracking which transid modified our extent, and then when we do
the tree logging we find all of the extents we've modified in our current
transaction, sort them and commit them. We also only truncate up to the
xattrs of the inode and copy that stuff in normally, and then just drop any
extents in the range we have that exist in the log already. Here are some
numbers of a 50 meg fio job that does random writes and fsync()s after every
write
Original Patched
SATA drive 82KB/s 140KB/s
Fusion drive 431KB/s 2532KB/s
So around 2-6 times faster depending on your hardware. There are a few
corner cases, for example if you truncate at all we have to do it the old
way since there is no way to be sure what is in the log is ok. This
probably could be done smarter, but if you write-fsync-truncate-write-fsync
you deserve what you get. All this work is in RAM of course so if your
inode gets evicted from cache and you read it in and fsync it we'll do it
the slow way if we are still in the same transaction that we last modified
the inode in.
The biggest cool part of this is that it requires no changes to the recovery
code, so if you fsync with this patch and crash and load an old kernel, it
will run the recovery and be a-ok. I have tested this pretty thoroughly
with an fsync tester and everything comes back fine, as well as xfstests.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-08-18 01:14:17 +08:00
|
|
|
int btrfs_drop_extents(struct btrfs_trans_handle *trans,
|
2020-11-04 19:07:32 +08:00
|
|
|
struct btrfs_root *root, struct btrfs_inode *inode,
|
|
|
|
struct btrfs_drop_extents_args *args);
|
2021-02-17 21:12:47 +08:00
|
|
|
int btrfs_replace_file_extents(struct btrfs_inode *inode,
|
|
|
|
struct btrfs_path *path, const u64 start,
|
|
|
|
const u64 end,
|
2020-09-08 18:27:22 +08:00
|
|
|
struct btrfs_replace_extent_info *extent_info,
|
Btrfs: fix ENOSPC errors, leading to transaction aborts, when cloning extents
When cloning extents (or deduplicating) we create a transaction with a
space reservation that considers we will drop or update a single file
extent item of the destination inode (that we modify a single leaf). That
is fine for the vast majority of scenarios, however it might happen that
we need to drop many file extent items, and adjust at most two file extent
items, in the destination root, which can span multiple leafs. This will
lead to either the call to btrfs_drop_extents() to fail with ENOSPC or
the subsequent calls to btrfs_insert_empty_item() or btrfs_update_inode()
(called through clone_finish_inode_update()) to fail with ENOSPC. Such
failure results in a transaction abort, leaving the filesystem in a
read-only mode.
In order to fix this we need to follow the same approach as the hole
punching code, where we create a local reservation with 1 unit and keep
ending and starting transactions, after balancing the btree inode,
when __btrfs_drop_extents() returns ENOSPC. So fix this by making the
extent cloning call calls the recently added btrfs_punch_hole_range()
helper, which is what does the mentioned work for hole punching, and
make sure whenever we drop extent items in a transaction, we also add a
replacing file extent item, to avoid corruption (a hole) if after ending
a transaction and before starting a new one, the old transaction gets
committed and a power failure happens before we finish cloning.
A test case for fstests follows soon.
Reported-by: David Goodwin <david@codepoets.co.uk>
Link: https://lore.kernel.org/linux-btrfs/a4a4cf31-9cf4-e52c-1f86-c62d336c9cd1@codepoets.co.uk/
Reported-by: Sam Tygier <sam@tygier.co.uk>
Link: https://lore.kernel.org/linux-btrfs/82aace9f-a1e3-1f0b-055f-3ea75f7a41a0@tygier.co.uk/
Fixes: b6f3409b2197e8f ("Btrfs: reserve sufficient space for ioctl clone")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-05 18:09:50 +08:00
|
|
|
struct btrfs_trans_handle **trans_out);
|
2008-10-31 02:25:28 +08:00
|
|
|
int btrfs_mark_extent_written(struct btrfs_trans_handle *trans,
|
2017-02-20 19:50:48 +08:00
|
|
|
struct btrfs_inode *inode, u64 start, u64 end);
|
2019-08-14 07:00:02 +08:00
|
|
|
ssize_t btrfs_do_write_iter(struct kiocb *iocb, struct iov_iter *from,
|
|
|
|
const struct btrfs_ioctl_encoded_io_args *encoded);
|
2008-06-10 22:07:39 +08:00
|
|
|
int btrfs_release_file(struct inode *inode, struct file *file);
|
2020-06-03 13:55:36 +08:00
|
|
|
int btrfs_dirty_pages(struct btrfs_inode *inode, struct page **pages,
|
2016-06-23 06:54:24 +08:00
|
|
|
size_t num_pages, loff_t pos, size_t write_bytes,
|
2020-10-14 22:55:45 +08:00
|
|
|
struct extent_state **cached, bool noreserve);
|
2014-10-10 16:43:11 +08:00
|
|
|
int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
|
2020-06-24 07:23:52 +08:00
|
|
|
int btrfs_check_nocow_lock(struct btrfs_inode *inode, loff_t pos,
|
2022-09-13 03:27:46 +08:00
|
|
|
size_t *write_bytes, bool nowait);
|
2020-06-24 07:23:52 +08:00
|
|
|
void btrfs_check_nocow_unlock(struct btrfs_inode *inode);
|
btrfs: make fiemap more efficient and accurate reporting extent sharedness
The current fiemap implementation does not scale very well with the number
of extents a file has. This is both because the main algorithm to find out
the extents has a high algorithmic complexity and because for each extent
we have to check if it's shared. This second part, checking if an extent
is shared, is significantly improved by the two previous patches in this
patchset, while the first part is improved by this specific patch. Every
now and then we get reports from users mentioning fiemap is too slow or
even unusable for files with a very large number of extents, such as the
two recent reports referred to by the Link tags at the bottom of this
change log.
To understand why the part of finding which extents a file has is very
inefficient, consider the example of doing a full ranged fiemap against
a file that has over 100K extents (normal for example for a file with
more than 10G of data and using compression, which limits the extent size
to 128K). When we enter fiemap at extent_fiemap(), the following happens:
1) Before entering the main loop, we call get_extent_skip_holes() to get
the first extent map. This leads us to btrfs_get_extent_fiemap(), which
in turn calls btrfs_get_extent(), to find the first extent map that
covers the file range [0, LLONG_MAX).
btrfs_get_extent() will first search the inode's extent map tree, to
see if we have an extent map there that covers the range. If it does
not find one, then it will search the inode's subvolume b+tree for a
fitting file extent item. After finding the file extent item, it will
allocate an extent map, fill it in with information extracted from the
file extent item, and add it to the inode's extent map tree (which
requires a search for insertion in the tree).
2) Then we enter the main loop at extent_fiemap(), emit the details of
the extent, and call again get_extent_skip_holes(), with a start
offset matching the end of the extent map we previously processed.
We end up at btrfs_get_extent() again, will search the extent map tree
and then search the subvolume b+tree for a file extent item if we could
not find an extent map in the extent tree. We allocate an extent map,
fill it in with the details in the file extent item, and then insert
it into the extent map tree (yet another search in this tree).
3) The second step is repeated over and over, until we have processed the
whole file range. Each iteration ends at btrfs_get_extent(), which
does a red black tree search on the extent map tree, then searches the
subvolume b+tree, allocates an extent map and then does another search
in the extent map tree in order to insert the extent map.
In the best scenario we have all the extent maps already in the extent
tree, and so for each extent we do a single search on a red black tree,
so we have a complexity of O(n log n).
In the worst scenario we don't have any extent map already loaded in
the extent map tree, or have very few already there. In this case the
complexity is much higher since we do:
- A red black tree search on the extent map tree, which has O(log n)
complexity, initially very fast since the tree is empty or very
small, but as we end up allocating extent maps and adding them to
the tree when we don't find them there, each subsequent search on
the tree gets slower, since it's getting bigger and bigger after
each iteration.
- A search on the subvolume b+tree, also O(log n) complexity, but it
has items for all inodes in the subvolume, not just items for our
inode. Plus on a filesystem with concurrent operations on other
inodes, we can block doing the search due to lock contention on
b+tree nodes/leaves.
- Allocate an extent map - this can block, and can also fail if we
are under serious memory pressure.
- Do another search on the extent maps red black tree, with the goal
of inserting the extent map we just allocated. Again, after every
iteration this tree is getting bigger by 1 element, so after many
iterations the searches are slower and slower.
- We will not need the allocated extent map anymore, so it's pointless
to add it to the extent map tree. It's just wasting time and memory.
In short we end up searching the extent map tree multiple times, on a
tree that is growing bigger and bigger after each iteration. And
besides that we visit the same leaf of the subvolume b+tree many times,
since a leaf with the default size of 16K can easily have more than 200
file extent items.
This is very inefficient overall. This patch changes the algorithm to
instead iterate over the subvolume b+tree, visiting each leaf only once,
and only searching in the extent map tree for file ranges that have holes
or prealloc extents, in order to figure out if we have delalloc there.
It will never allocate an extent map and add it to the extent map tree.
This is very similar to what was previously done for the lseek's hole and
data seeking features.
Also, the current implementation relying on extent maps for figuring out
which extents we have is not correct. This is because extent maps can be
merged even if they represent different extents - we do this to minimize
memory utilization and keep extent map trees smaller. For example if we
have two extents that are contiguous on disk, once we load the two extent
maps, they get merged into a single one - however if only one of the
extents is shared, we end up reporting both as shared or both as not
shared, which is incorrect.
This reproducer triggers that bug:
$ cat fiemap-bug.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
mkfs.btrfs -f $DEV
mount $DEV $MNT
# Create a file with two 256K extents.
# Since there is no other write activity, they will be contiguous,
# and their extent maps merged, despite having two distinct extents.
xfs_io -f -c "pwrite -S 0xab 0 256K" \
-c "fsync" \
-c "pwrite -S 0xcd 256K 256K" \
-c "fsync" \
$MNT/foo
# Now clone only the second extent into another file.
xfs_io -f -c "reflink $MNT/foo 256K 0 256K" $MNT/bar
# Filefrag will report a single 512K extent, and say it's not shared.
echo
filefrag -v $MNT/foo
umount $MNT
Running the reproducer:
$ ./fiemap-bug.sh
wrote 262144/262144 bytes at offset 0
256 KiB, 64 ops; 0.0038 sec (65.479 MiB/sec and 16762.7030 ops/sec)
wrote 262144/262144 bytes at offset 262144
256 KiB, 64 ops; 0.0040 sec (61.125 MiB/sec and 15647.9218 ops/sec)
linked 262144/262144 bytes at offset 0
256 KiB, 1 ops; 0.0002 sec (1.034 GiB/sec and 4237.2881 ops/sec)
Filesystem type is: 9123683e
File size of /mnt/sdj/foo is 524288 (128 blocks of 4096 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 127: 3328.. 3455: 128: last,eof
/mnt/sdj/foo: 1 extent found
We end up reporting that we have a single 512K that is not shared, however
we have two 256K extents, and the second one is shared. Changing the
reproducer to clone instead the first extent into file 'bar', makes us
report a single 512K extent that is shared, which is algo incorrect since
we have two 256K extents and only the first one is shared.
This patch is part of a larger patchset that is comprised of the following
patches:
btrfs: allow hole and data seeking to be interruptible
btrfs: make hole and data seeking a lot more efficient
btrfs: remove check for impossible block start for an extent map at fiemap
btrfs: remove zero length check when entering fiemap
btrfs: properly flush delalloc when entering fiemap
btrfs: allow fiemap to be interruptible
btrfs: rename btrfs_check_shared() to a more descriptive name
btrfs: speedup checking for extent sharedness during fiemap
btrfs: skip unnecessary extent buffer sharedness checks during fiemap
btrfs: make fiemap more efficient and accurate reporting extent sharedness
The patchset was tested on a machine running a non-debug kernel (Debian's
default config) and compared the tests below on a branch without the
patchset versus the same branch with the whole patchset applied.
The following test for a large compressed file without holes:
$ cat fiemap-perf-test.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
mkfs.btrfs -f $DEV
mount -o compress=lzo $DEV $MNT
# 40G gives 327680 128K file extents (due to compression).
xfs_io -f -c "pwrite -S 0xab -b 1M 0 20G" $MNT/foobar
umount $MNT
mount -o compress=lzo $DEV $MNT
start=$(date +%s%N)
filefrag $MNT/foobar
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "fiemap took $dur milliseconds (metadata not cached)"
start=$(date +%s%N)
filefrag $MNT/foobar
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "fiemap took $dur milliseconds (metadata cached)"
umount $MNT
Before patchset:
$ ./fiemap-perf-test.sh
(...)
/mnt/sdi/foobar: 327680 extents found
fiemap took 3597 milliseconds (metadata not cached)
/mnt/sdi/foobar: 327680 extents found
fiemap took 2107 milliseconds (metadata cached)
After patchset:
$ ./fiemap-perf-test.sh
(...)
/mnt/sdi/foobar: 327680 extents found
fiemap took 1214 milliseconds (metadata not cached)
/mnt/sdi/foobar: 327680 extents found
fiemap took 684 milliseconds (metadata cached)
That's a speedup of about 3x for both cases (no metadata cached and all
metadata cached).
The test provided by Pavel (first Link tag at the bottom), which uses
files with a large number of holes, was also used to measure the gains,
and it consists on a small C program and a shell script to invoke it.
The C program is the following:
$ cat pavels-test.c
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>
#define FILE_INTERVAL (1<<13) /* 8Kb */
long long interval(struct timeval t1, struct timeval t2)
{
long long val = 0;
val += (t2.tv_usec - t1.tv_usec);
val += (t2.tv_sec - t1.tv_sec) * 1000 * 1000;
return val;
}
int main(int argc, char **argv)
{
struct fiemap fiemap = {};
struct timeval t1, t2;
char data = 'a';
struct stat st;
int fd, off, file_size = FILE_INTERVAL;
if (argc != 3 && argc != 2) {
printf("usage: %s <path> [size]\n", argv[0]);
return 1;
}
if (argc == 3)
file_size = atoi(argv[2]);
if (file_size < FILE_INTERVAL)
file_size = FILE_INTERVAL;
file_size -= file_size % FILE_INTERVAL;
fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0644);
if (fd < 0) {
perror("open");
return 1;
}
for (off = 0; off < file_size; off += FILE_INTERVAL) {
if (pwrite(fd, &data, 1, off) != 1) {
perror("pwrite");
close(fd);
return 1;
}
}
if (ftruncate(fd, file_size)) {
perror("ftruncate");
close(fd);
return 1;
}
if (fstat(fd, &st) < 0) {
perror("fstat");
close(fd);
return 1;
}
printf("size: %ld\n", st.st_size);
printf("actual size: %ld\n", st.st_blocks * 512);
fiemap.fm_length = FIEMAP_MAX_OFFSET;
gettimeofday(&t1, NULL);
if (ioctl(fd, FS_IOC_FIEMAP, &fiemap) < 0) {
perror("fiemap");
close(fd);
return 1;
}
gettimeofday(&t2, NULL);
printf("fiemap: fm_mapped_extents = %d\n",
fiemap.fm_mapped_extents);
printf("time = %lld us\n", interval(t1, t2));
close(fd);
return 0;
}
$ gcc -o pavels_test pavels_test.c
And the wrapper shell script:
$ cat fiemap-pavels-test.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
mkfs.btrfs -f -O no-holes $DEV
mount $DEV $MNT
echo
echo "*********** 256M ***********"
echo
./pavels-test $MNT/testfile $((1 << 28))
echo
./pavels-test $MNT/testfile $((1 << 28))
echo
echo "*********** 512M ***********"
echo
./pavels-test $MNT/testfile $((1 << 29))
echo
./pavels-test $MNT/testfile $((1 << 29))
echo
echo "*********** 1G ***********"
echo
./pavels-test $MNT/testfile $((1 << 30))
echo
./pavels-test $MNT/testfile $((1 << 30))
umount $MNT
Running his reproducer before applying the patchset:
*********** 256M ***********
size: 268435456
actual size: 134217728
fiemap: fm_mapped_extents = 32768
time = 4003133 us
size: 268435456
actual size: 134217728
fiemap: fm_mapped_extents = 32768
time = 4895330 us
*********** 512M ***********
size: 536870912
actual size: 268435456
fiemap: fm_mapped_extents = 65536
time = 30123675 us
size: 536870912
actual size: 268435456
fiemap: fm_mapped_extents = 65536
time = 33450934 us
*********** 1G ***********
size: 1073741824
actual size: 536870912
fiemap: fm_mapped_extents = 131072
time = 224924074 us
size: 1073741824
actual size: 536870912
fiemap: fm_mapped_extents = 131072
time = 217239242 us
Running it after applying the patchset:
*********** 256M ***********
size: 268435456
actual size: 134217728
fiemap: fm_mapped_extents = 32768
time = 29475 us
size: 268435456
actual size: 134217728
fiemap: fm_mapped_extents = 32768
time = 29307 us
*********** 512M ***********
size: 536870912
actual size: 268435456
fiemap: fm_mapped_extents = 65536
time = 58996 us
size: 536870912
actual size: 268435456
fiemap: fm_mapped_extents = 65536
time = 59115 us
*********** 1G ***********
size: 1073741824
actual size: 536870912
fiemap: fm_mapped_extents = 116251
time = 124141 us
size: 1073741824
actual size: 536870912
fiemap: fm_mapped_extents = 131072
time = 119387 us
The speedup is massive, both on the first fiemap call and on the second
one as well, as his test creates files with many holes and small extents
(every extent follows a hole and precedes another hole).
For the 256M file we go from 4 seconds down to 29 milliseconds in the
first run, and then from 4.9 seconds down to 29 milliseconds again in the
second run, a speedup of 138x and 169x, respectively.
For the 512M file we go from 30.1 seconds down to 59 milliseconds in the
first run, and then from 33.5 seconds down to 59 milliseconds again in the
second run, a speedup of 510x and 568x, respectively.
For the 1G file, we go from 225 seconds down to 124 milliseconds in the
first run, and then from 217 seconds down to 119 milliseconds in the
second run, a speedup of 1815x and 1824x, respectively.
Reported-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
Reported-by: Dominique MARTINET <dominique.martinet@atmark-techno.com>
Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-01 21:18:30 +08:00
|
|
|
bool btrfs_find_delalloc_in_range(struct btrfs_inode *inode, u64 start, u64 end,
|
|
|
|
u64 *delalloc_start_ret, u64 *delalloc_end_ret);
|
2008-06-10 22:07:39 +08:00
|
|
|
|
2007-08-08 04:15:09 +08:00
|
|
|
/* tree-defrag.c */
|
|
|
|
int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
|
2013-02-01 02:21:12 +08:00
|
|
|
struct btrfs_root *root);
|
2007-08-30 03:47:34 +08:00
|
|
|
|
2007-12-22 05:27:24 +08:00
|
|
|
/* super.c */
|
2016-06-23 06:54:24 +08:00
|
|
|
int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
|
2016-01-19 10:23:03 +08:00
|
|
|
unsigned long new_flags);
|
2008-06-10 22:07:39 +08:00
|
|
|
int btrfs_sync_fs(struct super_block *sb, int wait);
|
2020-02-21 21:56:12 +08:00
|
|
|
char *btrfs_get_subvol_name_from_objectid(struct btrfs_fs_info *fs_info,
|
|
|
|
u64 subvol_objectid);
|
2012-07-31 05:40:13 +08:00
|
|
|
|
2021-02-25 09:18:14 +08:00
|
|
|
#if BITS_PER_LONG == 32
|
|
|
|
#define BTRFS_32BIT_MAX_FILE_SIZE (((u64)ULONG_MAX + 1) << PAGE_SHIFT)
|
|
|
|
/*
|
|
|
|
* The warning threshold is 5/8th of the MAX_LFS_FILESIZE that limits the logical
|
|
|
|
* addresses of extents.
|
|
|
|
*
|
|
|
|
* For 4K page size it's about 10T, for 64K it's 160T.
|
|
|
|
*/
|
|
|
|
#define BTRFS_32BIT_EARLY_WARN_THRESHOLD (BTRFS_32BIT_MAX_FILE_SIZE * 5 / 8)
|
|
|
|
void btrfs_warn_32bit_limit(struct btrfs_fs_info *fs_info);
|
|
|
|
void btrfs_err_32bit_limit(struct btrfs_fs_info *fs_info);
|
|
|
|
#endif
|
|
|
|
|
2020-12-02 14:48:04 +08:00
|
|
|
/*
|
|
|
|
* Get the correct offset inside the page of extent buffer.
|
|
|
|
*
|
|
|
|
* @eb: target extent buffer
|
|
|
|
* @start: offset inside the extent buffer
|
|
|
|
*
|
|
|
|
* Will handle both sectorsize == PAGE_SIZE and sectorsize < PAGE_SIZE cases.
|
|
|
|
*/
|
|
|
|
static inline size_t get_eb_offset_in_page(const struct extent_buffer *eb,
|
|
|
|
unsigned long offset)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* For sectorsize == PAGE_SIZE case, eb->start will always be aligned
|
|
|
|
* to PAGE_SIZE, thus adding it won't cause any difference.
|
|
|
|
*
|
|
|
|
* For sectorsize < PAGE_SIZE, we must only read the data that belongs
|
|
|
|
* to the eb, thus we have to take the eb->start into consideration.
|
|
|
|
*/
|
|
|
|
return offset_in_page(offset + eb->start);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline unsigned long get_eb_page_index(unsigned long offset)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* For sectorsize == PAGE_SIZE case, plain >> PAGE_SHIFT is enough.
|
|
|
|
*
|
|
|
|
* For sectorsize < PAGE_SIZE case, we only support 64K PAGE_SIZE,
|
|
|
|
* and have ensured that all tree blocks are contained in one page,
|
|
|
|
* thus we always get index == 0.
|
|
|
|
*/
|
|
|
|
return offset >> PAGE_SHIFT;
|
|
|
|
}
|
|
|
|
|
2018-11-19 17:38:16 +08:00
|
|
|
/*
|
|
|
|
* Use that for functions that are conditionally exported for sanity tests but
|
|
|
|
* otherwise static
|
|
|
|
*/
|
|
|
|
#ifndef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
|
|
|
|
#define EXPORT_FOR_TESTS static
|
|
|
|
#else
|
|
|
|
#define EXPORT_FOR_TESTS
|
|
|
|
#endif
|
|
|
|
|
2008-07-25 00:16:36 +08:00
|
|
|
/* acl.c */
|
2009-10-14 01:50:18 +08:00
|
|
|
#ifdef CONFIG_BTRFS_FS_POSIX_ACL
|
2021-08-19 04:08:24 +08:00
|
|
|
struct posix_acl *btrfs_get_acl(struct inode *inode, int type, bool rcu);
|
2021-01-21 21:19:43 +08:00
|
|
|
int btrfs_set_acl(struct user_namespace *mnt_userns, struct inode *inode,
|
|
|
|
struct posix_acl *acl, int type);
|
2022-03-15 09:12:34 +08:00
|
|
|
int __btrfs_set_acl(struct btrfs_trans_handle *trans, struct inode *inode,
|
|
|
|
struct posix_acl *acl, int type);
|
2011-07-14 11:17:39 +08:00
|
|
|
#else
|
2011-08-03 15:14:05 +08:00
|
|
|
#define btrfs_get_acl NULL
|
2013-12-20 21:16:43 +08:00
|
|
|
#define btrfs_set_acl NULL
|
2022-03-15 09:12:34 +08:00
|
|
|
static inline int __btrfs_set_acl(struct btrfs_trans_handle *trans,
|
|
|
|
struct inode *inode, struct posix_acl *acl,
|
|
|
|
int type)
|
2011-07-14 11:17:39 +08:00
|
|
|
{
|
2022-03-15 09:12:34 +08:00
|
|
|
return -EOPNOTSUPP;
|
2011-07-14 11:17:39 +08:00
|
|
|
}
|
|
|
|
#endif
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
/* relocation.c */
|
2016-06-22 09:16:51 +08:00
|
|
|
int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
int btrfs_init_reloc_root(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root);
|
|
|
|
int btrfs_update_reloc_root(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root);
|
2022-02-19 03:56:12 +08:00
|
|
|
int btrfs_recover_relocation(struct btrfs_fs_info *fs_info);
|
2020-06-03 13:55:04 +08:00
|
|
|
int btrfs_reloc_clone_csums(struct btrfs_inode *inode, u64 file_pos, u64 len);
|
2013-08-31 03:09:51 +08:00
|
|
|
int btrfs_reloc_cow_block(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root, struct extent_buffer *buf,
|
|
|
|
struct extent_buffer *cow);
|
2015-08-06 20:58:11 +08:00
|
|
|
void btrfs_reloc_pre_snapshot(struct btrfs_pending_snapshot *pending,
|
2010-05-16 22:49:59 +08:00
|
|
|
u64 *bytes_to_reserve);
|
2012-03-02 00:24:58 +08:00
|
|
|
int btrfs_reloc_post_snapshot(struct btrfs_trans_handle *trans,
|
2010-05-16 22:49:59 +08:00
|
|
|
struct btrfs_pending_snapshot *pending);
|
2020-02-17 14:16:52 +08:00
|
|
|
int btrfs_should_cancel_balance(struct btrfs_fs_info *fs_info);
|
2020-03-06 14:04:12 +08:00
|
|
|
struct btrfs_root *find_reloc_root(struct btrfs_fs_info *fs_info,
|
|
|
|
u64 bytenr);
|
2020-03-03 14:26:02 +08:00
|
|
|
int btrfs_should_ignore_reloc_root(struct btrfs_root *root);
|
2011-03-08 21:14:00 +08:00
|
|
|
|
|
|
|
/* scrub.c */
|
2012-11-06 00:03:39 +08:00
|
|
|
int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
|
|
|
|
u64 end, struct btrfs_scrub_progress *progress,
|
2012-11-06 01:29:28 +08:00
|
|
|
int readonly, int is_dev_replace);
|
2016-06-23 06:54:24 +08:00
|
|
|
void btrfs_scrub_pause(struct btrfs_fs_info *fs_info);
|
|
|
|
void btrfs_scrub_continue(struct btrfs_fs_info *fs_info);
|
2012-11-06 00:03:39 +08:00
|
|
|
int btrfs_scrub_cancel(struct btrfs_fs_info *info);
|
2019-03-20 23:32:55 +08:00
|
|
|
int btrfs_scrub_cancel_dev(struct btrfs_device *dev);
|
2016-06-23 06:54:24 +08:00
|
|
|
int btrfs_scrub_progress(struct btrfs_fs_info *fs_info, u64 devid,
|
2011-03-08 21:14:00 +08:00
|
|
|
struct btrfs_scrub_progress *progress);
|
Btrfs: fix use-after-free in the finishing procedure of the device replace
During device replace test, we hit a null pointer deference (It was very easy
to reproduce it by running xfstests' btrfs/011 on the devices with the virtio
scsi driver). There were two bugs that caused this problem:
- We might allocate new chunks on the replaced device after we updated
the mapping tree. And we forgot to replace the source device in those
mapping of the new chunks.
- We might get the mapping information which including the source device
before the mapping information update. And then submit the bio which was
based on that mapping information after we freed the source device.
For the first bug, we can fix it by doing mapping tree update and source
device remove in the same context of the chunk mutex. The chunk mutex is
used to protect the allocable device list, the above method can avoid
the new chunk allocation, and after we remove the source device, all
the new chunks will be allocated on the new device. So it can fix
the first bug.
For the second bug, we need make sure all flighting bios are finished and
no new bios are produced during we are removing the source device. To fix
this problem, we introduced a global @bio_counter, we not only inc/dec
@bio_counter outsize of map_blocks, but also inc it before submitting bio
and dec @bio_counter when ending bios.
Since Raid56 is a little different and device replace dosen't support raid56
yet, it is not addressed in the patch and I add comments to make sure we will
fix it in the future.
Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-01-30 16:46:55 +08:00
|
|
|
|
|
|
|
/* dev-replace.c */
|
|
|
|
void btrfs_bio_counter_inc_blocked(struct btrfs_fs_info *fs_info);
|
2014-11-25 16:39:28 +08:00
|
|
|
void btrfs_bio_counter_sub(struct btrfs_fs_info *fs_info, s64 amount);
|
|
|
|
|
|
|
|
static inline void btrfs_bio_counter_dec(struct btrfs_fs_info *fs_info)
|
|
|
|
{
|
|
|
|
btrfs_bio_counter_sub(fs_info, 1);
|
|
|
|
}
|
2011-03-08 21:14:00 +08:00
|
|
|
|
2012-05-29 23:06:54 +08:00
|
|
|
static inline int is_fstree(u64 rootid)
|
|
|
|
{
|
|
|
|
if (rootid == BTRFS_FS_TREE_OBJECTID ||
|
2015-02-27 16:24:23 +08:00
|
|
|
((s64)rootid >= (s64)BTRFS_FIRST_FREE_OBJECTID &&
|
|
|
|
!btrfs_qgroup_level(rootid)))
|
2012-05-29 23:06:54 +08:00
|
|
|
return 1;
|
|
|
|
return 0;
|
|
|
|
}
|
2013-02-10 07:38:06 +08:00
|
|
|
|
|
|
|
static inline int btrfs_defrag_cancelled(struct btrfs_fs_info *fs_info)
|
|
|
|
{
|
|
|
|
return signal_pending(current);
|
|
|
|
}
|
|
|
|
|
2021-07-01 04:01:49 +08:00
|
|
|
/* verity.c */
|
|
|
|
#ifdef CONFIG_FS_VERITY
|
|
|
|
|
|
|
|
extern const struct fsverity_operations btrfs_verityops;
|
|
|
|
int btrfs_drop_verity_items(struct btrfs_inode *inode);
|
2022-08-16 04:54:28 +08:00
|
|
|
int btrfs_get_verity_descriptor(struct inode *inode, void *buf, size_t buf_size);
|
2021-07-01 04:01:49 +08:00
|
|
|
|
|
|
|
#else
|
|
|
|
|
|
|
|
static inline int btrfs_drop_verity_items(struct btrfs_inode *inode)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2022-08-16 04:54:28 +08:00
|
|
|
static inline int btrfs_get_verity_descriptor(struct inode *inode, void *buf,
|
|
|
|
size_t buf_size)
|
|
|
|
{
|
|
|
|
return -EPERM;
|
|
|
|
}
|
|
|
|
|
2021-07-01 04:01:49 +08:00
|
|
|
#endif
|
|
|
|
|
2013-10-12 02:44:09 +08:00
|
|
|
/* Sanity test specific functions */
|
|
|
|
#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
|
|
|
|
void btrfs_test_destroy_inode(struct inode *inode);
|
2018-08-17 23:48:13 +08:00
|
|
|
#endif
|
2018-04-04 01:16:55 +08:00
|
|
|
|
2021-09-09 00:19:25 +08:00
|
|
|
static inline bool btrfs_is_data_reloc_root(const struct btrfs_root *root)
|
|
|
|
{
|
|
|
|
return root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID;
|
|
|
|
}
|
|
|
|
|
2021-04-07 19:22:13 +08:00
|
|
|
/*
|
|
|
|
* We use page status Private2 to indicate there is an ordered extent with
|
|
|
|
* unfinished IO.
|
|
|
|
*
|
|
|
|
* Rename the Private2 accessors to Ordered, to improve readability.
|
|
|
|
*/
|
|
|
|
#define PageOrdered(page) PagePrivate2(page)
|
|
|
|
#define SetPageOrdered(page) SetPagePrivate2(page)
|
|
|
|
#define ClearPageOrdered(page) ClearPagePrivate2(page)
|
2022-02-10 04:21:39 +08:00
|
|
|
#define folio_test_ordered(folio) folio_test_private_2(folio)
|
|
|
|
#define folio_set_ordered(folio) folio_set_private_2(folio)
|
|
|
|
#define folio_clear_ordered(folio) folio_clear_private_2(folio)
|
2021-04-07 19:22:13 +08:00
|
|
|
|
2007-02-02 22:18:22 +08:00
|
|
|
#endif
|