A later patch will implement swap file support for Btrfs, but before we
do that, we need to make sure that the various Btrfs ioctls cannot
change a swap file.
When a swap file is active, we must make sure that the extents of the
file are not moved and that they don't become shared. That means that
the following are not safe:
- chattr +c (enable compression)
- reflink
- dedupe
- snapshot
- defrag
Don't allow those to happen on an active swap file.
Additionally, balance, resize, device remove, and device replace are
also unsafe if they affect an active swapfile. Add a red-black tree of
block groups and devices which contain an active swapfile. Relocation
checks each block group against this tree and skips it or errors out for
balance or resize, respectively. Device remove and device replace check
the tree for the device they will operate on.
Note that we don't have to worry about chattr -C (disable nocow), which
we ignore for non-empty files, because an active swapfile must be
non-empty and can't be truncated. We also don't have to worry about
autodefrag because it's only done on COW files. Truncate and fallocate
are already taken care of by the generic code. Device add doesn't do
relocation so it's not an issue, either.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function relocate_block_group calls btrfs_end_transaction to release
trans when update_backref_cache returns 1, and then continues the loop
body. If btrfs_block_rsv_refill fails this time, it will jump out the
loop and the freed trans will be accessed. This may result in a
use-after-free bug. The patch assigns NULL to trans after trans is
released so that it will not be accessed.
Fixes: 0647bf564f ("Btrfs: improve forever loop when doing balance relocation")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Pan Bian <bianpan2016@163.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 581c176041 ("btrfs: Validate child tree block's level and first
key") has made tree block level check mandatory.
So if tree block level doesn't match, we won't get a valid extent
buffer. The extra WARN_ON() check can be removed completely.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
And add one line comment explaining what we're doing for each loop.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For qgroup_trace_extent_swap(), if we find one leaf that needs to be
traced, we will also iterate all file extents and trace them.
This is OK if we're relocating data block groups, but if we're
relocating metadata block groups, balance code itself has ensured that
both subtree of file tree and reloc tree contain the same contents.
That's to say, if we're relocating metadata block groups, all file
extents in reloc and file tree should match, thus no need to trace them.
This should reduce the total number of dirty extents processed in metadata
block group balance.
[[Benchmark]] (with all previous enhancement)
Hardware:
VM 4G vRAM, 8 vCPUs,
disk is using 'unsafe' cache mode,
backing device is SAMSUNG 850 evo SSD.
Host has 16G ram.
Mkfs parameter:
--nodesize 4K (To bump up tree size)
Initial subvolume contents:
4G data copied from /usr and /lib.
(With enough regular small files)
Snapshots:
16 snapshots of the original subvolume.
each snapshot has 3 random files modified.
balance parameter:
-m
So the content should be pretty similar to a real world root fs layout.
| v4.19-rc1 | w/ patchset | diff (*)
---------------------------------------------------------------
relocated extents | 22929 | 22851 | -0.3%
qgroup dirty extents | 227757 | 140886 | -38.1%
time (sys) | 65.253s | 37.464s | -42.6%
time (real) | 74.032s | 44.722s | -39.6%
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Before this patch, with quota enabled during balance, we need to mark
the whole subtree dirty for quota.
E.g.
OO = Old tree blocks (from file tree)
NN = New tree blocks (from reloc tree)
File tree (src) Reloc tree (dst)
OO (a) NN (a)
/ \ / \
(b) OO OO (c) (b) NN NN (c)
/ \ / \ / \ / \
OO OO OO OO (d) OO OO OO NN (d)
For old balance + quota case, quota will mark the whole src and dst tree
dirty, including all the 3 old tree blocks in reloc tree.
It's doable for small file tree or new tree blocks are all located at
lower level.
But for large file tree or new tree blocks are all located at higher
level, this will lead to mark the whole tree dirty, and be unbelievably
slow.
This patch will change how we handle such balance with quota enabled
case.
Now we will search from (b) and (c) for any new tree blocks whose
generation is equal to @last_snapshot, and only mark them dirty.
In above case, we only need to trace tree blocks NN(b), NN(c) and NN(d).
(NN(a) will be traced when COW happens for nodeptr modification). And
also for tree blocks OO(b), OO(c), OO(d). (OO(a) will be traced when COW
happens for nodeptr modification.)
For above case, we could skip 3 tree blocks, but for larger tree, we can
skip tons of unmodified tree blocks, and hugely speed up balance.
This patch will introduce a new function,
btrfs_qgroup_trace_subtree_swap(), which will do the following main
work:
1) Read out real root eb
And setup basic dst_path for later calls
2) Call qgroup_trace_new_subtree_blocks()
To trace all new tree blocks in reloc tree and their counter
parts in the file tree.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
fs/btrfs/relocation.c:build_backref_tree() is some code from 2009 era,
although it works pretty fine, it's not that easy to understand.
Especially combined with the complex btrfs backref format.
This patch adds some basic comment for the backref build part of the
code, making it less hard to read, at least for backref searching part.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When mounting certain crafted image, btrfs will trigger kernel BUG_ON()
when trying to recover balance:
kernel BUG at fs/btrfs/extent-tree.c:8956!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 1 PID: 662 Comm: mount Not tainted 4.18.0-rc1-custom+ #10
RIP: 0010:walk_up_proc+0x336/0x480 [btrfs]
RSP: 0018:ffffb53540c9b890 EFLAGS: 00010202
Call Trace:
walk_up_tree+0x172/0x1f0 [btrfs]
btrfs_drop_snapshot+0x3a4/0x830 [btrfs]
merge_reloc_roots+0xe1/0x1d0 [btrfs]
btrfs_recover_relocation+0x3ea/0x420 [btrfs]
open_ctree+0x1af3/0x1dd0 [btrfs]
btrfs_mount_root+0x66b/0x740 [btrfs]
mount_fs+0x3b/0x16a
vfs_kern_mount.part.9+0x54/0x140
btrfs_mount+0x16d/0x890 [btrfs]
mount_fs+0x3b/0x16a
vfs_kern_mount.part.9+0x54/0x140
do_mount+0x1fd/0xda0
ksys_mount+0xba/0xd0
__x64_sys_mount+0x21/0x30
do_syscall_64+0x60/0x210
entry_SYSCALL_64_after_hwframe+0x49/0xbe
[CAUSE]
Extent tree corruption. In this particular case, reloc tree root's
owner is DATA_RELOC_TREE (should be TREE_RELOC), thus its backref is
corrupted and we failed the owner check in walk_up_tree().
[FIX]
It's pretty hard to take care of every extent tree corruption, but at
least we can remove such BUG_ON() and exit more gracefully.
And since in this particular image, DATA_RELOC_TREE and TREE_RELOC share
the same root (which is obviously invalid), we needs to make
__del_reloc_root() more robust to detect such invalid sharing to avoid
possible NULL dereference as root->node can be NULL in this case.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=200411
Reported-by: Xu Wen <wen.xu@gatech.edu>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are two members in struct btrfs_root which indicate root's
objectid: objectid and root_key.objectid.
They are both set to the same value in __setup_root():
static void __setup_root(struct btrfs_root *root,
struct btrfs_fs_info *fs_info,
u64 objectid)
{
...
root->objectid = objectid;
...
root->root_key.objectid = objecitd;
...
}
and not changed to other value after initialization.
grep in btrfs directory shows both are used in many places:
$ grep -rI "root->root_key.objectid" | wc -l
133
$ grep -rI "root->objectid" | wc -l
55
(4.17, inc. some noise)
It is confusing to have two similar variable names and it seems
that there is no rule about which should be used in a certain case.
Since ->root_key itself is needed for tree reloc tree, let's remove
'objecitd' member and unify code to use ->root_key.objectid in all places.
Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Using true and false here is closer to the expected semantic than using
0 and 1. No functional change.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Just get rid of pointless checks.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The fs_info can be fetched from the transaction handle directly.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Invalid reloc tree can cause kernel NULL pointer dereference when btrfs
does some cleanup of the reloc roots.
It turns out that fs_info::reloc_ctl can be NULL in
btrfs_recover_relocation() as we allocate relocation control after all
reloc roots have been verified.
So when we hit: note, we haven't called set_reloc_control() thus
fs_info::reloc_ctl is still NULL.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=199833
Reported-by: Xu Wen <wen.xu@gatech.edu>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The v0 extent type checks are the right case for the unlikely
annotations as we don't expect to ever see them, so let's give the
compiler some hint.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Following the removal of the v0 handling code let's be courteous and
print an error message when such extents are handled. In the cases
where we have a transaction just abort it, otherwise just call
btrfs_handle_fs_error. Both cases result in the FS being re-mounted RO.
In case the error handling would be too intrusive, leave the BUG_ON in
place, like extent_data_ref_count, other proper handling would catch
that earlier.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The v0 compat code was introduced in commit 5d4f98a28c
("Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)") 9
years ago, which was merged in 2.6.31. This means that the code is
there to support filesystems which are _VERY_ old and if you are using
btrfs on such an old kernel, you have much bigger problems. This coupled
with the fact that no one is likely testing/maintining this code likely
means it has bugs lurking. All things considered I think 43 kernel
releases later it's high time this remnant of the past got removed.
This patch removes all code wrapped in #ifdefs but leaves the BUG_ONs in case
we have a v0 with no support intact as a sort of safety-net.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If type of extent_inline_ref found is not expected, filesystem may have
been corrupted, should return EUCLEAN instead of EINVAL.
Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced from the passed transaction handle.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced from the passed bg cache.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Looks like the original idea was to print the hex of the flags which is
not coded with their flag name. So use the current buf pointer bp
instead of buf.
Reaching the uknown flags should never happen, it's there just in case.
Fixes: ebce0e01b9 ("btrfs: make block group flags in balance printks human-readable")
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since the commit c6100a4b4e ("Btrfs: replace tree->mapping with
tree->private_data"), parameter fs_info in alloc_reloc_control is
not used. So remove it.
Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 581c176041 ("btrfs: Validate child tree block's level and first
key") introduced new @first_key parameter for read_tree_block(), however
caller in replace_path() is parasing wrong key to read_tree_block().
It should use parameter @first_key other than @key.
Normally it won't expose problem as @key is normally initialzied to the
same value of @first_key we expect.
However in relocation recovery case, @key can be set to (0, 0, 0), and
since no valid key in relocation tree can be (0, 0, 0), it will cause
read_tree_block() to return -EUCLEAN and interrupt relocation recovery.
Fix it by setting @first_key correctly.
Fixes: 581c176041 ("btrfs: Validate child tree block's level and first key")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Remove GPL boilerplate text (long, short, one-line) and keep the rest,
ie. personal, company or original source copyright statements. Add the
SPDX header.
Signed-off-by: David Sterba <dsterba@suse.com>
We have several reports about node pointer points to incorrect child
tree blocks, which could have even wrong owner and level but still with
valid generation and checksum.
Although btrfs check could handle it and print error message like:
leaf parent key incorrect 60670574592
Kernel doesn't have enough check on this type of corruption correctly.
At least add such check to read_tree_block() and btrfs_read_buffer(),
where we need two new parameters @level and @first_key to verify the
child tree block.
The new @level check is mandatory and all call sites are already
modified to extract expected level from its call chain.
While @first_key is optional, the following call sites are skipping such
check:
1) Root node/leaf
As ROOT_ITEM doesn't contain the first key, skip @first_key check.
2) Direct backref
Only parent bytenr and level is known and we need to resolve the key
all by ourselves, skip @first_key check.
Another note of this verification is, it needs extra info from nodeptr
or ROOT_ITEM, so it can't fit into current tree-checker framework, which
is limited to node/leaf boundary.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Before this patch, btrfs qgroup is mixing per-transcation meta rsv with
preallocated meta rsv, making it quite easy to underflow qgroup meta
reservation.
Since we have the new qgroup meta rsv types, apply it to delalloc
reservation.
Now for delalloc, most of its reserved space will use META_PREALLOC qgroup
rsv type.
And for callers reducing outstanding extent like btrfs_finish_ordered_io(),
they will convert corresponding META_PREALLOC reservation to
META_PERTRANS.
This is mainly due to the fact that current qgroup numbers will only be
updated in btrfs_commit_transaction(), that's to say if we don't keep
such placeholder reservation, we can exceed qgroup limitation.
And for callers freeing outstanding extent in error handler, we will
just free META_PREALLOC bytes.
This behavior makes callers of btrfs_qgroup_release_meta() or
btrfs_qgroup_convert_meta() to be aware of which type they are.
So in this patch, btrfs_delalloc_release_metadata() and its callers get
an extra parameter to info qgroup to do correct meta convert/release.
The good news is, even we use the wrong type (convert or free), it won't
cause obvious bug, as prealloc type is always in good shape, and the
type only affects how per-trans meta is increased or not.
So the worst case will be at most metadata limitation can be sometimes
exceeded (no convert at all) or metadata limitation is reached too soon
(no free at all).
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Essentially duplicate the error handling from the above block which
handles the !PageUptodate(page) case and additionally clear
EXTENT_BOUNDARY.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The patch from commit a7e3b975a0 ("Btrfs: fix reported number of inode
blocks") introduced a regression where if we do a buffered write starting
at position equal to or greater than the file's size and then stat(2) the
file before writeback is triggered, the number of used blocks does not
change (unless there's a prealloc/unwritten extent). Example:
$ xfs_io -f -c "pwrite -S 0xab 0 64K" foobar
$ du -h foobar
0 foobar
$ sync
$ du -h foobar
64K foobar
The first version of that patch didn't had this regression and the second
version, which was the one committed, was made only to address some
performance regression detected by the intel test robots using fs_mark.
This fixes the regression by setting the new delaloc bit in the range, and
doing it at btrfs_dirty_pages() while setting the regular dealloc bit as
well, so that this way we set both bits at once avoiding navigation of the
inode's io tree twice. Doing it at btrfs_dirty_pages() is also the most
meaninful place, as we should set the new dellaloc bit when if we set the
delalloc bit, which happens only if we copied bytes into the pages at
__btrfs_buffered_write().
This was making some of LTP's du tests fail, which can be quickly run
using a command line like the following:
$ ./runltp -q -p -l /ltp.log -f commands -s du -d /mnt
Fixes: a7e3b975a0 ("Btrfs: fix reported number of inode blocks")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Right now we do a lot of weird hoops around outstanding_extents in order
to keep the extent count consistent. This is because we logically
transfer the outstanding_extent count from the initial reservation
through the set_delalloc_bits. This makes it pretty difficult to get a
handle on how and when we need to mess with outstanding_extents.
Fix this by revamping the rules of how we deal with outstanding_extents.
Now instead everybody that is holding on to a delalloc extent is
required to increase the outstanding extents count for itself. This
means we'll have something like this
btrfs_delalloc_reserve_metadata - outstanding_extents = 1
btrfs_set_extent_delalloc - outstanding_extents = 2
btrfs_release_delalloc_extents - outstanding_extents = 1
for an initial file write. Now take the append write where we extend an
existing delalloc range but still under the maximum extent size
btrfs_delalloc_reserve_metadata - outstanding_extents = 2
btrfs_set_extent_delalloc
btrfs_set_bit_hook - outstanding_extents = 3
btrfs_merge_extent_hook - outstanding_extents = 2
btrfs_delalloc_release_extents - outstanding_extnets = 1
In order to make the ordered extent transition we of course must now
make ordered extents carry their own outstanding_extent reservation, so
for cow_file_range we end up with
btrfs_add_ordered_extent - outstanding_extents = 2
clear_extent_bit - outstanding_extents = 1
btrfs_remove_ordered_extent - outstanding_extents = 0
This makes all manipulations of outstanding_extents much more explicit.
Every successful call to btrfs_delalloc_reserve_metadata _must_ now be
combined with btrfs_release_delalloc_extents, even in the error case, as
that is the only function that actually modifies the
outstanding_extents counter.
The drawback to this is now we are much more likely to have transient
cases where outstanding_extents is much larger than it actually should
be. This could happen before as we manipulated the delalloc bits, but
now it happens basically at every write. This may put more pressure on
the ENOSPC flushing code, but I think making this code simpler is worth
the cost. I have another change coming to mitigate this side-effect
somewhat.
I also added trace points for the counter manipulation. These were used
by a bpf script I wrote to help track down leak issues.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We need the actual root for the ref verifier tool to work, so change
these functions to pass the root around instead. This will be used in
a subsequent patch.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
__del_reloc_root should be called before freeing up reloc_root->node.
If not, calling __del_reloc_root() dereference reloc_root->node, causing
the system BUG.
Fixes: 6bdf131fac ("Btrfs: don't leak reloc root nodes on error")
Cc: <stable@vger.kernel.org> # 4.9
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The BUG_ON() can be triggered when the caller is processing an invalid
extent inline ref, e.g.
a shared data ref is offered instead of an extent data ref, such that
it tries to find a non-existent tree block and then btrfs_search_slot
returns 1 for no such item.
This replaces the BUG_ON() with a WARN() followed by calling
btrfs_print_leaf() to show more details about what's going on and
returning -EINVAL to upper callers.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we have a helper to report invalid value of extent inline ref
type, we need to quit gracefully instead of throwing out a kernel panic.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since we have a helper which can do sanity check, this converts all
btrfs_extent_inline_ref_type to it.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
One of the error handling paths in __add_reloc_root contains btrfs_panic()
followed by some other code. As the name implies what it does is print
some error message and call BUG, naturally what follow afterwards is not
invoked. So remove this extra code.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Dave Jones hit a WARN_ON(nr < 0) in btrfs_wait_ordered_roots() with
v4.12-rc6. This was because commit 70e7af244 made it possible for
calc_reclaim_items_nr() to return a negative number. It's not really a
bug in that commit, it just didn't go far enough down the stack to find
all the possible 64->32 bit overflows.
This switches calc_reclaim_items_nr() to return a u64 and changes everyone
that uses the results of that math to u64 as well.
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Fixes: 70e7af2 ("Btrfs: fix delalloc accounting leak caused by u32 overflow")
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
For the following case, btrfs can underflow qgroup reserved space
at an error path:
(Page size 4K, function name without "btrfs_" prefix)
Task A | Task B
----------------------------------------------------------------------
Buffered_write [0, 2K) |
|- check_data_free_space() |
| |- qgroup_reserve_data() |
| Range aligned to page |
| range [0, 4K) <<< |
| 4K bytes reserved <<< |
|- copy pages to page cache |
| Buffered_write [2K, 4K)
| |- check_data_free_space()
| | |- qgroup_reserved_data()
| | Range alinged to page
| | range [0, 4K)
| | Already reserved by A <<<
| | 0 bytes reserved <<<
| |- delalloc_reserve_metadata()
| | And it *FAILED* (Maybe EQUOTA)
| |- free_reserved_data_space()
|- qgroup_free_data()
Range aligned to page range
[0, 4K)
Freeing 4K
(Special thanks to Chandan for the detailed report and analyse)
[CAUSE]
Above Task B is freeing reserved data range [0, 4K) which is actually
reserved by Task A.
And at writeback time, page dirty by Task A will go through writeback
routine, which will free 4K reserved data space at file extent insert
time, causing the qgroup underflow.
[FIX]
For btrfs_qgroup_free_data(), add @reserved parameter to only free
data ranges reserved by previous btrfs_qgroup_reserve_data().
So in above case, Task B will try to free 0 byte, so no underflow.
Reported-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce a new parameter, struct extent_changeset for
btrfs_qgroup_reserved_data() and its callers.
Such extent_changeset was used in btrfs_qgroup_reserve_data() to record
which range it reserved in current reserve, so it can free it in error
paths.
The reason we need to export it to callers is, at buffered write error
path, without knowing what exactly which range we reserved in current
allocation, we can free space which is not reserved by us.
This will lead to qgroup reserved space underflow.
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For extent_io tree's we have carried the address_mapping of the inode
around in the io tree in order to pull the inode back out for calling
into various tree ops hooks. This works fine when everything that has
an extent_io_tree has an inode. But we are going to remove the
btree_inode, so we need to change this. Instead just have a generic
void * for private data that we can initialize with, and have all the
tree ops use that instead. This had a lot of cascading changes but
should be relatively straightforward.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor reordering of the callback prototypes ]
Signed-off-by: David Sterba <dsterba@suse.com>
The free space cache APIs accept a root but always use the tree root.
Also, btrfs_truncate_free_space_cache accepts a root AND an inode but
the inode always points to the root anyway, so let's just pass the inode.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_inc_block_group_ro is either passed the extent root or the dev
root, but it doesn't do anything with the dev tree. Let's convert
to passing an fs_info and using the extent root.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs_ino takes a struct inode and this causes a lot of
internal btrfs functions which consume this ino to take a VFS inode,
rather than btrfs' own struct btrfs_inode. In order to fix this "leak"
of VFS structs into the internals of btrfs first it's necessary to
eliminate all uses of struct inode for the purpose of inode. This patch
does that by using BTRFS_I to convert an inode to btrfs_inode. With
this problem eliminated subsequent patches will start eliminating the
passing of struct inode altogether, eventually resulting in a lot cleaner
code.
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
[ fix btrfs_get_extent tracepoint prototype ]
Signed-off-by: David Sterba <dsterba@suse.com>
Patches queued up by Filipe:
The most important change is still the fix for the extent tree
corruption that happens due to balance when qgroups are enabled (a
regression introduced in 4.7 by a fix for a regression from the last
qgroups rework). This has been hitting SLE and openSUSE users and QA
very badly, where transactions keep getting aborted when running
delayed references leaving the root filesystem in RO mode and nearly
unusable. There are fixes here that allow us to run xfstests again
with the integrity checker enabled, which has been impossible since 4.8
(apparently I'm the only one running xfstests with the integrity
checker enabled, which is useful to validate dirtied leafs, like
checking if there are keys out of order, etc). The rest are just some
trivial fixes, most of them tagged for stable, and two cleanups.
Signed-off-by: Chris Mason <clm@fb.com>
Now we only use the root parameter to print the root objectid in
a tracepoint. We can use the root parameter from the transaction
handle for that. It's also used to join the transaction with
async commits, so we remove the comment that it's just for checking.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are loads of functions in btrfs that accept a root parameter
but only use it to obtain an fs_info pointer. Let's convert those to
just accept an fs_info pointer directly.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In routines where someptr->fs_info is referenced multiple times, we
introduce a convenience variable. This makes the code considerably
more readable.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>