[BUG]
There is a recent report that when memory pressure is high (including
cached pages), btrfs can spend most of its time on memory allocation in
btrfs_alloc_page_array() for compressed read/write.
[CAUSE]
For btrfs_alloc_page_array() we always go alloc_pages_bulk_array(), and
even if the bulk allocation failed (fell back to single page
allocation) we still retry but with extra memalloc_retry_wait().
If the bulk alloc only returned one page a time, we would spend a lot of
time on the retry wait.
The behavior was introduced in commit 395cb57e85 ("btrfs: wait between
incomplete batch memory allocations").
[FIX]
Although the commit mentioned that other filesystems do the wait, it's
not the case at least nowadays.
All the mainlined filesystems only call memalloc_retry_wait() if they
failed to allocate any page (not only for bulk allocation).
If there is any progress, they won't call memalloc_retry_wait() at all.
For example, xfs_buf_alloc_pages() would only call memalloc_retry_wait()
if there is no allocation progress at all, and the call is not for
metadata readahead.
So I don't believe we should call memalloc_retry_wait() unconditionally
for short allocation.
Call memalloc_retry_wait() if it fails to allocate any page for tree
block allocation (which goes with __GFP_NOFAIL and may not need the
special handling anyway), and reduce the latency for
btrfs_alloc_page_array().
Reported-by: Julian Taylor <julian.taylor@1und1.de>
Tested-by: Julian Taylor <julian.taylor@1und1.de>
Link: https://lore.kernel.org/all/8966c095-cbe7-4d22-9784-a647d1bf27c3@1und1.de/
Fixes: 395cb57e85 ("btrfs: wait between incomplete batch memory allocations")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add an ASSERT to catch a faulty delayed reference item resulting from
prematurely cleared extent buffer.
Also, add a WARN to detect if we try to dirty a ZEROOUT buffer again, which
is suspicious as its update will be lost.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Btrfs clears the content of an extent buffer marked as
EXTENT_BUFFER_ZONED_ZEROOUT before the bio submission. This mechanism is
introduced to prevent a write hole of an extent buffer, which is once
allocated, marked dirty, but turns out unnecessary and cleaned up within
one transaction operation.
Currently, btrfs_clear_buffer_dirty() marks the extent buffer as
EXTENT_BUFFER_ZONED_ZEROOUT, and skips the entry function. If this call
happens while the buffer is under IO (with the WRITEBACK flag set,
without the DIRTY flag), we can add the ZEROOUT flag and clear the
buffer's content just before a bio submission. As a result:
1) it can lead to adding faulty delayed reference item which leads to a
FS corrupted (EUCLEAN) error, and
2) it writes out cleared tree node on disk
The former issue is previously discussed in [1]. The corruption happens
when it runs a delayed reference update. So, on-disk data is safe.
[1] https://lore.kernel.org/linux-btrfs/3f4f2a0ff1a6c818050434288925bdcf3cd719e5.1709124777.git.naohiro.aota@wdc.com/
The latter one can reach on-disk data. But, as that node is already
processed by btrfs_clear_buffer_dirty(), that will be invalidated in the
next transaction commit anyway. So, the chance of hitting the corruption
is relatively small.
Anyway, we should skip flagging ZEROOUT on a non-DIRTY extent buffer, to
keep the content under IO intact.
Fixes: aa6313e6ff ("btrfs: zoned: don't clear dirty flag of extent buffer")
CC: stable@vger.kernel.org # 6.8
Link: https://lore.kernel.org/linux-btrfs/oadvdekkturysgfgi4qzuemd57zudeasynswurjxw3ocdfsef6@sjyufeugh63f/
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are reports from tree-checker that detects corrupted nodes,
without any obvious pattern so possibly an overwrite in memory.
After some debugging it turns out there's a race when reading an extent
buffer the uptodate status can be missed.
To prevent concurrent reads for the same extent buffer,
read_extent_buffer_pages() performs these checks:
/* (1) */
if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
return 0;
/* (2) */
if (test_and_set_bit(EXTENT_BUFFER_READING, &eb->bflags))
goto done;
At this point, it seems safe to start the actual read operation. Once
that completes, end_bbio_meta_read() does
/* (3) */
set_extent_buffer_uptodate(eb);
/* (4) */
clear_bit(EXTENT_BUFFER_READING, &eb->bflags);
Normally, this is enough to ensure only one read happens, and all other
callers wait for it to finish before returning. Unfortunately, there is
a racey interleaving:
Thread A | Thread B | Thread C
---------+----------+---------
(1) | |
| (1) |
(2) | |
(3) | |
(4) | |
| (2) |
| | (1)
When this happens, thread B kicks of an unnecessary read. Worse, thread
C will see UPTODATE set and return immediately, while the read from
thread B is still in progress. This race could result in tree-checker
errors like this as the extent buffer is concurrently modified:
BTRFS critical (device dm-0): corrupted node, root=256
block=8550954455682405139 owner mismatch, have 11858205567642294356
expect [256, 18446744073709551360]
Fix it by testing UPTODATE again after setting the READING bit, and if
it's been set, skip the unnecessary read.
Fixes: d7172f52e9 ("btrfs: use per-buffer locking for extent_buffer reading")
Link: https://lore.kernel.org/linux-btrfs/CAHk-=whNdMaN9ntZ47XRKP6DBes2E5w7fi-0U3H2+PS18p+Pzw@mail.gmail.com/
Link: https://lore.kernel.org/linux-btrfs/f51a6d5d7432455a6a858d51b49ecac183e0bbc9.1706312914.git.wqu@suse.com/
Link: https://lore.kernel.org/linux-btrfs/c7241ea4-fcc6-48d2-98c8-b5ea790d6c89@gmx.com/
CC: stable@vger.kernel.org # 6.5+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Tavian Barnes <tavianator@tavianator.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor update of changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
During fiemap we may have to visit multiple leaves of the subvolume's
inode tree, and each time we are freeing and allocating an extent buffer
to use as a clone of each visited leaf. Optimize this by reusing cloned
extent buffers, to avoid the freeing and re-allocation both of the extent
buffer structure itself and more importantly of the pages attached to the
extent buffer.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For fiemap we recently stopped locking the target extent range for the
whole duration of the fiemap call, in order to avoid a deadlock in a
scenario where the fiemap buffer happens to be a memory mapped range of
the same file. This use case is very unlikely to be useful in practice but
it may be triggered by fuzz testing (syzbot, etc).
This however introduced a race that makes us miss delalloc ranges for
file regions that are currently holes, so the caller of fiemap will not
be aware that there's data for some file regions. This can be quite
serious for some use cases - for example in coreutils versions before 9.0,
the cp program used fiemap to detect holes and data in the source file,
copying only regions with data (extents or delalloc) from the source file
to the destination file in order to preserve holes (see the documentation
for its --sparse command line option). This means that if cp was used
with a source file that had delalloc in a hole, the destination file could
end up without that data, which is effectively a data loss issue, if it
happened to hit the race described below.
The race happens like this:
1) Fiemap is called, without the FIEMAP_FLAG_SYNC flag, for a file that
has delalloc in the file range [64M, 65M[, which is currently a hole;
2) Fiemap locks the inode in shared mode, then starts iterating the
inode's subvolume tree searching for file extent items, without having
the whole fiemap target range locked in the inode's io tree - the
change introduced recently by commit b0ad381fa7 ("btrfs: fix
deadlock with fiemap and extent locking"). It only locks ranges in
the io tree when it finds a hole or prealloc extent since that
commit;
3) Note that fiemap clones each leaf before using it, and this is to
avoid deadlocks when locking a file range in the inode's io tree and
the fiemap buffer is memory mapped to some file, because writing
to the page with btrfs_page_mkwrite() will wait on any ordered extent
for the page's range and the ordered extent needs to lock the range
and may need to modify the same leaf, therefore leading to a deadlock
on the leaf;
4) While iterating the file extent items in the cloned leaf before
finding the hole in the range [64M, 65M[, the delalloc in that range
is flushed and its ordered extent completes - meaning the corresponding
file extent item is in the inode's subvolume tree, but not present in
the cloned leaf that fiemap is iterating over;
5) When fiemap finds the hole in the [64M, 65M[ range by seeing the gap in
the cloned leaf (or a file extent item with disk_bytenr == 0 in case
the NO_HOLES feature is not enabled), it will lock that file range in
the inode's io tree and then search for delalloc by checking for the
EXTENT_DELALLOC bit in the io tree for that range and ordered extents
(with btrfs_find_delalloc_in_range()). But it finds nothing since the
delalloc in that range was already flushed and the ordered extent
completed and is gone - as a result fiemap will not report that there's
delalloc or an extent for the range [64M, 65M[, so user space will be
mislead into thinking that there's a hole in that range.
This could actually be sporadically triggered with test case generic/094
from fstests, which reports a missing extent/delalloc range like this:
generic/094 2s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad)
--- tests/generic/094.out 2020-06-10 19:29:03.830519425 +0100
+++ /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad 2024-02-28 11:00:00.381071525 +0000
@@ -1,3 +1,9 @@
QA output created by 094
fiemap run with sync
fiemap run without sync
+ERROR: couldn't find extent at 7
+map is 'HHDDHPPDPHPH'
+logical: [ 5.. 6] phys: 301517.. 301518 flags: 0x800 tot: 2
+logical: [ 8.. 8] phys: 301520.. 301520 flags: 0x800 tot: 1
...
(Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/generic/094.out /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad' to see the entire diff)
So in order to fix this, while still avoiding deadlocks in the case where
the fiemap buffer is memory mapped to the same file, change fiemap to work
like the following:
1) Always lock the whole range in the inode's io tree before starting to
iterate the inode's subvolume tree searching for file extent items,
just like we did before commit b0ad381fa7 ("btrfs: fix deadlock with
fiemap and extent locking");
2) Now instead of writing to the fiemap buffer every time we have an extent
to report, write instead to a temporary buffer (1 page), and when that
buffer becomes full, stop iterating the file extent items, unlock the
range in the io tree, release the search path, submit all the entries
kept in that buffer to the fiemap buffer, and then resume the search
for file extent items after locking again the remainder of the range in
the io tree.
The buffer having a size of a page, allows for 146 entries in a system
with 4K pages. This is a large enough value to have a good performance
by avoiding too many restarts of the search for file extent items.
In other words this preserves the huge performance gains made in the
last two years to fiemap, while avoiding the deadlocks in case the
fiemap buffer is memory mapped to the same file (useless in practice,
but possible and exercised by fuzz testing and syzbot).
Fixes: b0ad381fa7 ("btrfs: fix deadlock with fiemap and extent locking")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The SLAB_MEM_SPREAD flag used to be implemented in SLAB, which was
removed as of v6.8-rc1, so it became a dead flag since the commit
16a1d96835 ("mm/slab: remove mm/slab.c and slab_def.h"). And the
series[1] went on to mark it obsolete to avoid confusion for users.
Here we can just remove all its users, which has no functional change.
[1] https://lore.kernel.org/all/20240223-slab-cleanup-flags-v2-1-02f1753e8303@suse.cz/
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can pass a valid em cache pointer down to __get_extent_map() and
drop the validity check. This avoids the special case, the call stacks
are simple:
btrfs_read_folio
btrfs_do_readpage
__get_extent_map
extent_readahead
contiguous_readpages
btrfs_do_readpage
__get_extent_map
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The fs_info and sectorsize remain the same during the loops, no need to
set them on each iteration.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a convenience helper to get a fs_info from a VFS inode pointer
instead of open coding the chain or using btrfs_sb() that in some cases
does one more pointer hop. This is implemented as a macro (still with
type checking) so we don't need full definitions of struct btrfs_inode,
btrfs_root or btrfs_fs_info.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add convenience helpers to get a fs_info from a page or folio pointer
instead of open coding the chain or using btrfs_sb() that in some cases
does one more pointer hop. This is implemented as a macro (still with
type checking) so we don't need full definitions of struct page, folio,
btrfs_root and btrfs_fs_info. The latter can't be static inlines as this
would create loop between ctree.h <-> fs.h, or the headers would have to
be restructured.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add convenience helpers to get a struct btrfs_inode from a page or folio
pointer instead of open coding the chain or intermediate BTRFS_I. This
is implemented as a macro (still with type checking) so we don't need
full definitions of struct page or address_space.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With help of neovim, LSP and clangd we can identify header files that
are not actually needed to be included in the .c files. This is focused
only on removal (with minor fixups), further cleanups are possible but
will require doing the header files properly with forward declarations,
minimized includes and include-what-you-use care.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The block size stored in the super block is used by subsystems outside
of btrfs and it's a copy of fs_info::sectorsize. Unify that to always
use our sectorsize, with the exception of mount where we first need to
use fixed values (4K) until we read the super block and can set the
sectorsize.
Replace all uses, in most cases it's fewer pointer indirections.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Turn set_page_extent_mapped() into a wrapper around this version.
Saves a call to compound_head() for callers who already have a folio
and removes a couple of users of page->mapping.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
fstests looks for WARN_ON's in dmesg. Add WARN_ON_ONCE() to our leak
detection code (enabled only in debug builds) so that fstests will fail
if these things trip at all. This will allow us to easily catch
problems with our reference counting that may otherwise go unnoticed.
Reviewed-by: Neal Gompa <neal@gompa.dev>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After the conversion to folio interfaces (but without the patch to
enable larger folio allocation), there is an LTP report about observable
performance drop on metadata heavy operations.
https://lore.kernel.org/linux-btrfs/202312221750.571925bd-oliver.sang@intel.com/
This drop is caused by the extra code of calculating the
folio_size()/folio_shift(), instead of the old hard coded
PAGE_SIZE/PAGE_SHIFT.
To slightly reduce the overhead, just cache both folio_size and
folio_shift in extent_buffer.
The two new members (u32 folio_size and u8 folio_shift) are stored
inside the holes of extent_buffer. folio_size is shared with len, which
is reduced to u32. The size of eb does not change.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The variable @bio_offset was introduced in commit 7ffd27e378 ("btrfs:
pass bio_offset to check_data_csum() directly"), when we are still using
the same endio function for both data and metadata.
Later we had several changes to data and metadata endio functions:
- Data verification is handled by btrfs bio layer
- Split data and metadata endio paths
Now for data path we no longer do any verification in
end_bbio_data_read(), as the verification is handled by btrfs bio layer
already.
Thus there is no need for such bio_offset variable.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The parameter @pg_offset of btrfs_get_extent() is only utilized for
inlined extent, and we already have an ASSERT() and tree-checker, to
make sure we can only get inline extent at file offset 0.
Any invalid inline extent with non-zero file offset would be rejected by
tree-checker in the first place.
Thus the @pg_offset parameter is not really necessary, just remove it.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmXh1GgACgkQxWXV+ddt
WDtnvA/7BN7BZ6QmwWv9UyxhgSBtzI19AXPi/kBsssnnjNuzXoHFaVHj68lQCCOB
a4YjRxAg7nmwFGHdVDTdnwXgUECzqlVkeX9cXg1ZpJy0IfP9RriGedRlC/93z7aV
pg6DnKMh2FlkibK7yO6kRBR8RYLc5aVIytqHXgUeqbaquuhj2Hh8EpqRo2X0RsoE
wDXlK0qgrU8HyrA3fHdqKYPcm1+cYABGTCwGx65iRffy8vH+KFSAr71G8jOJVEUj
DgNWJCpBjXJNs0dsKrik5oGmvLd3GDBKinNX7R2mAvMAMGWrL+fVVTVTfBS/clUT
FBiVFNYCJuphMcO3Qjs6JIuEez0GuGEeh1m+tQ8W795At1FSiINtE5J7LjmJUl5X
FuUwOUpxco1lTXBLX149Y9kk7AEOaqYxy0XbH4r5bbKyuzQegRGB9/qQX4sSaCt3
3T+Td9PvS+6Jo+CDO0dsYhG/h3bsHeHtHGR6f2CiO/m1zHDnTX9aYVcLMM3hsrMI
8OUEy1jkuKnDZQuZuIWES/3V9FlJL34dR3Cb236Pv/yIH1iujIc27g0qXrC1vzPg
wnUL1wheLQ9IRLedXoiHtX2Y2ZfFQGQDrIKNCJFD+WNPkZYffih5QNTV/mPZmL80
9EbjoVTu+6rygzdD43R1RWvK0kFsY44RKhHreI8SItO+e3/0TAs=
=hMf8
-----END PGP SIGNATURE-----
Merge tag 'for-6.8-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix freeing allocated id for anon dev when snapshot creation fails
- fiemap fixes:
- followup for a recent deadlock fix, ranges that fiemap can access
can still race with ordered extent completion
- make sure fiemap with SYNC flag does not race with writes
* tag 'for-6.8-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix double free of anonymous device after snapshot creation failure
btrfs: ensure fiemap doesn't race with writes when FIEMAP_FLAG_SYNC is given
btrfs: fix race between ordered extent completion and fiemap
When FIEMAP_FLAG_SYNC is given to fiemap the expectation is that that
are no concurrent writes and we get a stable view of the inode's extent
layout.
When the flag is given we flush all IO (and wait for ordered extents to
complete) and then lock the inode in shared mode, however that leaves open
the possibility that a write might happen right after the flushing and
before locking the inode. So fix this by flushing again after locking the
inode - we leave the initial flushing before locking the inode to avoid
holding the lock and blocking other RO operations while waiting for IO
and ordered extents to complete. The second flushing while holding the
inode's lock will most of the time do nothing or very little since the
time window for new writes to have happened is small.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For fiemap we recently stopped locking the target extent range for the
whole duration of the fiemap call, in order to avoid a deadlock in a
scenario where the fiemap buffer happens to be a memory mapped range of
the same file. This use case is very unlikely to be useful in practice but
it may be triggered by fuzz testing (syzbot, etc).
However by not locking the target extent range for the whole duration of
the fiemap call we can race with an ordered extent. This happens like
this:
1) The fiemap task finishes processing a file extent item that covers
the file range [512K, 1M[, and that file extent item is the last item
in the leaf currently being processed;
2) And ordered extent for the file range [768K, 2M[, in COW mode,
completes (btrfs_finish_one_ordered()) and the file extent item
covering the range [512K, 1M[ is trimmed to cover the range
[512K, 768K[ and then a new file extent item for the range [768K, 2M[
is inserted in the inode's subvolume tree;
3) The fiemap task calls fiemap_next_leaf_item(), which then calls
btrfs_next_leaf() to find the next leaf / item. This finds that the
the next key following the one we previously processed (its type is
BTRFS_EXTENT_DATA_KEY and its offset is 512K), is the key corresponding
to the new file extent item inserted by the ordered extent, which has
a type of BTRFS_EXTENT_DATA_KEY and an offset of 768K;
4) Later the fiemap code ends up at emit_fiemap_extent() and triggers
the warning:
if (cache->offset + cache->len > offset) {
WARN_ON(1);
return -EINVAL;
}
Since we get 1M > 768K, because the previously emitted entry for the
old extent covering the file range [512K, 1M[ ends at an offset that
is greater than the new extent's start offset (768K). This makes fiemap
fail with -EINVAL besides triggering the warning that produces a stack
trace like the following:
[1621.677651] ------------[ cut here ]------------
[1621.677656] WARNING: CPU: 1 PID: 204366 at fs/btrfs/extent_io.c:2492 emit_fiemap_extent+0x84/0x90 [btrfs]
[1621.677899] Modules linked in: btrfs blake2b_generic (...)
[1621.677951] CPU: 1 PID: 204366 Comm: pool Not tainted 6.8.0-rc5-btrfs-next-151+ #1
[1621.677954] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[1621.677956] RIP: 0010:emit_fiemap_extent+0x84/0x90 [btrfs]
[1621.678033] Code: 2b 4c 89 63 (...)
[1621.678035] RSP: 0018:ffffab16089ffd20 EFLAGS: 00010206
[1621.678037] RAX: 00000000004fa000 RBX: ffffab16089ffe08 RCX: 0000000000009000
[1621.678039] RDX: 00000000004f9000 RSI: 00000000004f1000 RDI: ffffab16089ffe90
[1621.678040] RBP: 00000000004f9000 R08: 0000000000001000 R09: 0000000000000000
[1621.678041] R10: 0000000000000000 R11: 0000000000001000 R12: 0000000041d78000
[1621.678043] R13: 0000000000001000 R14: 0000000000000000 R15: ffff9434f0b17850
[1621.678044] FS: 00007fa6e20006c0(0000) GS:ffff943bdfa40000(0000) knlGS:0000000000000000
[1621.678046] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1621.678048] CR2: 00007fa6b0801000 CR3: 000000012d404002 CR4: 0000000000370ef0
[1621.678053] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1621.678055] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[1621.678056] Call Trace:
[1621.678074] <TASK>
[1621.678076] ? __warn+0x80/0x130
[1621.678082] ? emit_fiemap_extent+0x84/0x90 [btrfs]
[1621.678159] ? report_bug+0x1f4/0x200
[1621.678164] ? handle_bug+0x42/0x70
[1621.678167] ? exc_invalid_op+0x14/0x70
[1621.678170] ? asm_exc_invalid_op+0x16/0x20
[1621.678178] ? emit_fiemap_extent+0x84/0x90 [btrfs]
[1621.678253] extent_fiemap+0x766/0xa30 [btrfs]
[1621.678339] btrfs_fiemap+0x45/0x80 [btrfs]
[1621.678420] do_vfs_ioctl+0x1e4/0x870
[1621.678431] __x64_sys_ioctl+0x6a/0xc0
[1621.678434] do_syscall_64+0x52/0x120
[1621.678445] entry_SYSCALL_64_after_hwframe+0x6e/0x76
There's also another case where before calling btrfs_next_leaf() we are
processing a hole or a prealloc extent and we had several delalloc ranges
within that hole or prealloc extent. In that case if the ordered extents
complete before we find the next key, we may end up finding an extent item
with an offset smaller than (or equals to) the offset in cache->offset.
So fix this by changing emit_fiemap_extent() to address these three
scenarios like this:
1) For the first case, steps listed above, adjust the length of the
previously cached extent so that it does not overlap with the current
extent, emit the previous one and cache the current file extent item;
2) For the second case where he had a hole or prealloc extent with
multiple delalloc ranges inside the hole or prealloc extent's range,
and the current file extent item has an offset that matches the offset
in the fiemap cache, just discard what we have in the fiemap cache and
assign the current file extent item to the cache, since it's more up
to date;
3) For the third case where he had a hole or prealloc extent with
multiple delalloc ranges inside the hole or prealloc extent's range
and the offset of the file extent item we just found is smaller than
what we have in the cache, just skip the current file extent item
if its range end at or behind the cached extent's end, because we may
have emitted (to the fiemap user space buffer) delalloc ranges that
overlap with the current file extent item's range. If the file extent
item's range goes beyond the end offset of the cached extent, just
emit the cached extent and cache a subrange of the file extent item,
that goes from the end offset of the cached extent to the end offset
of the file extent item.
Dealing with those cases in those ways makes everything consistent by
reflecting the current state of file extent items in the btree and
without emitting extents that have overlapping ranges (which would be
confusing and violating expectations).
This issue could be triggered often with test case generic/561, and was
also hit and reported by Wang Yugui.
Reported-by: Wang Yugui <wangyugui@e16-tech.com>
Link: https://lore.kernel.org/linux-btrfs/20240223104619.701F.409509F4@e16-tech.com/
Fixes: b0ad381fa7 ("btrfs: fix deadlock with fiemap and extent locking")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmXV2z8ACgkQxWXV+ddt
WDsudxAAoKcp1DbuOtaOzG/XVnIKt36drK4cwyZnGo9PZ9vlgT6k+T0efto4DkOF
fNWy2d/9iGy9RHy4oxZL6ceb3rcWW0NbhiKHeTPNqL4ZCPa7t6bxMWXSYBh6pYgZ
6EUS6H9Don05F7rQs8rERc+VIW6u1HFTLn4wS1cmlTyTQzZwlk9B2V6KtDtHBi0k
B4CCxY6jX2bl7BncXdYteb13Xjg7+JnWvfSKb7ouSVnL8VEcGG13QkPFNV2Xsoi2
uDDsw+QKBEcPNgBIubBUwbLS5V5vYa1H1meUFJkciaeblHVlVIMN3h7+Y8VNKMTC
qpxEo3Hx6oqmw9LdEIU7WsvFs0JJum2fKOjOx3vr1d3AiyFG6W6lrm1fKwnp3dt7
dHdAYuo8+Q4rirGlMDcEoYqpy7AcV8QqtSYajdrdpB1dqHcHhukSNqJ0dx5lYElU
HtnMXD9vLc4uJDcl9Z1aTWEmB+7nj5HwukSnTqQhgwpCZM7mrz6pe1DuD5iKinG6
Yth9QFhgcbcnchPA+3SOtC6uuje9chHo6L6eTVacnKAKyKgLY8qTTsh3zYQNVhMX
M2aWcAkizq20vKe7JFxs7M/tClyuswTjOP6RYzeayY21Rn8gGwe7uhXhv5MKpEh5
TjXjiixsrxxsyyaED7Kl69i54BvmM/TI35p7Jbx6Ln7PPD8lf8M=
=pXyL
-----END PGP SIGNATURE-----
Merge tag 'for-6.8-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- Fix a deadlock in fiemap.
There was a big lock around the whole operation that can interfere
with a page fault and mkwrite.
Reducing the lock scope can also speed up fiemap
- Fix range condition for extent defragmentation which could lead to
worse layout in some cases
* tag 'for-6.8-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix deadlock with fiemap and extent locking
btrfs: defrag: avoid unnecessary defrag caused by incorrect extent size
While working on the patchset to remove extent locking I got a lockdep
splat with fiemap and pagefaulting with my new extent lock replacement
lock.
This deadlock exists with our normal code, we just don't have lockdep
annotations with the extent locking so we've never noticed it.
Since we're copying the fiemap extent to user space on every iteration
we have the chance of pagefaulting. Because we hold the extent lock for
the entire range we could mkwrite into a range in the file that we have
mmap'ed. This would deadlock with the following stack trace
[<0>] lock_extent+0x28d/0x2f0
[<0>] btrfs_page_mkwrite+0x273/0x8a0
[<0>] do_page_mkwrite+0x50/0xb0
[<0>] do_fault+0xc1/0x7b0
[<0>] __handle_mm_fault+0x2fa/0x460
[<0>] handle_mm_fault+0xa4/0x330
[<0>] do_user_addr_fault+0x1f4/0x800
[<0>] exc_page_fault+0x7c/0x1e0
[<0>] asm_exc_page_fault+0x26/0x30
[<0>] rep_movs_alternative+0x33/0x70
[<0>] _copy_to_user+0x49/0x70
[<0>] fiemap_fill_next_extent+0xc8/0x120
[<0>] emit_fiemap_extent+0x4d/0xa0
[<0>] extent_fiemap+0x7f8/0xad0
[<0>] btrfs_fiemap+0x49/0x80
[<0>] __x64_sys_ioctl+0x3e1/0xb50
[<0>] do_syscall_64+0x94/0x1a0
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0x76
I wrote an fstest to reproduce this deadlock without my replacement lock
and verified that the deadlock exists with our existing locking.
To fix this simply don't take the extent lock for the entire duration of
the fiemap. This is safe in general because we keep track of where we
are when we're searching the tree, so if an ordered extent updates in
the middle of our fiemap call we'll still emit the correct extents
because we know what offset we were on before.
The only place we maintain the lock is searching delalloc. Since the
delalloc stuff can change during writeback we want to lock the extent
range so we have a consistent view of delalloc at the time we're
checking to see if we need to set the delalloc flag.
With this patch applied we no longer deadlock with my testcase.
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmWYTmMACgkQxWXV+ddt
WDvPRg/+KgS5LV3nNC0MguYcTMQxmgeutIgXZIMfeA3v6EnFS7nj8leP4EPc6+bj
JPSkwj4u2vHVwpnTVuEAuJUXnmFY+Qu70nVy6bM2uOHOYTVBQ8zRVK4cErNNLWCp
OekDaADR53RrZ/xprlQ7b7Ph0Ch2uq9OrpH50IcyquEsH1ffkxlqwyrvth4/8dxC
6zgsFHWrbtVKJf0DYoQPpjEPz5tpdQ+xHZwtmf1cNlUgI1objODr/ZTqXtZqTfw4
/GwrtDPbEri53K/qjgr0dDH7pBVqD6PtnbgoHfYkiizZ0G7UkmlaK6rZIurtATJb
Yk/RCqCUp9tPC4yeFSewFMm1Y8Ae3rkUBG7rnYkvMmBspMqyh/kQAWSBimF5yk/y
vFEdFTe9AbdvP19Nw0CqovLzaO6RrOXCL1usnFvCmBgvF5gZAv63ZW1njP3ZoNta
wB8Rs6hxdRkph8Dk7yvYf54uUR+JyKqjHY6egg2qkKTjz0CSf6qQFyFZXpr81m97
gK4WN5SeP/P2ukRbBKKyzZ5IljUxZuVatvJa0tktd7kAbU26WLzofOJ7pX+iqimM
F2G7gKGJZykLY1WPntXBp9Dg97Ras2O5iViQ7ZKwRdOx1yZS5zzTYlIznHBAmXbL
UgXfVnpJH1xFdkvedNTn+Fz9BHNV1K2a2AT7VITj7sxz23z3aJA=
=4sw3
-----END PGP SIGNATURE-----
Merge tag 'for-6.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"There are no exciting changes for users, it's been mostly API
conversions and some fixes or refactoring.
The mount API conversion is a base for future improvements that would
come with VFS. Metadata processing has been converted to folios, not
yet enabling the large folios but it's one patch away once everything
gets tested enough.
Core changes:
- convert extent buffers to folios:
- direct API conversion where possible
- performance can drop by a few percent on metadata heavy
workloads, the folio sizes are not constant and the calculations
add up in the item helpers
- both regular and subpage modes
- data cannot be converted yet, we need to port that to iomap and
there are some other generic changes required
- convert mount to the new API, should not be user visible:
- options deprecated long time ago have been removed: inode_cache,
recovery
- the new logic that splits mount to two phases slightly changes
timing of device scanning for multi-device filesystems
- LSM options will now work (like for selinux)
- convert delayed nodes radix tree to xarray, preserving the
preload-like logic that still allows to allocate with GFP_NOFS
- more validation of sysfs value of scrub_speed_max
- refactor chunk map structure, reduce size and improve performance
- extent map refactoring, smaller data structures, improved
performance
- reduce size of struct extent_io_tree, embedded in several
structures
- temporary pages used for compression are cached and attached to a
shrinker, this may slightly improve performance
- in zoned mode, remove redirty extent buffer tracking, zeros are
written in case an out-of-order is detected and proper data are
written to the actual write pointer
- cleanups, refactoring, error message improvements, updated tests
- verify and update branch name or tag
- remove unwanted text"
* tag 'for-6.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (89 commits)
btrfs: pass btrfs_io_geometry into btrfs_max_io_len
btrfs: pass struct btrfs_io_geometry to set_io_stripe
btrfs: open code set_io_stripe for RAID56
btrfs: change block mapping to switch/case in btrfs_map_block
btrfs: factor out block mapping for single profiles
btrfs: factor out block mapping for RAID5/6
btrfs: reduce scope of data_stripes in btrfs_map_block
btrfs: factor out block mapping for RAID10
btrfs: factor out block mapping for DUP profiles
btrfs: factor out RAID1 block mapping
btrfs: factor out block-mapping for RAID0
btrfs: re-introduce struct btrfs_io_geometry
btrfs: factor out helper for single device IO check
btrfs: migrate btrfs_repair_io_failure() to folio interfaces
btrfs: migrate eb_bitmap_offset() to folio interfaces
btrfs: migrate various end io functions to folios
btrfs: migrate subpage code to folio interfaces
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
btrfs: don't double put our subpage reference in alloc_extent_buffer
btrfs: cleanup metadata page pointer usage
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZZUxRQAKCRCRxhvAZXjc
ov/QAQDzvge3oQ9MEymmOiyzzcF+HhAXBr+9oEsYJjFc1p0TsgEA61gXjZo7F1jY
KBqd6znOZCR+Waj0kIVJRAo/ISRBqQc=
=0bRl
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.8.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"This contains the usual miscellaneous features, cleanups, and fixes
for vfs and individual fses.
Features:
- Add Jan Kara as VFS reviewer
- Show correct device and inode numbers in proc/<pid>/maps for vma
files on stacked filesystems. This is now easily doable thanks to
the backing file work from the last cycles. This comes with
selftests
Cleanups:
- Remove a redundant might_sleep() from wait_on_inode()
- Initialize pointer with NULL, not 0
- Clarify comment on access_override_creds()
- Rework and simplify eventfd_signal() and eventfd_signal_mask()
helpers
- Process aio completions in batches to avoid needless wakeups
- Completely decouple struct mnt_idmap from namespaces. We now only
keep the actual idmapping around and don't stash references to
namespaces
- Reformat maintainer entries to indicate that a given subsystem
belongs to fs/
- Simplify fput() for files that were never opened
- Get rid of various pointless file helpers
- Rename various file helpers
- Rename struct file members after SLAB_TYPESAFE_BY_RCU switch from
last cycle
- Make relatime_need_update() return bool
- Use GFP_KERNEL instead of GFP_USER when allocating superblocks
- Replace deprecated ida_simple_*() calls with their current ida_*()
counterparts
Fixes:
- Fix comments on user namespace id mapping helpers. They aren't
kernel doc comments so they shouldn't be using /**
- s/Retuns/Returns/g in various places
- Add missing parameter documentation on can_move_mount_beneath()
- Rename i_mapping->private_data to i_mapping->i_private_data
- Fix a false-positive lockdep warning in pipe_write() for watch
queues
- Improve __fget_files_rcu() code generation to improve performance
- Only notify writer that pipe resizing has finished after setting
pipe->max_usage otherwise writers are never notified that the pipe
has been resized and hang
- Fix some kernel docs in hfsplus
- s/passs/pass/g in various places
- Fix kernel docs in ntfs
- Fix kcalloc() arguments order reported by gcc 14
- Fix uninitialized value in reiserfs"
* tag 'vfs-6.8.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (36 commits)
reiserfs: fix uninit-value in comp_keys
watch_queue: fix kcalloc() arguments order
ntfs: dir.c: fix kernel-doc function parameter warnings
fs: fix doc comment typo fs tree wide
selftests/overlayfs: verify device and inode numbers in /proc/pid/maps
fs/proc: show correct device and inode numbers in /proc/pid/maps
eventfd: Remove usage of the deprecated ida_simple_xx() API
fs: super: use GFP_KERNEL instead of GFP_USER for super block allocation
fs/hfsplus: wrapper.c: fix kernel-doc warnings
fs: add Jan Kara as reviewer
fs/inode: Make relatime_need_update return bool
pipe: wakeup wr_wait after setting max_usage
file: remove __receive_fd()
file: stop exposing receive_fd_user()
fs: replace f_rcuhead with f_task_work
file: remove pointless wrapper
file: s/close_fd_get_file()/file_close_fd()/g
Improve __fget_files_rcu() code generation (and thus __fget_light())
file: massage cleanup of files that failed to open
fs/pipe: Fix lockdep false-positive in watchqueue pipe_write()
...
[BUG]
Test case btrfs/002 would fail if larger folios are enabled for
metadata:
assertion failed: folio, in fs/btrfs/extent_io.c:4358
------------[ cut here ]------------
kernel BUG at fs/btrfs/extent_io.c:4358!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 1 PID: 30916 Comm: fsstress Tainted: G OE 6.7.0-rc3-custom+ #128
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
RIP: 0010:assert_eb_folio_uptodate+0x98/0xe0 [btrfs]
Call Trace:
<TASK>
extent_buffer_test_bit+0x3c/0x70 [btrfs]
free_space_test_bit+0xcd/0x140 [btrfs]
modify_free_space_bitmap+0x27a/0x430 [btrfs]
add_to_free_space_tree+0x8d/0x160 [btrfs]
__btrfs_free_extent.isra.0+0xef1/0x13c0 [btrfs]
__btrfs_run_delayed_refs+0x786/0x13c0 [btrfs]
btrfs_run_delayed_refs+0x33/0x120 [btrfs]
btrfs_commit_transaction+0xa2/0x1350 [btrfs]
iterate_supers+0x77/0xe0
ksys_sync+0x60/0xa0
__do_sys_sync+0xa/0x20
do_syscall_64+0x3f/0xf0
entry_SYSCALL_64_after_hwframe+0x6e/0x76
</TASK>
[CAUSE]
The function extent_buffer_test_bit() is not folio compatible.
It still assumes the old fixed page size, when an extent buffer with
large folio passed in, only eb->folios[0] is populated.
Then if the target bit range falls in the 2nd page of the folio, then we
would check eb->folios[1], and trigger the ASSERT().
[FIX]
Just migrate eb_bitmap_offset() to folio interfaces, using the
folio_size() to replace PAGE_SIZE.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we still go the old page based iterator functions, like
bio_for_each_segment_all(), we can hit middle pages of a folio (compound
page).
In that case if we set any page flag on those middle pages, we can
easily trigger VM_BUG_ON(), as for compound page flags, they should
follow their flag policies (normally only set on leading or tail pages).
To avoid such problem in the future full folio migration, here we do:
- Change from bio_for_each_segment_all() to bio_for_each_folio_all()
This completely removes the ability to access the middle page.
- Add extra ASSERT()s for data read/write paths
To ensure we only get single paged folio for data now.
- Rename those end io functions to follow a certain schema
* end_bbio_compressed_read()
* end_bbio_compressed_write()
These two endio functions don't set any page flags, as they use pages
not mapped to any address space.
They can be very good candidates for higher order folio testing.
And they are shared between compression and encoded IO.
* end_bbio_data_read()
* end_bbio_data_write()
* end_bbio_meta_read()
* end_bbio_meta_write()
The old function names are not unified:
- end_bio_extent_writepage()
- end_bio_extent_readpage()
- extent_buffer_write_end_io()
- extent_buffer_read_end_io()
They share no schema on where the "end_*io" string should be, nor can
be confusing just using "extent_buffer" and "extent" to distinguish
data and metadata paths.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Although subpage itself is conflicting with higher folio, since subpage
(sectorsize < PAGE_SIZE and nodesize < PAGE_SIZE) means we will never
need higher order folio, there is a hidden pitfall:
- btrfs_page_*() helpers
Those helpers are an abstraction to handle both subpage and non-subpage
cases, which means we're going to pass pages pointers to those helpers.
And since those helpers are shared between data and metadata paths, it's
unavoidable to let them to handle folios, including higher order
folios).
Meanwhile for true subpage case, we should only have a single page
backed folios anyway, thus add a new ASSERT() for btrfs_subpage_assert()
to ensure that.
Also since those helpers are shared between both data and metadata, add
some extra ASSERT()s for data path to make sure we only get single page
backed folio for now.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These two functions are still using the old page based code, which is
not going to handle larger folios at all.
The migration itself is going to involve the following changes:
- PAGE_SIZE -> folio_size()
- PAGE_SHIFT -> folio_shift()
- get_eb_page_index() -> get_eb_folio_index()
- get_eb_offset_in_page() -> get_eb_offset_in_folio()
And since we're going to support larger folios, although above straight
conversion is good enough, this patch would add extra comments in the
involved functions to explain why the same single line code can now
cover 3 cases:
- folio_size == PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The common, non-subpage case with per-page folio.
- folio_size > PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The incoming larger folio, non-subpage case.
- folio_size == PAGE_SIZE, sectorsize < PAGE_SIZE, nodesize < PAGE_SIZE
The existing subpage case, we won't larger folio anyway.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This fixes as case in "btrfs: refactor alloc_extent_buffer() to
allocate-then-attach method".
We have been seeing panics in the CI for the subpage stuff recently, it
happens on btrfs/187 but could potentially happen anywhere.
In the subpage case, if we race with somebody else inserting the same
extent buffer, the error case will end up calling
detach_extent_buffer_page() on the page twice.
This is done first in the bit
for (int i = 0; i < attached; i++)
detach_extent_buffer_page(eb, eb->pages[i];
and then again in btrfs_release_extent_buffer().
This works fine for !subpage because we're the only person who ever has
ourselves on the private, and so when we do the initial
detach_extent_buffer_page() we know we've completely removed it.
However for subpage we could be using this page private elsewhere, so
this results in a double put on the subpage, which can result in an
early freeing.
The fix here is to clear eb->pages[i] for everything we detach. Then
anything still attached to the eb is freed in
btrfs_release_extent_buffer().
Because of this change we must update
btrfs_release_extent_buffer_pages() to not use num_extent_folios,
because it assumes eb->folio[0] is set properly. Since this is only
interested in freeing any pages we have on the extent buffer we can
simply use INLINE_EXTENT_BUFFER_PAGES.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Although we have migrated extent_buffer::pages[] to folios[], we're
still mostly using the folio_page() help to grab the page.
This patch would do the following cleanups for metadata:
- Introduce num_extent_folios() helper
This is to replace most num_extent_pages() callers.
- Use num_extent_folios() to iterate future large folios
This allows us to use things like
bio_add_folio()/bio_add_folio_nofail(), and only set the needed flags
for the folio (aka the leading/tailing page), which reduces the loop
iteration to 1 for large folios.
- Change metadata related functions to use folio pointers
Including their function name, involving:
* attach_extent_buffer_page()
* detach_extent_buffer_page()
* page_range_has_eb()
* btrfs_release_extent_buffer_pages()
* btree_clear_page_dirty()
* btrfs_page_inc_eb_refs()
* btrfs_page_dec_eb_refs()
- Change btrfs_is_subpage() to accept an address_space pointer
This is to allow both page->mapping and folio->mapping to be utilized.
As data is still using the old per-page code, and may keep so for a
while.
- Special corner case place holder for future order mismatches between
extent buffer and inode filemap
For now it's just a block of comments and a dead ASSERT(), no real
handling yet.
The subpage code would still go page, just because subpage and large
folio are conflicting conditions, thus we don't need to bother subpage
with higher order folios at all. Just folio_page(folio, 0) would be
enough.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor styling tweaks ]
Signed-off-by: David Sterba <dsterba@suse.com>
For now extent_buffer::pages[] are still only accepting single page
pointer, thus we can migrate to folios pretty easily.
As for single page, page and folio are 1:1 mapped, including their page
flags.
This patch would just do the conversion from struct page to struct
folio, providing the first step to higher order folio in the future.
This conversion is pretty simple:
- extent_buffer::pages[] -> extent_buffer::folios[]
- page_address(eb->pages[i]) -> folio_address(eb->pages[i])
- eb->pages[i] -> folio_page(eb->folios[i], 0)
There would be more specific cleanups preparing for the incoming higher
order folio support.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently alloc_extent_buffer() utilizes find_or_create_page() to
allocate one page a time for an extent buffer.
This method has the following disadvantages:
- find_or_create_page() is the legacy way of allocating new pages
With the new folio infrastructure, find_or_create_page() is just
redirected to filemap_get_folio().
- Lacks the way to support higher order (order >= 1) folios
As we can not yet let filemap give us a higher order folio.
This patch would change the workflow by the following way:
Old | new
-----------------------------------+-------------------------------------
| ret = btrfs_alloc_page_array();
for (i = 0; i < num_pages; i++) { | for (i = 0; i < num_pages; i++) {
p = find_or_create_page(); | ret = filemap_add_folio();
/* Attach page private */ | /* Reuse page cache if needed */
/* Reused eb if needed */ |
| /* Attach page private and
| reuse eb if needed */
| }
By this we split the page allocation and private attaching into two
parts, allowing future updates to each part more easily, and migrate to
folio interfaces (especially for possible higher order folios).
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, in struct extent_map, we use an unsigned int (32 bits) to
identify the compression type of an extent and an unsigned long (64 bits
on a 64 bits platform, 32 bits otherwise) for flags. We are only using
6 different flags, so an unsigned long is excessive and we can use flags
to identify the compression type instead of using a dedicated 32 bits
field.
We can easily have tens or hundreds of thousands (or more) of extent maps
on busy and large filesystems, specially with compression enabled or many
or large files with tons of small extents. So it's convenient to have the
extent_map structure as small as possible in order to use less memory.
So remove the compression type field from struct extent_map, use flags
to identify the compression type and shorten the flags field from an
unsigned long to a u32. This saves 8 bytes (on 64 bits platforms) and
reduces the size of the structure from 136 bytes down to 128 bytes, using
now only two cache lines, and increases the number of extent maps we can
have per 4K page from 30 to 32. By using a u32 for the flags instead of
an unsigned long, we no longer use test_bit(), set_bit() and clear_bit(),
but that level of atomicity is not needed as most flags are never cleared
once set (before adding an extent map to the tree), and the ones that can
be cleared or set after an extent map is added to the tree, are always
performed while holding the write lock on the extent map tree, while the
reader holds a lock on the tree or tests for a flag that never changes
once the extent map is in the tree (such as compression flags).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs extent buffer helpers are doing all the cross-page
handling, as there is no guarantee that all those eb pages are
contiguous.
However on systems with enough memory, there is a very high chance the
page cache for btree_inode are allocated with physically contiguous
pages.
In that case, we can skip all the complex cross-page handling, thus
speeding up the code.
This patch adds a new member, extent_buffer::addr, which is only set to
non-NULL if all the extent buffer pages are physically contiguous.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use memset_page() in memset_extent_buffer() instead of opencoding it.
This does not not change any functionality.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
One a zoned filesystem, never clear the dirty flag of an extent buffer,
but instead mark it as zeroout.
On writeout, when encountering a marked extent_buffer, zero it out.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
EXTENT_BUFFER_ZONED_ZEROOUT better describes the state of the extent buffer,
namely it is written as all zeros. This is needed in zoned mode, to
preserve I/O ordering.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As a cleanup and preparation for future folio migration, this patch
would replace all page->private to folio version. This includes:
- PagePrivate()
-> folio_test_private()
- page->private
-> folio_get_private()
- attach_page_private()
-> folio_attach_private()
- detach_page_private()
-> folio_detach_private()
Since we're here, also remove the forced cast on page->private, since
it's (void *) already, we don't really need to do the cast.
For now even if we missed some call sites, it won't cause any problem
yet, as we're only using order 0 folio (single page), thus all those
folio/page flags should be synced.
But for the future conversion to utilize higher order folio, the page
<-> folio flag sync is no longer guaranteed, thus we have to migrate to
utilize folio flags.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The EXTENT_QGROUP_RESERVED bit is used to "lock" regions of the file for
duplicate reservations. That is two writes to that range in one
transaction shouldn't create two reservations, as the reservation will
only be freed once when the write finally goes down. Therefore, it is
never OK to clear that bit without freeing the associated qgroup
reserve. At this point, we don't want to be freeing the reserve, so mask
off the bit.
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
If btrfs_alloc_page_array() fail to allocate all pages but part of the
slots, then the partially allocated pages would be leaked in function
btrfs_submit_compressed_read().
[CAUSE]
As explicitly stated, if btrfs_alloc_page_array() returned -ENOMEM,
caller is responsible to free the partially allocated pages.
For the existing call sites, most of them are fine:
- btrfs_raid_bio::stripe_pages
Handled by free_raid_bio().
- extent_buffer::pages[]
Handled btrfs_release_extent_buffer_pages().
- scrub_stripe::pages[]
Handled by release_scrub_stripe().
But there is one exception in btrfs_submit_compressed_read(), if
btrfs_alloc_page_array() failed, we didn't cleanup the array and freed
the array pointer directly.
Initially there is still the error handling in commit dd137dd1f2
("btrfs: factor out allocating an array of pages"), but later in commit
544fe4a903 ("btrfs: embed a btrfs_bio into struct compressed_bio"),
the error handling is removed, leading to the possible memory leak.
[FIX]
This patch would add back the error handling first, then to prevent such
situation from happening again, also
Make btrfs_alloc_page_array() to free the allocated pages as a extra
safety net, then we don't need to add the error handling to
btrfs_submit_compressed_read().
Fixes: 544fe4a903 ("btrfs: embed a btrfs_bio into struct compressed_bio")
CC: stable@vger.kernel.org # 6.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It is hard to find where mapping->private_lock, mapping->private_list and
mapping->private_data are used, due to private_XXX being a relatively
common name for variables and structure members in the kernel. To fit
with other members of struct address_space, rename them all to have an
i_ prefix. Tested with an allmodconfig build.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20231117215823.2821906-1-willy@infradead.org
Acked-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
The semantics of test_range_bit() with filled == 0 is now in it's own
helper so test_range_bit will check the whole range unconditionally.
The detection logic is flipped and assumes success by default and
catches exceptions.
Signed-off-by: David Sterba <dsterba@suse.com>
The existing helper test_range_bit works in two ways, checks if the whole
range contains all the bits, or stop on the first occurrence. By adding
a specific helper for the latter case, the inner loop can be simplified
and contains fewer conditionals, making it a bit faster.
There's no caller that uses the cached state pointer so this reduces the
argument count further.
Signed-off-by: David Sterba <dsterba@suse.com>
A long time ago, we had some metadata chunks which started at sector
boundary but not aligned to nodesize boundary.
This led to some older filesystems which can have tree blocks only
aligned to sectorsize, but not nodesize.
Later 'btrfs check' gained the ability to detect and warn about such tree
blocks, and kernel fixed the chunk allocation behavior, nowadays those
tree blocks should be pretty rare.
But in the future, if we want to migrate metadata to folio, we cannot
have such tree blocks, as filemap_add_folio() requires the page index to
be aligned with the folio number of pages. Such unaligned tree blocks
can lead to VM_BUG_ON().
So this patch adds extra warning for those unaligned tree blocks, as a
preparation for the future folio migration.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfsic_mount() is part of the deprecated check-integrity
functionality.
Now let's remove the main entry point of check-integrity, and thankfully
most of the check-integrity code is self-contained inside
check-integrity.c, we can safely remove the function without huge
changes to btrfs code base.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Function name in the comment does not bring much value to code not
exposed as API and we don't stick to the kdoc format anymore. Update
formatting of parameter descriptions.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit f98b6215d7 ("btrfs: extent_io: do extra check for extent buffer
read write functions") changed how we handle invalid extent buffer range
for read_extent_buffer().
Previously if the range is invalid we just set the destination to zero,
but after the patch we do nothing and error out.
This can lead to smatch static checker errors like:
fs/btrfs/print-tree.c:186 print_uuid_item() error: uninitialized symbol 'subvol_id'.
fs/btrfs/tests/extent-io-tests.c:338 check_eb_bitmap() error: uninitialized symbol 'has'.
fs/btrfs/tests/extent-io-tests.c:353 check_eb_bitmap() error: uninitialized symbol 'has'.
fs/btrfs/uuid-tree.c:203 btrfs_uuid_tree_remove() error: uninitialized symbol 'read_subid'.
fs/btrfs/uuid-tree.c:353 btrfs_uuid_tree_iterate() error: uninitialized symbol 'subid_le'.
fs/btrfs/uuid-tree.c:72 btrfs_uuid_tree_lookup() error: uninitialized symbol 'data'.
fs/btrfs/volumes.c:7415 btrfs_dev_stats_value() error: uninitialized symbol 'val'.
Fix those warnings by reverting back to the old memset() behavior.
By this we keep the static checker happy and would still make a lot of
noise when such invalid ranges are passed in.
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Fixes: f98b6215d7 ("btrfs: extent_io: do extra check for extent buffer read write functions")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>