There are no more users of btrfs_clone_chunk_map(), the last one (and
only one ever) was removed in commit 1ec17ef591 ("btrfs: zoned: fix
use-after-free in do_zone_finish()"). So remove btrfs_clone_chunk_map().
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At warn_about_uncommitted_trans(), there's no need to check if the list
is empty and return, because list_for_each_entry_safe() is safe to call
for an empty list, it simply does nothing. So remove the check.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_finish_one_ordered() it's pointless to assign 0 to the 'ret'
variable because if it has a non-zero value (error), we have already
jumped to the 'out' label. So remove that redundant assignment.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The mod_start and mod_len fields of struct extent_map were introduced by
commit 4e2f84e63d ("Btrfs: improve fsync by filtering extents that we
want") in order to avoid too low performance when fsyncing a file that
keeps getting extent maps merge, because it resulted in each fsync logging
again csum ranges that were already merged before.
We don't need this anymore as extent maps in the list of modified extents
are never merged with other extent maps and once we log an extent map we
remove it from the list of modified extent maps, so it's never logged
twice.
So remove the mod_start and mod_len fields from struct extent_map and use
instead the start and len fields when logging checksums in the fast fsync
path. This also makes EXTENT_FLAG_FILLING unused so remove it as well.
Running the reproducer from the commit mentioned before, with a larger
number of extents and against a null block device, so that IO is fast
and we can better see any impact from searching checksums items and
logging them, gave the following results from dd:
Before this change:
409600000 bytes (410 MB, 391 MiB) copied, 22.948 s, 17.8 MB/s
After this change:
409600000 bytes (410 MB, 391 MiB) copied, 22.9997 s, 17.8 MB/s
So no changes in throughput.
The test was done in a release kernel (non-debug, Debian's default kernel
config) and its steps are the following:
$ mkfs.btrfs -f /dev/nullb0
$ mount /dev/sdb /mnt
$ dd if=/dev/zero of=/mnt/foobar bs=4k count=100000 oflag=sync
$ umount /mnt
This also reduces the size of struct extent_map from 128 bytes down to 112
bytes, so now we can have 36 extents maps per 4K page instead of 32.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Some of the operations after the free might convert more PERTRANS
metadata. Do the freeing as late as possible to eliminate a source of
leaked PERTRANS metadata.
This helps with the pass rate of generic/269 and generic/475.
Reviewed-by: Qu Wenruo <qwu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
For both compression and decompression paths, we always require a
"struct page **pages" and "unsigned long nr_pages", this involves quite
some part of the btrfs compression paths:
- All the compression entry points
- compressed_bio structure
This affects both compression and decompression.
- async_extent structure
Unfortunately with all those involved parts, there is no good way to
split the conversion into smaller patches while still passing compiling.
So do this in one big conversion in one go.
Please note this is direct page->folio conversion, no change on the page
sized folio requirement yet.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor style fixups ]
Signed-off-by: David Sterba <dsterba@suse.com>
The new helper will do the same thing as btrfs_alloc_page_array(), but
with folios.
One extra difference is, there is no extra helper for bulk allocation,
thus it may not be as efficient as the page version.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since insert_inline_extent() now only accepts a single page, it's much
easier to convert it to use folio interfaces.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since our inline extent cannot accept anything larger than a sector,
there is really no need to pass all the compressed pages to
insert_inline_extent().
And just in case, expand the ASSERT()s to make sure we only try inline
with compressed size no larger than sectorsize.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we have two wrappers to allocate and free a page for
compression usage:
- btrfs_alloc_compr_page()
- btrfs_free_compr_page()
The allocator would try to grab a page from the pool, and only allocate
a new page if the pool is empty.
The reclaimer would check if the pool is full, and if not full it would
put the page into the pool.
This patch converts both helpers to use folio interfaces, and allowing
further conversion of compression path to folios.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For all the supported compression algorithms, the compression path would
always need to grab the page cache, then do the compression.
Normally we would get a page reference without any problem, since the
write path should have already locked the pages in the write range.
For the sake of error handling, we should handle the page cache miss
case.
Adds a common wrapper, btrfs_compress_find_get_page(), which calls
find_get_page(), and do the error handling along with an error message.
Callers inside compression path would only need to call
btrfs_compress_find_get_page(), and error out if it returned any error.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nowadays before starting a reflink operation we do this:
1) Take the VFS lock of the inodes in exclusive mode (a rw semaphore);
2) Take the mmap lock of the inodes (struct btrfs_inode::i_mmap_lock);
3) Flush all delalloc in the source and target ranges;
4) Wait for all ordered extents in the source and target ranges to
complete;
5) Lock the source and destination ranges in the inodes' io trees.
In step 5 we lock the source range because:
1) We needed to serialize against mmap writes, but that is not needed
anymore because nowadays we do that through the inode's i_mmap_lock
(step 2). This happens since commit 8c99516a8c ("btrfs: exclude mmaps
while doing remap");
2) To serialize against a concurrent relocation and avoid generating
a delayed ref for an extent that was just dropped by relocation, see
commit d8b5524242 ("Btrfs: fix race between reflink/dedupe and
relocation").
Locking the source range however blocks any concurrent reads for that
range and makes test case generic/733 fail.
So instead of locking the source range during reflinks, make relocation
read lock the inode's i_mmap_lock, so that it serializes with a concurrent
reflink while still able to run concurrently with mmap writes and allow
concurrent reads too.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This check "if (inherit->num_qgroups > PAGE_SIZE)" is confusing and
unnecessary.
The problem with the check is that static checkers flag it as a
potential mixup of between units of bytes vs number of elements.
Fortunately, the check can safely be deleted because the next check is
correct and applies an even stricter limit:
if (size != struct_size(inherit, qgroups, inherit->num_qgroups))
return -EINVAL;
The "inherit" struct ends in a variable array of __u64 and
"inherit->num_qgroups" is the number of elements in the array. At the
start of the function we check that:
if (size < sizeof(*inherit) || size > PAGE_SIZE)
return -EINVAL;
Thus, since we verify that the whole struct fits within one page, that
means that the number of elements in the inherit->qgroups[] array must
be less than PAGE_SIZE.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use folio instead of page in put_file_data(). Add a warning in case
higher order folio is found, this will be implemented in the future.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Convert page references to folios and call the respective folio
functions. Since find_or_create_page() takes a mask argument, call
__filemap_get_folio() instead of filemap_grab_folio().
The patch assumes folio size is PAGE_SIZE, add a warning in case it's a
higher order that will be implemented in the future.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Convert usage of page to folio in prealloc_file_extent_cluster()
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Unify naming of return value to the preferred way.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Unify naming of return value to the preferred way.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Unify naming of return value to the preferred way.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Unify naming of return value to the preferred way.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Unify naming of return value to the preferred way.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Unify naming of return value to the preferred way.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Unify naming of return value to the preferred way.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Unify naming of return value to the preferred way.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Unify naming of return value to the preferred way.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Unify naming of return value to the preferred way.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Unify naming of return value to the preferred way.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We recently tracked down a race condition that triggered a read for an
extent buffer with EXTENT_BUFFER_UPTODATE already set. While this read
was in progress, other concurrent readers would see the UPTODATE bit and
return early as if the read was already complete, making accesses to the
extent buffer conflict with the read operation that was overwriting it.
Add a WARN_ON() to end_bbio_meta_read() for this situation to make
similar races easier to spot in the future.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Tavian Barnes <tavianator@tavianator.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are clearing the bit and waking up any waiters in two different
places. Factor that code out into a static helper function.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Tavian Barnes <tavianator@tavianator.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When unlocking a write lock on a drew lock, at btrfs_drew_write_unlock(),
it's pointless to wake up tasks waiting to acquire a read lock if we
didn't decrement the 'writers' counter down to 0, since a read lock can
only be acquired when the counter reaches a value of 0. Doing so is
harmless from a functional point of view, but it's not efficient due to
unnecessarily waking up tasks just for them to sleep again on the
waitqueue.
So change this to wake up readers only if we decremented the 'writers'
counter to 0.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no point in having a static writepages callback in inode.c that
does nothing besides calling extent_writepages from extent_io.c.
So just remove the callback at inode.c and rename extent_writepages()
to btrfs_writepages().
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no point in having a static readahead callback in inode.c that
does nothing besides calling extent_readahead() from extent_io.c.
So just remove the callback at inode.c and rename extent_readahead()
to btrfs_readahead().
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The __btrfs_tree_lock() and __btrfs_tree_read_lock() are using a naming
with a double underscore prefix, which is specially not proper for
exported functions. Remove the double underscore prefix from their name
and add the "_nested" suffix.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The functions btrfs_tree_lock() and btrfs_tree_read_lock() are very
trivial so that can be made inline and avoid call overhead, as they
are very often called inside critical sections (when searching a btree
for example, attempting to lock a child node/leaf while holding a lock
on the parent).
So make them static inline, which even reduces the size of the btrfs
module a little bit.
Before this change:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1718786 156276 16920 1891982 1cde8e fs/btrfs/btrfs.ko
After this change:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1718650 156260 16920 1891830 1cddf6 fs/btrfs/btrfs.ko
Running fs_mark also showed a tiny improvement with this script:
$ cat test.sh
#!/bin/bash
DEV=/dev/nullb0
MNT=/mnt/nullb0
FILES=100000
THREADS=$(nproc --all)
echo "performance" | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
umount $DEV &> /dev/null
mkfs.btrfs -f $DEV
mount $DEV $MNT
OPTS="-S 0 -L 5 -n $FILES -s 0 -t $THREADS -k"
for ((i = 1; i <= $THREADS; i++)); do
OPTS="$OPTS -d $MNT/d$i"
done
fs_mark $OPTS
umount $MNT
Before this change:
FSUse% Count Size Files/sec App Overhead
10 1200000 0 180894.0 10705410
16 2400000 0 228211.4 10765738
23 3600000 0 215969.6 11011072
30 4800000 0 199077.1 11145587
46 6000000 0 176624.1 11658470
After this change:
FSUse% Count Size Files/sec App Overhead
10 1200000 0 185312.3 10708377
16 2400000 0 229320.4 10858013
23 3600000 0 217958.7 11006167
30 4800000 0 205122.9 11112899
46 6000000 0 178039.1 11438852
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When creating a snapshot we first check with btrfs_lookup_dir_item() if
there is a name collision in the parent directory and then return an error
if there's a collision. Then later on when trying to insert a dir item for
the snapshot we BUG_ON() if the return value is -EEXIST or -EOVERFLOW:
static noinline int create_pending_snapshot(...)
{
(...)
/* check if there is a file/dir which has the same name. */
dir_item = btrfs_lookup_dir_item(...);
(...)
ret = btrfs_insert_dir_item(...);
/* We have check then name at the beginning, so it is impossible. */
BUG_ON(ret == -EEXIST || ret == -EOVERFLOW);
if (ret) {
btrfs_abort_transaction(trans, ret);
goto fail;
}
(...)
}
It's impossible to get the -EEXIST because we previously checked for a
potential collision with btrfs_lookup_dir_item() and we know that after
that no one could have added a colliding name because at this point the
transaction is in its critical section, state TRANS_STATE_COMMIT_DOING,
so no one can join this transaction to add a colliding name and neither
can anyone start a new transaction to do that.
As for the -EOVERFLOW, that can't happen as long as we have the extended
references feature enabled, which is a mkfs default for many years now.
In either case, the BUG_ON() is excessive as we can properly deal with
any error and can abort the transaction and jump to the 'fail' label,
in which case we'll also get the useful stack trace (just like a BUG_ON())
from the abort if the error is either -EEXIST or -EOVERFLOW.
So remove the BUG_ON().
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmY5LaQACgkQxWXV+ddt
WDs7aQ/8DbxhYTNqHmEv6860w/o7Sb856foIqlZ81v1r55XYIFGhTbvQntjQgvcI
Kf8+Du6ijpYeAO2Iuj/7EQP/6yA+f5NCogpW8Nsr24riUCvzjhXr49KQbVg1SsdX
i8iec0UCQlxq7RncbpGiIwgxPRMJhUEG/wRHWnGR3jOJFXSvsLJpywZbn+Yw1d7w
kcUbHEzZPZqrWPAIcifpv/7qVCd1sPN8P3mMevcWtc1diEhQlHVVF7JnCcHxrwBP
dIsNSWyt0YmgIt231GW6GKDwuQHyv870yHK9gumvpePsfcZnDBgeMuMvv0TykhgJ
BHV2gwhIK11bNala1pw1F7CX4oiiHEeI/09/nh7xopcjnULMRFItGus2dkqDagSa
ex4g48J412crWayZ5uFqAVYeO9MNufvLvCutUj1sD/teh2ymMq82gHzQO0FTu5GL
NjWLoJXXyU18BgbXTmbm5rSMycDf1BG9Hv+MdxwEFrasF2q6Lhp+EIljUxN7+n49
i9GrLWptd8sBx/GtZXhsZlWP+vPSuHqdjZe61LD4B3IgBeGDJg6tJmHv8rEFO4Ws
9nkvaDVF03pHWxWOocDIzbrkpVwOLBaDHGwjH9Cn/lgIHL+zjXVpMaKz4/klpOr8
4/ehUajrOK6Wmyoi3fKYxZACnWK5HhFHYcB8zc1R8+zt+Pj/mbk=
=2no9
-----END PGP SIGNATURE-----
Merge tag 'for-6.9-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Two more fixes, both have some visible effects on user space:
- add check if quotas are enabled when passing qgroup inheritance
info, this affects snapper that could fail to create a snapshot
- do check for leaf/node flag WRITTEN earlier so that nodes are
completely validated before access, this used to be done by
integrity checker but it's been removed and left an unhandled case"
* tag 'for-6.9-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: make sure that WRITTEN is set on all metadata blocks
btrfs: qgroup: do not check qgroup inherit if qgroup is disabled
btrfs_get_bdev_and_sb() has two callers - btrfs_open_one_device(),
which asks for open to be exclusive and btrfs_get_dev_args_from_path(),
which doesn't. Currently it does set_blocksize() in all cases.
I'm rather dubious about the need to do set_blocksize() anywhere in btrfs,
to be honest - there's some access to page cache of underlying block
devices in there, but it's nowhere near the hot paths, AFAICT.
In any case, btrfs_get_dev_args_from_path() only needs to read
the on-disk superblock and copy several fields out of it; all
callers are only interested in devices that are already opened
and brought into per-filesystem set, so setting the block size
is redundant for those and actively harmful if we are given
a pathname of unrelated device.
So we only need btrfs_get_bdev_and_sb() to call set_blocksize()
when it's asked to open exclusive.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We previously would call btrfs_check_leaf() if we had the check
integrity code enabled, which meant that we could only run the extended
leaf checks if we had WRITTEN set on the header flags.
This leaves a gap in our checking, because we could end up with
corruption on disk where WRITTEN isn't set on the leaf, and then the
extended leaf checks don't get run which we rely on to validate all of
the item pointers to make sure we don't access memory outside of the
extent buffer.
However, since 732fab95ab ("btrfs: check-integrity: remove
CONFIG_BTRFS_FS_CHECK_INTEGRITY option") we no longer call
btrfs_check_leaf() from btrfs_mark_buffer_dirty(), which means we only
ever call it on blocks that are being written out, and thus have WRITTEN
set, or that are being read in, which should have WRITTEN set.
Add checks to make sure we have WRITTEN set appropriately, and then make
sure __btrfs_check_leaf() always does the item checking. This will
protect us from file systems that have been corrupted and no longer have
WRITTEN set on some of the blocks.
This was hit on a crafted image tweaking the WRITTEN bit and reported by
KASAN as out-of-bound access in the eb accessors. The example is a dir
item at the end of an eb.
[2.042] BTRFS warning (device loop1): bad eb member start: ptr 0x3fff start 30572544 member offset 16410 size 2
[2.040] general protection fault, probably for non-canonical address 0xe0009d1000000003: 0000 [#1] PREEMPT SMP KASAN NOPTI
[2.537] KASAN: maybe wild-memory-access in range [0x0005088000000018-0x000508800000001f]
[2.729] CPU: 0 PID: 2587 Comm: mount Not tainted 6.8.2 #1
[2.729] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[2.621] RIP: 0010:btrfs_get_16+0x34b/0x6d0
[2.621] RSP: 0018:ffff88810871fab8 EFLAGS: 00000206
[2.621] RAX: 0000a11000000003 RBX: ffff888104ff8720 RCX: ffff88811b2288c0
[2.621] RDX: dffffc0000000000 RSI: ffffffff81dd8aca RDI: ffff88810871f748
[2.621] RBP: 000000000000401a R08: 0000000000000001 R09: ffffed10210e3ee9
[2.621] R10: ffff88810871f74f R11: 205d323430333737 R12: 000000000000001a
[2.621] R13: 000508800000001a R14: 1ffff110210e3f5d R15: ffffffff850011e8
[2.621] FS: 00007f56ea275840(0000) GS:ffff88811b200000(0000) knlGS:0000000000000000
[2.621] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[2.621] CR2: 00007febd13b75c0 CR3: 000000010bb50000 CR4: 00000000000006f0
[2.621] Call Trace:
[2.621] <TASK>
[2.621] ? show_regs+0x74/0x80
[2.621] ? die_addr+0x46/0xc0
[2.621] ? exc_general_protection+0x161/0x2a0
[2.621] ? asm_exc_general_protection+0x26/0x30
[2.621] ? btrfs_get_16+0x33a/0x6d0
[2.621] ? btrfs_get_16+0x34b/0x6d0
[2.621] ? btrfs_get_16+0x33a/0x6d0
[2.621] ? __pfx_btrfs_get_16+0x10/0x10
[2.621] ? __pfx_mutex_unlock+0x10/0x10
[2.621] btrfs_match_dir_item_name+0x101/0x1a0
[2.621] btrfs_lookup_dir_item+0x1f3/0x280
[2.621] ? __pfx_btrfs_lookup_dir_item+0x10/0x10
[2.621] btrfs_get_tree+0xd25/0x1910
Reported-by: lei lu <llfamsec@gmail.com>
CC: stable@vger.kernel.org # 6.7+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ copy more details from report ]
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
After kernel commit 86211eea8a ("btrfs: qgroup: validate
btrfs_qgroup_inherit parameter"), user space tool snapper will fail to
create snapshot using its timeline feature.
[CAUSE]
It turns out that, if using timeline snapper would unconditionally pass
btrfs_qgroup_inherit parameter (assigning the new snapshot to qgroup 1/0)
for snapshot creation.
In that case, since qgroup is disabled there would be no qgroup 1/0, and
btrfs_qgroup_check_inherit() would return -ENOENT and fail the whole
snapshot creation.
[FIX]
Just skip the check if qgroup is not enabled.
This is to keep the older behavior for user space tools, as if the
kernel behavior changed for user space, it is a regression of kernel.
Thankfully snapper is also fixing the behavior by detecting if qgroup is
running in the first place, so the effect should not be that huge.
Link: https://github.com/openSUSE/snapper/issues/894
Fixes: 86211eea8a ("btrfs: qgroup: validate btrfs_qgroup_inherit parameter")
CC: stable@vger.kernel.org # 6.8+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmYzivoACgkQxWXV+ddt
WDu4TxAAgK+W1RSvrc2xe6MfHFMi2x2pL2qM0IEcYbmjNZJDQlmGYNj3jILho62/
/mHyA5skMr9hN58FFUJveiBj3qOds/lZD0640sGGpysFJKzA4/Wdg5xJvpsQtyDM
jr6BcgZOQ+j7Pqe7zsm/sc0n5yG4P+cydnlCFMNvpRfZjg1kYIV9F92qEPAHtLCx
BoDJyHhCEqFWWyH2nALu3syTHyvGECUCBEHLFgyGcG/IXT6Oq/BpsDZPm1j72NCt
9f58OY7/2R9QJYfCjYidFGnr3EYdI5CnCOtR2sQLcRUOISOOQSni52r5tonPdpm2
7QRPyuXTiVxpM909phGJt5wwyssK/JQgxUjUo3s0U04+qXb3cRoJny3vAcGcnuyk
W7lYh08QRQa3dzZ/Q+GFxqPPovdZalTHXYMAYP7QGwLuv+fZkqh39oz6LQfw7F7c
JxEjuSCSd8lJpFyIDkirZF9lELurjgt0Zn3RNe25BLiBpeqFvTQdAYGo5wML3Ug0
kHSmZVFC2En8Ad2AahpkGToVKGgUumo4RAZDiRGIUaHEoS7XfBbnPOAtC7Z1RKTS
9N++XVtJ1/uYQiLM5afiZRtUTkA/jqjSNH/v3YYTS18SczKEOWlHnpJeQSWK0rD1
rzbKZ+2MhBL5CGQnwkhUi0u07QorvMkQhWCHpf9au9rtUggg+nU=
=zEs6
-----END PGP SIGNATURE-----
Merge tag 'for-6.9-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- set correct ram_bytes when splitting ordered extent. This can be
inconsistent on-disk but harmless as it's not used for calculations
and it's only advisory for compression
- fix lockdep splat when taking cleaner mutex in qgroups disable ioctl
- fix missing mutex unlock on error path when looking up sys chunk for
relocation
* tag 'for-6.9-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: set correct ram_bytes when splitting ordered extent
btrfs: take the cleaner_mutex earlier in qgroup disable
btrfs: add missing mutex_unlock in btrfs_relocate_sys_chunks()
[BUG]
When running generic/287, the following file extent items can be
generated:
item 16 key (258 EXTENT_DATA 2682880) itemoff 15305 itemsize 53
generation 9 type 1 (regular)
extent data disk byte 1378414592 nr 462848
extent data offset 0 nr 462848 ram 2097152
extent compression 0 (none)
Note that file extent item is not a compressed one, but its ram_bytes is
way larger than its disk_num_bytes.
According to btrfs on-disk scheme, ram_bytes should match disk_num_bytes
if it's not a compressed one.
[CAUSE]
Since commit b73a6fd1b1 ("btrfs: split partial dio bios before
submit"), for partial dio writes, we would split the ordered extent.
However the function btrfs_split_ordered_extent() doesn't update the
ram_bytes even it has already shrunk the disk_num_bytes.
Originally the function btrfs_split_ordered_extent() is only introduced
for zoned devices in commit d22002fd37 ("btrfs: zoned: split ordered
extent when bio is sent"), but later commit b73a6fd1b1 ("btrfs: split
partial dio bios before submit") makes non-zoned btrfs affected.
Thankfully for un-compressed file extent, we do not really utilize the
ram_bytes member, thus it won't cause any real problem.
[FIX]
Also update btrfs_ordered_extent::ram_bytes inside
btrfs_split_ordered_extent().
Fixes: d22002fd37 ("btrfs: zoned: split ordered extent when bio is sent")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
One of my CI runs popped the following lockdep splat
======================================================
WARNING: possible circular locking dependency detected
6.9.0-rc4+ #1 Not tainted
------------------------------------------------------
btrfs/471533 is trying to acquire lock:
ffff92ba46980850 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_quota_disable+0x54/0x4c0
but task is already holding lock:
ffff92ba46980bd0 (&fs_info->subvol_sem){++++}-{3:3}, at: btrfs_ioctl+0x1c8f/0x2600
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (&fs_info->subvol_sem){++++}-{3:3}:
down_read+0x42/0x170
btrfs_rename+0x607/0xb00
btrfs_rename2+0x2e/0x70
vfs_rename+0xaf8/0xfc0
do_renameat2+0x586/0x600
__x64_sys_rename+0x43/0x50
do_syscall_64+0x95/0x180
entry_SYSCALL_64_after_hwframe+0x76/0x7e
-> #1 (&sb->s_type->i_mutex_key#16){++++}-{3:3}:
down_write+0x3f/0xc0
btrfs_inode_lock+0x40/0x70
prealloc_file_extent_cluster+0x1b0/0x370
relocate_file_extent_cluster+0xb2/0x720
relocate_data_extent+0x107/0x160
relocate_block_group+0x442/0x550
btrfs_relocate_block_group+0x2cb/0x4b0
btrfs_relocate_chunk+0x50/0x1b0
btrfs_balance+0x92f/0x13d0
btrfs_ioctl+0x1abf/0x2600
__x64_sys_ioctl+0x97/0xd0
do_syscall_64+0x95/0x180
entry_SYSCALL_64_after_hwframe+0x76/0x7e
-> #0 (&fs_info->cleaner_mutex){+.+.}-{3:3}:
__lock_acquire+0x13e7/0x2180
lock_acquire+0xcb/0x2e0
__mutex_lock+0xbe/0xc00
btrfs_quota_disable+0x54/0x4c0
btrfs_ioctl+0x206b/0x2600
__x64_sys_ioctl+0x97/0xd0
do_syscall_64+0x95/0x180
entry_SYSCALL_64_after_hwframe+0x76/0x7e
other info that might help us debug this:
Chain exists of:
&fs_info->cleaner_mutex --> &sb->s_type->i_mutex_key#16 --> &fs_info->subvol_sem
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&fs_info->subvol_sem);
lock(&sb->s_type->i_mutex_key#16);
lock(&fs_info->subvol_sem);
lock(&fs_info->cleaner_mutex);
*** DEADLOCK ***
2 locks held by btrfs/471533:
#0: ffff92ba4319e420 (sb_writers#14){.+.+}-{0:0}, at: btrfs_ioctl+0x3b5/0x2600
#1: ffff92ba46980bd0 (&fs_info->subvol_sem){++++}-{3:3}, at: btrfs_ioctl+0x1c8f/0x2600
stack backtrace:
CPU: 1 PID: 471533 Comm: btrfs Kdump: loaded Not tainted 6.9.0-rc4+ #1
Call Trace:
<TASK>
dump_stack_lvl+0x77/0xb0
check_noncircular+0x148/0x160
? lock_acquire+0xcb/0x2e0
__lock_acquire+0x13e7/0x2180
lock_acquire+0xcb/0x2e0
? btrfs_quota_disable+0x54/0x4c0
? lock_is_held_type+0x9a/0x110
__mutex_lock+0xbe/0xc00
? btrfs_quota_disable+0x54/0x4c0
? srso_return_thunk+0x5/0x5f
? lock_acquire+0xcb/0x2e0
? btrfs_quota_disable+0x54/0x4c0
? btrfs_quota_disable+0x54/0x4c0
btrfs_quota_disable+0x54/0x4c0
btrfs_ioctl+0x206b/0x2600
? srso_return_thunk+0x5/0x5f
? __do_sys_statfs+0x61/0x70
__x64_sys_ioctl+0x97/0xd0
do_syscall_64+0x95/0x180
? srso_return_thunk+0x5/0x5f
? reacquire_held_locks+0xd1/0x1f0
? do_user_addr_fault+0x307/0x8a0
? srso_return_thunk+0x5/0x5f
? lock_acquire+0xcb/0x2e0
? srso_return_thunk+0x5/0x5f
? srso_return_thunk+0x5/0x5f
? find_held_lock+0x2b/0x80
? srso_return_thunk+0x5/0x5f
? lock_release+0xca/0x2a0
? srso_return_thunk+0x5/0x5f
? do_user_addr_fault+0x35c/0x8a0
? srso_return_thunk+0x5/0x5f
? trace_hardirqs_off+0x4b/0xc0
? srso_return_thunk+0x5/0x5f
? lockdep_hardirqs_on_prepare+0xde/0x190
? srso_return_thunk+0x5/0x5f
This happens because when we call rename we already have the inode mutex
held, and then we acquire the subvol_sem if we are a subvolume. This
makes the dependency
inode lock -> subvol sem
When we're running data relocation we will preallocate space for the
data relocation inode, and we always run the relocation under the
->cleaner_mutex. This now creates the dependency of
cleaner_mutex -> inode lock (from the prealloc) -> subvol_sem
Qgroup delete is doing this in the opposite order, it is acquiring the
subvol_sem and then it is acquiring the cleaner_mutex, which results in
this lockdep splat. This deadlock can't happen in reality, because we
won't ever rename the data reloc inode, nor is the data reloc inode a
subvolume.
However this is fairly easy to fix, simply take the cleaner mutex in the
case where we are disabling qgroups before we take the subvol_sem. This
resolves the lockdep splat.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The previous patch that replaced BUG_ON by error handling forgot to
unlock the mutex in the error path.
Link: https://lore.kernel.org/all/Zh%2fHpAGFqa7YAFuM@duo.ucw.cz
Reported-by: Pavel Machek <pavel@denx.de>
Fixes: 7411055db5 ("btrfs: handle chunk tree lookup error in btrfs_relocate_sys_chunks()")
CC: stable@vger.kernel.org
Reviewed-by: Pavel Machek <pavel@denx.de>
Signed-off-by: Dominique Martinet <dominique.martinet@atmark-techno.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmYnyRwACgkQxWXV+ddt
WDtnlA/9GWPYrQFBBzPZeXZHldr7grn8t8oDVVvMhSxRslrk2XYGqRVEVfZT5+Wp
wsbqdfmDXShstWkU43jxXcJAg4QxQNBBSVaKVMO5rXkM5ZHLdn78EJs5htSuy+67
n0zFqqxVr0F9LvrHqs/JJp70fr3WQtGINAkxda0JHKaMEj2nSnGjzKf6GAPomAs+
7BWQlV4cc8tQAox2MxCFx1eXTISepa9pi0ojm0R+siZGgMkmzpjJTy9WZ3EtIWQN
4LU5FMCQMsKkqUETxsEs5Va0QkEvN3SuiNsoUIJZFSArws3cwYz+u10+8Rpc/U1o
8P3H8fGxvbYpRt6QjG1GGYy0LxhyYeyGp+fNfeBF4pl2MAn2e8qeiPpPlC+Q6/U2
/Nn8+x9/FgWQKNmu76DQ1BM4WoD18mEUQrB6OYLO/9FBttLAevaEO3vxECxosBIj
wGyfXJ/r4Y2Vva+pkjreBpc7m/VwwOPGdRHkKk8mzFGqoQzSwvs/pm3ldF9dV7ud
smZ0H8vvaEDigOd4oFR2vC2wpETaCL89oS9x/NzMvqlqQaGJoD6t591c3yRwmvro
hJYT5lG6KR+ZeDgv+2ZZA2s5/2l7193pkS3toj7v1xSOkpjADPaTBg+2P5o27YyJ
UvXbEpdcCV6xWxT3Ak/bkfS1dDGfxAtwV7c/sPGvY5KtmbvjOBE=
=riVJ
-----END PGP SIGNATURE-----
Merge tag 'for-6.9-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix information leak by the buffer returned from LOGICAL_INO ioctl
- fix flipped condition in scrub when tracking sectors in zoned mode
- fix calculation when dropping extent range
- reinstate fallback to write uncompressed data in case of fragmented
space that could not store the entire compressed chunk
- minor fix to message formatting style to make it conforming to the
commonly used style
* tag 'for-6.9-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix wrong block_start calculation for btrfs_drop_extent_map_range()
btrfs: fix information leak in btrfs_ioctl_logical_to_ino()
btrfs: fallback if compressed IO fails for ENOSPC
btrfs: scrub: run relocation repair when/only needed
btrfs: remove colon from messages with state
[BUG]
During my extent_map cleanup/refactor, with extra sanity checks,
extent-map-tests::test_case_7() would not pass the checks.
The problem is, after btrfs_drop_extent_map_range(), the resulted
extent_map has a @block_start way too large.
Meanwhile my btrfs_file_extent_item based members are returning a
correct @disk_bytenr/@offset combination.
The extent map layout looks like this:
0 16K 32K 48K
| PINNED | | Regular |
The regular em at [32K, 48K) also has 32K @block_start.
Then drop range [0, 36K), which should shrink the regular one to be
[36K, 48K).
However the @block_start is incorrect, we expect 32K + 4K, but got 52K.
[CAUSE]
Inside btrfs_drop_extent_map_range() function, if we hit an extent_map
that covers the target range but is still beyond it, we need to split
that extent map into half:
|<-- drop range -->|
|<----- existing extent_map --->|
And if the extent map is not compressed, we need to forward
extent_map::block_start by the difference between the end of drop range
and the extent map start.
However in that particular case, the difference is calculated using
(start + len - em->start).
The problem is @start can be modified if the drop range covers any
pinned extent.
This leads to wrong calculation, and would be caught by my later
extent_map sanity checks, which checks the em::block_start against
btrfs_file_extent_item::disk_bytenr + btrfs_file_extent_item::offset.
This is a regression caused by commit c962098ca4 ("btrfs: fix
incorrect splitting in btrfs_drop_extent_map_range"), which removed the
@len update for pinned extents.
[FIX]
Fix it by avoiding using @start completely, and use @end - em->start
instead, which @end is exclusive bytenr number.
And update the test case to verify the @block_start to prevent such
problem from happening.
Thankfully this is not going to lead to any data corruption, as IO path
does not utilize btrfs_drop_extent_map_range() with @skip_pinned set.
So this fix is only here for the sake of consistency/correctness.
CC: stable@vger.kernel.org # 6.5+
Fixes: c962098ca4 ("btrfs: fix incorrect splitting in btrfs_drop_extent_map_range")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Syzbot reported the following information leak for in
btrfs_ioctl_logical_to_ino():
BUG: KMSAN: kernel-infoleak in instrument_copy_to_user include/linux/instrumented.h:114 [inline]
BUG: KMSAN: kernel-infoleak in _copy_to_user+0xbc/0x110 lib/usercopy.c:40
instrument_copy_to_user include/linux/instrumented.h:114 [inline]
_copy_to_user+0xbc/0x110 lib/usercopy.c:40
copy_to_user include/linux/uaccess.h:191 [inline]
btrfs_ioctl_logical_to_ino+0x440/0x750 fs/btrfs/ioctl.c:3499
btrfs_ioctl+0x714/0x1260
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:904 [inline]
__se_sys_ioctl+0x261/0x450 fs/ioctl.c:890
__x64_sys_ioctl+0x96/0xe0 fs/ioctl.c:890
x64_sys_call+0x1883/0x3b50 arch/x86/include/generated/asm/syscalls_64.h:17
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xcf/0x1e0 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Uninit was created at:
__kmalloc_large_node+0x231/0x370 mm/slub.c:3921
__do_kmalloc_node mm/slub.c:3954 [inline]
__kmalloc_node+0xb07/0x1060 mm/slub.c:3973
kmalloc_node include/linux/slab.h:648 [inline]
kvmalloc_node+0xc0/0x2d0 mm/util.c:634
kvmalloc include/linux/slab.h:766 [inline]
init_data_container+0x49/0x1e0 fs/btrfs/backref.c:2779
btrfs_ioctl_logical_to_ino+0x17c/0x750 fs/btrfs/ioctl.c:3480
btrfs_ioctl+0x714/0x1260
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:904 [inline]
__se_sys_ioctl+0x261/0x450 fs/ioctl.c:890
__x64_sys_ioctl+0x96/0xe0 fs/ioctl.c:890
x64_sys_call+0x1883/0x3b50 arch/x86/include/generated/asm/syscalls_64.h:17
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xcf/0x1e0 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Bytes 40-65535 of 65536 are uninitialized
Memory access of size 65536 starts at ffff888045a40000
This happens, because we're copying a 'struct btrfs_data_container' back
to user-space. This btrfs_data_container is allocated in
'init_data_container()' via kvmalloc(), which does not zero-fill the
memory.
Fix this by using kvzalloc() which zeroes out the memory on allocation.
CC: stable@vger.kernel.org # 4.14+
Reported-by: <syzbot+510a1abbb8116eeb341d@syzkaller.appspotmail.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <Johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmYgXDMACgkQxWXV+ddt
WDsPpg//RzpLGyfFVFx+AdqIPScBvDSr6RIQAug++4OmDbIRMxzOpxKOAWThhivf
78KIms2fj9R/zLJEdUGCLQTcy8a1eWBnoeSzXoeTta2pip5cKrc9v3hJId53l0F6
BfltbVjpAKt6XHqeI0V2myrL/KHx5bApz5oNn/oEQCwiA2HBkasrYTRLEA7xMem2
hRUIXrTuIdwiyWugi84xjp9D0BxEdbTBfH6SR6RG4ESy+73gdEt4BAeDI6DzWN+D
eKUv/CthhrP7xuO8Aq9XGkwznP7lIeIwBCiV5XURLR0HztFm64vXgbPQHhwqvI43
5uhA7wifc/VE8nOysubfET6MwVEeyOptW6+25ih/9Da9VLxRK1y/Hm94JW8t6Sxi
VPgT5gz4YuE5/QaojETDLYgkkjKj7Lpe/Bs225J3QBCHu3fs/tp9kHKbUNJrcAeM
b56tiRMccLVpeoslbK4ahvQqCH4/LKBMdAqfWK5/p24JkYT/ubVP3CdLS2MOeRpV
UqDpQExuWsVJZKH8znSXXrHf2ZMYHmlA/1gRqdEmcvPF8A2vCc9aMMZHTP7v57EC
/80NJv9HQuxcUFQCl0h4WBlB+gGQtAszz+0q1X9aedauC6Hd/7LeICLCPRczJC3g
rD3J+EXiTg2MxqZWyXJXQ1Q9cQWNkQjG6o/rEhl5r5c3OGWgssk=
=ZKAP
-----END PGP SIGNATURE-----
Merge tag 'for-6.9-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fixup in zoned mode for out-of-order writes of metadata that are no
longer necessary, this used to be tracked in a separate list but now
the old locaion needs to be zeroed out, also add assertions
- fix bulk page allocation retry, this may stall after first failure
for compression read/write
* tag 'for-6.9-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: do not wait for short bulk allocation
btrfs: zoned: add ASSERT and WARN for EXTENT_BUFFER_ZONED_ZEROOUT handling
btrfs: zoned: do not flag ZEROOUT on non-dirty extent buffer
In commit b4ccace878 ("btrfs: refactor submit_compressed_extents()"), if
an async extent compressed but failed to find enough space, we changed
from falling back to an uncompressed write to just failing the write
altogether. The principle was that if there's not enough space to write
the compressed version of the data, there can't possibly be enough space
to write the larger, uncompressed version of the data.
However, this isn't necessarily true: due to fragmentation, there could
be enough discontiguous free blocks to write the uncompressed version,
but not enough contiguous free blocks to write the smaller but
unsplittable compressed version.
This has occurred to an internal workload which relied on write()'s
return value indicating there was space. While rare, it has happened a
few times.
Thus, in order to prevent early ENOSPC, re-add a fallback to
uncompressed writing.
Fixes: b4ccace878 ("btrfs: refactor submit_compressed_extents()")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Co-developed-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When btrfs scrub finds an error, it reads mirrors to find correct data. If
all the errors are fixed, sctx->error_bitmap is cleared for the stripe
range. However, in the zoned mode, it runs relocation to repair scrub
errors when the bitmap is *not* empty, which is a flipped condition.
Also, it runs the relocation even if the scrub is read-only. This was
missed by a fix in commit 1f2030ff6e ("btrfs: scrub: respect the
read-only flag during repair").
The repair is only necessary when there is a repaired sector and should be
done on read-write scrub. So, tweak the condition for both regular and
zoned case.
Fixes: 54765392a1 ("btrfs: scrub: introduce helper to queue a stripe for scrub")
Fixes: 1f2030ff6e ("btrfs: scrub: respect the read-only flag during repair")
CC: stable@vger.kernel.org # 6.6+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The message format in syslog is usually made of two parts:
prefix ":" message
Various tools parse the prefix up to the first ":". When there's
an additional status of a btrfs filesystem like
[5.199782] BTRFS info (device nvme1n1p1: state M): use zstd compression, level 9
where 'M' is for remount, there's one more ":" that does not conform to
the format. Remove it.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
There is a recent report that when memory pressure is high (including
cached pages), btrfs can spend most of its time on memory allocation in
btrfs_alloc_page_array() for compressed read/write.
[CAUSE]
For btrfs_alloc_page_array() we always go alloc_pages_bulk_array(), and
even if the bulk allocation failed (fell back to single page
allocation) we still retry but with extra memalloc_retry_wait().
If the bulk alloc only returned one page a time, we would spend a lot of
time on the retry wait.
The behavior was introduced in commit 395cb57e85 ("btrfs: wait between
incomplete batch memory allocations").
[FIX]
Although the commit mentioned that other filesystems do the wait, it's
not the case at least nowadays.
All the mainlined filesystems only call memalloc_retry_wait() if they
failed to allocate any page (not only for bulk allocation).
If there is any progress, they won't call memalloc_retry_wait() at all.
For example, xfs_buf_alloc_pages() would only call memalloc_retry_wait()
if there is no allocation progress at all, and the call is not for
metadata readahead.
So I don't believe we should call memalloc_retry_wait() unconditionally
for short allocation.
Call memalloc_retry_wait() if it fails to allocate any page for tree
block allocation (which goes with __GFP_NOFAIL and may not need the
special handling anyway), and reduce the latency for
btrfs_alloc_page_array().
Reported-by: Julian Taylor <julian.taylor@1und1.de>
Tested-by: Julian Taylor <julian.taylor@1und1.de>
Link: https://lore.kernel.org/all/8966c095-cbe7-4d22-9784-a647d1bf27c3@1und1.de/
Fixes: 395cb57e85 ("btrfs: wait between incomplete batch memory allocations")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add an ASSERT to catch a faulty delayed reference item resulting from
prematurely cleared extent buffer.
Also, add a WARN to detect if we try to dirty a ZEROOUT buffer again, which
is suspicious as its update will be lost.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Btrfs clears the content of an extent buffer marked as
EXTENT_BUFFER_ZONED_ZEROOUT before the bio submission. This mechanism is
introduced to prevent a write hole of an extent buffer, which is once
allocated, marked dirty, but turns out unnecessary and cleaned up within
one transaction operation.
Currently, btrfs_clear_buffer_dirty() marks the extent buffer as
EXTENT_BUFFER_ZONED_ZEROOUT, and skips the entry function. If this call
happens while the buffer is under IO (with the WRITEBACK flag set,
without the DIRTY flag), we can add the ZEROOUT flag and clear the
buffer's content just before a bio submission. As a result:
1) it can lead to adding faulty delayed reference item which leads to a
FS corrupted (EUCLEAN) error, and
2) it writes out cleared tree node on disk
The former issue is previously discussed in [1]. The corruption happens
when it runs a delayed reference update. So, on-disk data is safe.
[1] https://lore.kernel.org/linux-btrfs/3f4f2a0ff1a6c818050434288925bdcf3cd719e5.1709124777.git.naohiro.aota@wdc.com/
The latter one can reach on-disk data. But, as that node is already
processed by btrfs_clear_buffer_dirty(), that will be invalidated in the
next transaction commit anyway. So, the chance of hitting the corruption
is relatively small.
Anyway, we should skip flagging ZEROOUT on a non-DIRTY extent buffer, to
keep the content under IO intact.
Fixes: aa6313e6ff ("btrfs: zoned: don't clear dirty flag of extent buffer")
CC: stable@vger.kernel.org # 6.8
Link: https://lore.kernel.org/linux-btrfs/oadvdekkturysgfgi4qzuemd57zudeasynswurjxw3ocdfsef6@sjyufeugh63f/
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmYQIdIACgkQxWXV+ddt
WDvjmw/+KahIHfFt17cM5uZpiETcL9v44uT0Y69r0bMpw8Vy/cmE+rmGfyERr8YN
v68U/hpWHD2mYhxL01EHut2X/MRA4zmAcWUKVu1vk0d/9Vp/01wPJfKyvX6q388/
dFtPtzqXxj0uIwO5lRIk+dJuvShtfCps2rx/zcBUoaQYljIDNfhrWscfV4nIzqlR
BF7GX3b22rlw8q1dXAXWW+zTk3tey8Jxj+jmShyoPxcGMDK4jmNyaFu1WSIFfSdc
ns5Kii7/4tIBqpqPCr/FMGXQjdEZGw9ZTiAO4nUjtyoCTO3l/jMVYoo7llJR9dtv
Fgtej0MLlAapX2mJ65xOBO6OvCIM8VwrY+DfIDeWxtDONmrGxBUIMTJIjSq3oGEi
Mh0CbnpISGj9zQlR4raOavtgxmbdXnhdvLcp2Uv+VcJnEyCtHMmVLx9yNMKqjHje
oJHtuJiEeqlB66xZEYx3qA8SIdaJGhB/HluU9Vyg67AJTJUcCzuxZlqaC+oSOxfj
GYgY66BHD+ZKRKUFw7EylohnhvsMcmFhMSeBLzMuSaqEig4dmv4cFenad06up6c+
c0obH8oKsaA05gS3sMshmkNtBm8ms1OP2rWebjQWmmXhCOWLPqcGs5AxYeqvRdzx
eqFNKhRw+JH1mFmhEtY/Y+4OX6eTlluSxoKxZYWfAX1xvlr94U4=
=XtPw
-----END PGP SIGNATURE-----
Merge tag 'for-6.9-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Several fixes to qgroups that have been recently identified by test
generic/475:
- fix prealloc reserve leak in subvolume operations
- various other fixes in reservation setup, conversion or cleanup"
* tag 'for-6.9-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: always clear PERTRANS metadata during commit
btrfs: make btrfs_clear_delalloc_extent() free delalloc reserve
btrfs: qgroup: convert PREALLOC to PERTRANS after record_root_in_trans
btrfs: record delayed inode root in transaction
btrfs: qgroup: fix qgroup prealloc rsv leak in subvolume operations
btrfs: qgroup: correctly model root qgroup rsv in convert
There's a bunch of flags that are purely based on what the file
operations support while also never being conditionally set or unset.
IOW, they're not subject to change for individual files. Imho, such
flags don't need to live in f_mode they might as well live in the fops
structs itself. And the fops struct already has that lonely
mmap_supported_flags member. We might as well turn that into a generic
fop_flags member and move a few flags from FMODE_* space into FOP_*
space. That gets us four FMODE_* bits back and the ability for new
static flags that are about file ops to not have to live in FMODE_*
space but in their own FOP_* space. It's not the most beautiful thing
ever but it gets the job done. Yes, there'll be an additional pointer
chase but hopefully that won't matter for these flags.
I suspect there's a few more we can move into there and that we can also
redirect a bunch of new flag suggestions that follow this pattern into
the fop_flags field instead of f_mode.
Link: https://lore.kernel.org/r/20240328-gewendet-spargel-aa60a030ef74@brauner
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
It is possible to clear a root's IN_TRANS tag from the radix tree, but
not clear its PERTRANS, if there is some error in between. Eliminate
that possibility by moving the free up to where we clear the tag.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, this call site in btrfs_clear_delalloc_extent() only converts
the reservation. We are marking it not delalloc, so I don't think it
makes sense to keep the rsv around. This is a path where we are not
sure to join a transaction, so it leads to incorrect free-ing during
umount.
Helps with the pass rate of generic/269 and generic/475.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
The transaction is only able to free PERTRANS reservations for a root
once that root has been recorded with the TRANS tag on the roots radix
tree. Therefore, until we are sure that this root will get tagged, it
isn't safe to convert. Generally, this is not an issue as *some*
transaction will likely tag the root before long and this reservation
will get freed in that transaction, but technically it could stick
around until unmount and result in a warning about leaked metadata
reservation space.
This path is most exercised by running the generic/269 fstest with
CONFIG_BTRFS_DEBUG.
Fixes: a649684967 ("btrfs: fix start transaction qgroup rsv double free")
CC: stable@vger.kernel.org # 6.6+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
When running delayed inode updates, we do not record the inode's root in
the transaction, but we do allocate PREALLOC and thus converted PERTRANS
space for it. To be sure we free that PERTRANS meta rsv, we must ensure
that we record the root in the transaction.
Fixes: 4f5427ccce ("btrfs: delayed-inode: Use new qgroup meta rsv for delayed inode and item")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Create subvolume, create snapshot and delete subvolume all use
btrfs_subvolume_reserve_metadata() to reserve metadata for the changes
done to the parent subvolume's fs tree, which cannot be mediated in the
normal way via start_transaction. When quota groups (squota or qgroups)
are enabled, this reserves qgroup metadata of type PREALLOC. Once the
operation is associated to a transaction, we convert PREALLOC to
PERTRANS, which gets cleared in bulk at the end of the transaction.
However, the error paths of these three operations were not implementing
this lifecycle correctly. They unconditionally converted the PREALLOC to
PERTRANS in a generic cleanup step regardless of errors or whether the
operation was fully associated to a transaction or not. This resulted in
error paths occasionally converting this rsv to PERTRANS without calling
record_root_in_trans successfully, which meant that unless that root got
recorded in the transaction by some other thread, the end of the
transaction would not free that root's PERTRANS, leaking it. Ultimately,
this resulted in hitting a WARN in CONFIG_BTRFS_DEBUG builds at unmount
for the leaked reservation.
The fix is to ensure that every qgroup PREALLOC reservation observes the
following properties:
1. any failure before record_root_in_trans is called successfully
results in freeing the PREALLOC reservation.
2. after record_root_in_trans, we convert to PERTRANS, and now the
transaction owns freeing the reservation.
This patch enforces those properties on the three operations. Without
it, generic/269 with squotas enabled at mkfs time would fail in ~5-10
runs on my system. With this patch, it ran successfully 1000 times in a
row.
Fixes: e85fde5162 ("btrfs: qgroup: fix qgroup meta rsv leak for subvolume operations")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
We use add_root_meta_rsv and sub_root_meta_rsv to track prealloc and
pertrans reservations for subvolumes when quotas are enabled. The
convert function does not properly increment pertrans after decrementing
prealloc, so the count is not accurate.
Note: we check that the fs is not read-only to mirror the logic in
qgroup_convert_meta, which checks that before adding to the pertrans rsv.
Fixes: 8287475a20 ("btrfs: qgroup: Use root::qgroup_meta_rsv_* to record qgroup meta reserved space")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Use bio_list_merge_init instead of open coding bio_list_merge and
bio_list_init.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240328084147.2954434-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bitmap_set_bits() does not start with the FS' prefix and may collide
with a new generic helper one day. It operates with the FS-specific
types, so there's no change those two could do the same thing.
Just add the prefix to exclude such possible conflict.
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmYEfJYACgkQxWXV+ddt
WDucIg/+IupuqdLKnj6bepxX/VufnFAjecD3sRgZQQLIMfm3MQX3TzbNoPYEiAGU
tNG6jxYgkGRoyhN3aIQnsJmRFje5epYjNA5+ueUNT2/KfyKonnS2TIKQt6u7XBls
fl4SCTSNRX7w/QUNUWwyY5/86yzV4F8w19X5nVOKcp7Nz3hUBdeDZWAmMlYyHuFW
N2YRyNdCxB4Y0U9g1vgI63wFjOac0F+7RTHGsDH7ueOZ2dbtDM38lHBoCbdX3jmy
5nG7wVJZp2H/zCmzrVQJ897CMfr3h9r9Kxx8EE3JDJaJ5sMMaRh361rgsZTaGsjz
SwUzT6Z0u0hsBANSTOUZixhfX5sqArmemG+XpFu6Rq+732DqS+c4vWRSu7c8Rc8i
+4HIQNsjJqm/d1u2IyxXfuqSbaULLnyYQ8rdEx3o2AM37JnuTvOWoB+v/JqPb9TI
aG+bOPvg7GM9Sl3IoM5sR+j3bEebranZbUF+UiDEujZJiY+uiw3vMbFvyOBRWaUU
ODTpNoyCmz94mWg79hyosOjM9A/NCEkRH4oSc+YeqOvzTIBG3V+D3HxN/DX4FVTy
VDxMdptu0aPIEkUQ3nvsj4t3OKgj1w9rxZFpRYH33zJVvRqZ8VrgtC9V6zgPv3h7
suQL4s4i4EiIgAk2Z0OR23wDwg1TwXVWLGErfQVHslvhl/a2Qb4=
=Lhhp
-----END PGP SIGNATURE-----
Merge tag 'for-6.9-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix race when reading extent buffer and 'uptodate' status is missed
by one thread (introduced in 6.5)
- do additional validation of devices using major:minor numbers
- zoned mode fixes:
- use zone-aware super block access during scrub
- fix use-after-free during device replace (found by KASAN)
- also delete zones that are 100% unusable to reclaim space
- extent unpinning fixes:
- fix extent map leak after error handling
- print correct range in error message
- error code and message updates
* tag 'for-6.9-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix race in read_extent_buffer_pages()
btrfs: return accurate error code on open failure in open_fs_devices()
btrfs: zoned: don't skip block groups with 100% zone unusable
btrfs: use btrfs_warn() to log message at btrfs_add_extent_mapping()
btrfs: fix message not properly printing interval when adding extent map
btrfs: fix warning messages not printing interval at unpin_extent_range()
btrfs: fix extent map leak in unexpected scenario at unpin_extent_cache()
btrfs: validate device maj:min during open
btrfs: zoned: fix use-after-free in do_zone_finish()
btrfs: zoned: use zone aware sb location for scrub
There are reports from tree-checker that detects corrupted nodes,
without any obvious pattern so possibly an overwrite in memory.
After some debugging it turns out there's a race when reading an extent
buffer the uptodate status can be missed.
To prevent concurrent reads for the same extent buffer,
read_extent_buffer_pages() performs these checks:
/* (1) */
if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
return 0;
/* (2) */
if (test_and_set_bit(EXTENT_BUFFER_READING, &eb->bflags))
goto done;
At this point, it seems safe to start the actual read operation. Once
that completes, end_bbio_meta_read() does
/* (3) */
set_extent_buffer_uptodate(eb);
/* (4) */
clear_bit(EXTENT_BUFFER_READING, &eb->bflags);
Normally, this is enough to ensure only one read happens, and all other
callers wait for it to finish before returning. Unfortunately, there is
a racey interleaving:
Thread A | Thread B | Thread C
---------+----------+---------
(1) | |
| (1) |
(2) | |
(3) | |
(4) | |
| (2) |
| | (1)
When this happens, thread B kicks of an unnecessary read. Worse, thread
C will see UPTODATE set and return immediately, while the read from
thread B is still in progress. This race could result in tree-checker
errors like this as the extent buffer is concurrently modified:
BTRFS critical (device dm-0): corrupted node, root=256
block=8550954455682405139 owner mismatch, have 11858205567642294356
expect [256, 18446744073709551360]
Fix it by testing UPTODATE again after setting the READING bit, and if
it's been set, skip the unnecessary read.
Fixes: d7172f52e9 ("btrfs: use per-buffer locking for extent_buffer reading")
Link: https://lore.kernel.org/linux-btrfs/CAHk-=whNdMaN9ntZ47XRKP6DBes2E5w7fi-0U3H2+PS18p+Pzw@mail.gmail.com/
Link: https://lore.kernel.org/linux-btrfs/f51a6d5d7432455a6a858d51b49ecac183e0bbc9.1706312914.git.wqu@suse.com/
Link: https://lore.kernel.org/linux-btrfs/c7241ea4-fcc6-48d2-98c8-b5ea790d6c89@gmx.com/
CC: stable@vger.kernel.org # 6.5+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Tavian Barnes <tavianator@tavianator.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor update of changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
When attempting to exclusive open a device which has no exclusive open
permission, such as a physical device associated with the flakey dm
device, the open operation will fail, resulting in a mount failure.
In this particular scenario, we erroneously return -EINVAL instead of the
correct error code provided by the bdev_open_by_path() function, which is
-EBUSY.
Fix this, by returning error code from the bdev_open_by_path() function.
With this correction, the mount error message will align with that of
ext4 and xfs.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit f4a9f21941 ("btrfs: do not delete unused block group if it may be
used soon") changed the behaviour of deleting unused block-groups on zoned
filesystems. Starting with this commit, we're using
btrfs_space_info_used() to calculate the number of used bytes in a
space_info. But btrfs_space_info_used() also accounts
btrfs_space_info::bytes_zone_unusable as used bytes.
So if a block group is 100% zone_unusable it is skipped from the deletion
step.
In order not to skip fully zone_unusable block-groups, also check if the
block-group has bytes left that can be used on a zoned filesystem.
Fixes: f4a9f21941 ("btrfs: do not delete unused block group if it may be used soon")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_add_extent_mapping(), if we failed to merge the extent map, which
is unexpected and theoretically should never happen, we use WARN_ONCE() to
log a message which is not great because we don't get information about
which filesystem it relates to in case we have multiple btrfs filesystems
mounted. So change this to use btrfs_warn() and surround the error check
with WARN_ON() so we always get a useful stack trace and the condition is
flagged as "unlikely" since it's not expected to ever happen.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_add_extent_mapping(), if we are unable to merge the existing
extent map, we print a warning message that suggests interval ranges in
the form "[X, Y)", where the first element is the inclusive start offset
of a range and the second element is the exclusive end offset. However
we end up printing the length of the ranges instead of the exclusive end
offsets. So fix this by printing the range end offsets.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At unpin_extent_range() we print warning messages that are supposed to
print an interval in the form "[X, Y)", with the first element being an
inclusive start offset and the second element being the exclusive end
offset of a range. However we end up printing the range's length instead
of the range's exclusive end offset, so fix that to avoid having confusing
and non-sense messages in case we hit one of these unexpected scenarios.
Fixes: 00deaf04df ("btrfs: log messages at unpin_extent_range() during unexpected cases")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At unpin_extent_cache() if we happen to find an extent map with an
unexpected start offset, we jump to the 'out' label and never release the
reference we added to the extent map through the call to
lookup_extent_mapping(), therefore resulting in a leak. So fix this by
moving the free_extent_map() under the 'out' label.
Fixes: c03c89f821 ("btrfs: handle errors returned from unpin_extent_cache()")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Boris managed to create a device capable of changing its maj:min without
altering its device path.
Only multi-devices can be scanned. A device that gets scanned and remains
in the btrfs kernel cache might end up with an incorrect maj:min.
Despite the temp-fsid feature patch did not introduce this bug, it could
lead to issues if the above multi-device is converted to a single device
with a stale maj:min. Subsequently, attempting to mount the same device
with the correct maj:min might mistake it for another device with the same
fsid, potentially resulting in wrongly auto-enabling the temp-fsid feature.
To address this, this patch validates the device's maj:min at the time of
device open and updates it if it has changed since the last scan.
CC: stable@vger.kernel.org # 6.7+
Fixes: a5b8a5f9f8 ("btrfs: support cloned-device mount capability")
Reported-by: Boris Burkov <boris@bur.io>
Co-developed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Boris Burkov <boris@bur.io>#
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Shinichiro reported the following use-after-free triggered by the device
replace operation in fstests btrfs/070.
BTRFS info (device nullb1): scrub: finished on devid 1 with status: 0
==================================================================
BUG: KASAN: slab-use-after-free in do_zone_finish+0x91a/0xb90 [btrfs]
Read of size 8 at addr ffff8881543c8060 by task btrfs-cleaner/3494007
CPU: 0 PID: 3494007 Comm: btrfs-cleaner Tainted: G W 6.8.0-rc5-kts #1
Hardware name: Supermicro Super Server/X11SPi-TF, BIOS 3.3 02/21/2020
Call Trace:
<TASK>
dump_stack_lvl+0x5b/0x90
print_report+0xcf/0x670
? __virt_addr_valid+0x200/0x3e0
kasan_report+0xd8/0x110
? do_zone_finish+0x91a/0xb90 [btrfs]
? do_zone_finish+0x91a/0xb90 [btrfs]
do_zone_finish+0x91a/0xb90 [btrfs]
btrfs_delete_unused_bgs+0x5e1/0x1750 [btrfs]
? __pfx_btrfs_delete_unused_bgs+0x10/0x10 [btrfs]
? btrfs_put_root+0x2d/0x220 [btrfs]
? btrfs_clean_one_deleted_snapshot+0x299/0x430 [btrfs]
cleaner_kthread+0x21e/0x380 [btrfs]
? __pfx_cleaner_kthread+0x10/0x10 [btrfs]
kthread+0x2e3/0x3c0
? __pfx_kthread+0x10/0x10
ret_from_fork+0x31/0x70
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1b/0x30
</TASK>
Allocated by task 3493983:
kasan_save_stack+0x33/0x60
kasan_save_track+0x14/0x30
__kasan_kmalloc+0xaa/0xb0
btrfs_alloc_device+0xb3/0x4e0 [btrfs]
device_list_add.constprop.0+0x993/0x1630 [btrfs]
btrfs_scan_one_device+0x219/0x3d0 [btrfs]
btrfs_control_ioctl+0x26e/0x310 [btrfs]
__x64_sys_ioctl+0x134/0x1b0
do_syscall_64+0x99/0x190
entry_SYSCALL_64_after_hwframe+0x6e/0x76
Freed by task 3494056:
kasan_save_stack+0x33/0x60
kasan_save_track+0x14/0x30
kasan_save_free_info+0x3f/0x60
poison_slab_object+0x102/0x170
__kasan_slab_free+0x32/0x70
kfree+0x11b/0x320
btrfs_rm_dev_replace_free_srcdev+0xca/0x280 [btrfs]
btrfs_dev_replace_finishing+0xd7e/0x14f0 [btrfs]
btrfs_dev_replace_by_ioctl+0x1286/0x25a0 [btrfs]
btrfs_ioctl+0xb27/0x57d0 [btrfs]
__x64_sys_ioctl+0x134/0x1b0
do_syscall_64+0x99/0x190
entry_SYSCALL_64_after_hwframe+0x6e/0x76
The buggy address belongs to the object at ffff8881543c8000
which belongs to the cache kmalloc-1k of size 1024
The buggy address is located 96 bytes inside of
freed 1024-byte region [ffff8881543c8000, ffff8881543c8400)
The buggy address belongs to the physical page:
page:00000000fe2c1285 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1543c8
head:00000000fe2c1285 order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0
flags: 0x17ffffc0000840(slab|head|node=0|zone=2|lastcpupid=0x1fffff)
page_type: 0xffffffff()
raw: 0017ffffc0000840 ffff888100042dc0 ffffea0019e8f200 dead000000000002
raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff8881543c7f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff8881543c7f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff8881543c8000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff8881543c8080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8881543c8100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
This UAF happens because we're accessing stale zone information of a
already removed btrfs_device in do_zone_finish().
The sequence of events is as follows:
btrfs_dev_replace_start
btrfs_scrub_dev
btrfs_dev_replace_finishing
btrfs_dev_replace_update_device_in_mapping_tree <-- devices replaced
btrfs_rm_dev_replace_free_srcdev
btrfs_free_device <-- device freed
cleaner_kthread
btrfs_delete_unused_bgs
btrfs_zone_finish
do_zone_finish <-- refers the freed device
The reason for this is that we're using a cached pointer to the chunk_map
from the block group, but on device replace this cached pointer can
contain stale device entries.
The staleness comes from the fact, that btrfs_block_group::physical_map is
not a pointer to a btrfs_chunk_map but a memory copy of it.
Also take the fs_info::dev_replace::rwsem to prevent
btrfs_dev_replace_update_device_in_mapping_tree() from changing the device
underneath us again.
Note: btrfs_dev_replace_update_device_in_mapping_tree() is holding
fs_info::mapping_tree_lock, but as this is a spinning read/write lock we
cannot take it as the call to blkdev_zone_mgmt() requires a memory
allocation which may not sleep.
But btrfs_dev_replace_update_device_in_mapping_tree() is always called with
the fs_info::dev_replace::rwsem held in write mode.
Many thanks to Shinichiro for analyzing the bug.
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
CC: stable@vger.kernel.org # 6.8
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a new statx field for (sub)volume identifiers, as implemented by
btrfs and bcachefs.
This includes bcachefs support; we'll definitely want btrfs support as
well.
Link: https://lore.kernel.org/linux-fsdevel/2uvhm6gweyl7iyyp2xpfryvcu2g3padagaeqcbiavjyiis6prl@yjm725bizncq/
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Link: https://lore.kernel.org/r/20240308022914.196982-1-kent.overstreet@linux.dev
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
There are reports that since version 6.7 update-grub fails to find the
device of the root on systems without initrd and on a single device.
This looks like the device name changed in the output of
/proc/self/mountinfo:
6.5-rc5 working
18 1 0:16 / / rw,noatime - btrfs /dev/sda8 ...
6.7 not working:
17 1 0:15 / / rw,noatime - btrfs /dev/root ...
and "update-grub" shows this error:
/usr/sbin/grub-probe: error: cannot find a device for / (is /dev mounted?)
This looks like it's related to the device name, but grub-probe
recognizes the "/dev/root" path and tries to find the underlying device.
However there's a special case for some filesystems, for btrfs in
particular.
The generic root device detection heuristic is not done and it all
relies on reading the device infos by a btrfs specific ioctl. This ioctl
returns the device name as it was saved at the time of device scan (in
this case it's /dev/root).
The change in 6.7 for temp_fsid to allow several single device
filesystem to exist with the same fsid (and transparently generate a new
UUID at mount time) was to skip caching/registering such devices.
This also skipped mounted device. One step of scanning is to check if
the device name hasn't changed, and if yes then update the cached value.
This broke the grub-probe as it always read the device /dev/root and
couldn't find it in the system. A temporary workaround is to create a
symlink but this does not survive reboot.
The right fix is to allow updating the device path of a mounted
filesystem even if this is a single device one.
In the fix, check if the device's major:minor number matches with the
cached device. If they do, then we can allow the scan to happen so that
device_list_add() can take care of updating the device path. The file
descriptor remains unchanged.
This does not affect the temp_fsid feature, the UUID of the mounted
filesystem remains the same and the matching is based on device major:minor
which is unique per mounted filesystem.
This covers the path when the device (that exists for all mounted
devices) name changes, updating /dev/root to /dev/sdx. Any other single
device with filesystem and is not mounted is still skipped.
Note that if a system is booted and initial mount is done on the
/dev/root device, this will be the cached name of the device. Only after
the command "btrfs device scan" it will change as it triggers the
rename.
The fix was verified by users whose systems were affected.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=218353
Link: https://lore.kernel.org/lkml/CAKLYgeJ1tUuqLcsquwuFqjDXPSJpEiokrWK2gisPKDZLs8Y2TQ@mail.gmail.com/
Fixes: bc27d6f0aa ("btrfs: scan but don't register device on single device filesystem")
CC: stable@vger.kernel.org # 6.7+
Tested-by: Alex Romosan <aromosan@gmail.com>
Tested-by: CHECK_1234543212345@protonmail.com
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At the moment scrub_supers() doesn't grab the super block's location via
the zoned device aware btrfs_sb_log_location() but via btrfs_sb_offset().
This leads to checksum errors on 'scrub' as we're not accessing the
correct location of the super block.
So use btrfs_sb_log_location() for getting the super blocks location on
scrub.
Reported-by: WA AM <waautomata@gmail.com>
Link: http://lore.kernel.org/linux-btrfs/CANU2Z0EvUzfYxczLgGUiREoMndE9WdQnbaawV5Fv5gNXptPUKw@mail.gmail.com
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmXvUekACgkQxWXV+ddt
WDuDpA//QiTipyU+v2b0aV2iOQs66YxFU0D9suQnin2paAU9YHzT6cLr9uYLAnPE
Hs57jfZiWiCKSTVJwezJJb5azKmC9M9Fm0uSny51O7EKibcyLEDuHGrMB4C+O/9e
7PQD6K6WCRfH7PzLPeDYSK8tdHyj8hu1YbW/o/iBfQGyCxZVejCuOr/tItnO9JxY
km8pwmcREzOTGyBBjA19QKiC1hY4cARtLqtzxCBrfFcMgT2H6KbAciXzBabdMf8D
8NpP98HOFpi5sOVauSQDz8t0aQkGVWyP1yIBZ0rdQesTp7kqkXLCJOSLAw8M2Q4c
la0zywlOb4hjh0vO1gyzyJ+HPA+UZtkebeMvm0BtNukMKi2hn/AF94af4jVuR6e5
fjK79q3EU87RjluMW6wPux/MFJBJdDJrdhwZVkYFNf6yMv+L94NOcCDD3d346Hgr
hk5gOFhZ38Me9zC3/4z0NboiSxnoTk1W0hz1Je8e1vXdeIEzexkJQM6AhP8ovAjL
S9dl2po2SNLo9qvzg8rPkWKktAcI7gDZhM6mMBZispTC7JgtByHC2gd8yiys0ss0
cs0gAkL2SqOCQNNEQuf7lz7p3dhXBDkPJBmISEi4Fsnxxo7ltPECcR9kYXJ7gnqK
Hcamuc2XD8oncJ6NuqplBwmgLrjZP9I2ckUGdd5bUQPYJegx3Vw=
=dgEi
-----END PGP SIGNATURE-----
Merge tag 'for-6.9-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"Mostly stabilization, refactoring and cleanup changes. There rest are
minor performance optimizations due to caching or lock contention
reduction and a few notable fixes.
Performance improvements:
- minor speedup in logging when repeatedly allocated structure is
preallocated only once, improves latency and decreases lock
contention
- minor throughput increase (+6%), reduced lock contention after
clearing delayed allocation bits, applies to several common
workload types
- skip full quota rescan if a new relation is added in the same
transaction
Fixes:
- zstd fix for inline compressed file in subpage mode, updated
version from the 6.8 time
- proper qgroup inheritance ioctl parameter validation
- more fiemap followup fixes after reduced locking done in 6.8:
- fix race when detecting delalloc ranges
Core changes:
- more debugging code:
- added assertions for a very rare crash in raid56 calculation
- tree-checker dumps page state to give more insights into
possible reference counting issues
- add checksum calculation offloading sysfs knob, for now enabled
under DEBUG only to determine a good heuristic for deciding the
offload or synchronous, depends on various factors (block group
profile, device speed) and is not as clear as initially thought
(checksum type)
- error handling improvements, added assertions
- more page to folio conversion (defrag, truncate), cached size and
shift
- preparation for more fine grained locking of sectors in subpage
mode
- cleanups and refactoring:
- include cleanups, forward declarations
- pointer-to-structure helpers
- redundant argument removals
- removed unused code
- slab cache updates, last use of SLAB_MEM_SPREAD removed"
* tag 'for-6.9-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (114 commits)
btrfs: reuse cloned extent buffer during fiemap to avoid re-allocations
btrfs: fix race when detecting delalloc ranges during fiemap
btrfs: fix off-by-one chunk length calculation at contains_pending_extent()
btrfs: qgroup: allow quick inherit if snapshot is created and added to the same parent
btrfs: qgroup: validate btrfs_qgroup_inherit parameter
btrfs: include device major and minor numbers in the device scan notice
btrfs: mark btrfs_put_caching_control() static
btrfs: remove SLAB_MEM_SPREAD flag use
btrfs: qgroup: always free reserved space for extent records
btrfs: tree-checker: dump the page status if hit something wrong
btrfs: compression: remove dead comments in btrfs_compress_heuristic()
btrfs: subpage: make writer lock utilize bitmap
btrfs: subpage: make reader lock utilize bitmap
btrfs: unexport btrfs_subpage_start_writer() and btrfs_subpage_end_and_test_writer()
btrfs: pass a valid extent map cache pointer to __get_extent_map()
btrfs: merge btrfs_del_delalloc_inode() helpers
btrfs: pass btrfs_device to btrfs_scratch_superblocks()
btrfs: handle transaction commit errors in flush_reservations()
btrfs: use KMEM_CACHE() to create btrfs_free_space cache
btrfs: use KMEM_CACHE() to create delayed ref caches
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZem4DwAKCRCRxhvAZXjc
ooTRAQDRI6Qz6wJym5Yblta8BScMGbt/SgrdgkoCvT6y83MtqwD+Nv/AZQzi3A3l
9NdULtniW1reuCYkc8R7dYM8S+yAwAc=
=Y1qX
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.9.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull block handle updates from Christian Brauner:
"Last cycle we changed opening of block devices, and opening a block
device would return a bdev_handle. This allowed us to implement
support for restricting and forbidding writes to mounted block
devices. It was accompanied by converting and adding helpers to
operate on bdev_handles instead of plain block devices.
That was already a good step forward but ultimately it isn't necessary
to have special purpose helpers for opening block devices internally
that return a bdev_handle.
Fundamentally, opening a block device internally should just be
equivalent to opening files. So now all internal opens of block
devices return files just as a userspace open would. Instead of
introducing a separate indirection into bdev_open_by_*() via struct
bdev_handle bdev_file_open_by_*() is made to just return a struct
file. Opening and closing a block device just becomes equivalent to
opening and closing a file.
This all works well because internally we already have a pseudo fs for
block devices and so opening block devices is simple. There's a few
places where we needed to be careful such as during boot when the
kernel is supposed to mount the rootfs directly without init doing it.
Here we need to take care to ensure that we flush out any asynchronous
file close. That's what we already do for opening, unpacking, and
closing the initramfs. So nothing new here.
The equivalence of opening and closing block devices to regular files
is a win in and of itself. But it also has various other advantages.
We can remove struct bdev_handle completely. Various low-level helpers
are now private to the block layer. Other helpers were simply
removable completely.
A follow-up series that is already reviewed build on this and makes it
possible to remove bdev->bd_inode and allows various clean ups of the
buffer head code as well. All places where we stashed a bdev_handle
now just stash a file and use simple accessors to get to the actual
block device which was already the case for bdev_handle"
* tag 'vfs-6.9.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits)
block: remove bdev_handle completely
block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access
bdev: remove bdev pointer from struct bdev_handle
bdev: make struct bdev_handle private to the block layer
bdev: make bdev_{release, open_by_dev}() private to block layer
bdev: remove bdev_open_by_path()
reiserfs: port block device access to file
ocfs2: port block device access to file
nfs: port block device access to files
jfs: port block device access to file
f2fs: port block device access to files
ext4: port block device access to file
erofs: port device access to file
btrfs: port device access to file
bcachefs: port block device access to file
target: port block device access to file
s390: port block device access to file
nvme: port block device access to file
block2mtd: port device access to files
bcache: port block device access to files
...
During fiemap we may have to visit multiple leaves of the subvolume's
inode tree, and each time we are freeing and allocating an extent buffer
to use as a clone of each visited leaf. Optimize this by reusing cloned
extent buffers, to avoid the freeing and re-allocation both of the extent
buffer structure itself and more importantly of the pages attached to the
extent buffer.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For fiemap we recently stopped locking the target extent range for the
whole duration of the fiemap call, in order to avoid a deadlock in a
scenario where the fiemap buffer happens to be a memory mapped range of
the same file. This use case is very unlikely to be useful in practice but
it may be triggered by fuzz testing (syzbot, etc).
This however introduced a race that makes us miss delalloc ranges for
file regions that are currently holes, so the caller of fiemap will not
be aware that there's data for some file regions. This can be quite
serious for some use cases - for example in coreutils versions before 9.0,
the cp program used fiemap to detect holes and data in the source file,
copying only regions with data (extents or delalloc) from the source file
to the destination file in order to preserve holes (see the documentation
for its --sparse command line option). This means that if cp was used
with a source file that had delalloc in a hole, the destination file could
end up without that data, which is effectively a data loss issue, if it
happened to hit the race described below.
The race happens like this:
1) Fiemap is called, without the FIEMAP_FLAG_SYNC flag, for a file that
has delalloc in the file range [64M, 65M[, which is currently a hole;
2) Fiemap locks the inode in shared mode, then starts iterating the
inode's subvolume tree searching for file extent items, without having
the whole fiemap target range locked in the inode's io tree - the
change introduced recently by commit b0ad381fa7 ("btrfs: fix
deadlock with fiemap and extent locking"). It only locks ranges in
the io tree when it finds a hole or prealloc extent since that
commit;
3) Note that fiemap clones each leaf before using it, and this is to
avoid deadlocks when locking a file range in the inode's io tree and
the fiemap buffer is memory mapped to some file, because writing
to the page with btrfs_page_mkwrite() will wait on any ordered extent
for the page's range and the ordered extent needs to lock the range
and may need to modify the same leaf, therefore leading to a deadlock
on the leaf;
4) While iterating the file extent items in the cloned leaf before
finding the hole in the range [64M, 65M[, the delalloc in that range
is flushed and its ordered extent completes - meaning the corresponding
file extent item is in the inode's subvolume tree, but not present in
the cloned leaf that fiemap is iterating over;
5) When fiemap finds the hole in the [64M, 65M[ range by seeing the gap in
the cloned leaf (or a file extent item with disk_bytenr == 0 in case
the NO_HOLES feature is not enabled), it will lock that file range in
the inode's io tree and then search for delalloc by checking for the
EXTENT_DELALLOC bit in the io tree for that range and ordered extents
(with btrfs_find_delalloc_in_range()). But it finds nothing since the
delalloc in that range was already flushed and the ordered extent
completed and is gone - as a result fiemap will not report that there's
delalloc or an extent for the range [64M, 65M[, so user space will be
mislead into thinking that there's a hole in that range.
This could actually be sporadically triggered with test case generic/094
from fstests, which reports a missing extent/delalloc range like this:
generic/094 2s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad)
--- tests/generic/094.out 2020-06-10 19:29:03.830519425 +0100
+++ /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad 2024-02-28 11:00:00.381071525 +0000
@@ -1,3 +1,9 @@
QA output created by 094
fiemap run with sync
fiemap run without sync
+ERROR: couldn't find extent at 7
+map is 'HHDDHPPDPHPH'
+logical: [ 5.. 6] phys: 301517.. 301518 flags: 0x800 tot: 2
+logical: [ 8.. 8] phys: 301520.. 301520 flags: 0x800 tot: 1
...
(Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/generic/094.out /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad' to see the entire diff)
So in order to fix this, while still avoiding deadlocks in the case where
the fiemap buffer is memory mapped to the same file, change fiemap to work
like the following:
1) Always lock the whole range in the inode's io tree before starting to
iterate the inode's subvolume tree searching for file extent items,
just like we did before commit b0ad381fa7 ("btrfs: fix deadlock with
fiemap and extent locking");
2) Now instead of writing to the fiemap buffer every time we have an extent
to report, write instead to a temporary buffer (1 page), and when that
buffer becomes full, stop iterating the file extent items, unlock the
range in the io tree, release the search path, submit all the entries
kept in that buffer to the fiemap buffer, and then resume the search
for file extent items after locking again the remainder of the range in
the io tree.
The buffer having a size of a page, allows for 146 entries in a system
with 4K pages. This is a large enough value to have a good performance
by avoiding too many restarts of the search for file extent items.
In other words this preserves the huge performance gains made in the
last two years to fiemap, while avoiding the deadlocks in case the
fiemap buffer is memory mapped to the same file (useless in practice,
but possible and exercised by fuzz testing and syzbot).
Fixes: b0ad381fa7 ("btrfs: fix deadlock with fiemap and extent locking")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At contains_pending_extent() the value of the end offset of a chunk we
found in the device's allocation state io tree is inclusive, so when
we calculate the length we pass to the in_range() macro, we must sum
1 to the expression "physical_end - physical_offset".
In practice the wrong calculation should be harmless as chunks sizes
are never 1 byte and we should never have 1 byte ranges of unallocated
space. Nevertheless fix the wrong calculation.
Reported-by: Alex Lyakas <alex.lyakas@zadara.com>
Link: https://lore.kernel.org/linux-btrfs/CAOcd+r30e-f4R-5x-S7sV22RJPe7+pgwherA6xqN2_qe7o4XTg@mail.gmail.com/
Fixes: 1c11b63eff ("btrfs: replace pending/pinned chunks lists with io tree")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently "btrfs subvolume snapshot -i <qgroupid>" would always mark the
qgroup inconsistent.
This can be annoying if the fs has a lot of snapshots, and needs qgroup
to get the accounting for the amount of bytes it can free for each
snapshot.
Although we have the new simple quote as a solution, there is also a
case where we can skip the full scan, if all the following conditions
are met:
- The source subvolume belongs to a higher level parent qgroup
- The parent qgroup already owns all its bytes exclusively
- The new snapshot is also added to the same parent qgroup
In that case, we only need to add nodesize to the parent qgroup and
avoid a full rescan.
This patch would add the extra quick accounting update for such inherit.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Currently btrfs can create subvolume with an invalid qgroup inherit
without triggering any error:
# mkfs.btrfs -O quota -f $dev
# mount $dev $mnt
# btrfs subvolume create -i 2/0 $mnt/subv1
# btrfs qgroup show -prce --sync $mnt
Qgroupid Referenced Exclusive Path
-------- ---------- --------- ----
0/5 16.00KiB 16.00KiB <toplevel>
0/256 16.00KiB 16.00KiB subv1
[CAUSE]
We only do a very basic size check for btrfs_qgroup_inherit structure,
but never really verify if the values are correct.
Thus in btrfs_qgroup_inherit() function, we have to skip non-existing
qgroups, and never return any error.
[FIX]
Fix the behavior and introduce extra checks:
- Introduce early check for btrfs_qgroup_inherit structure
Not only the size, but also all the qgroup ids would be verified.
And the timing is very early, so we can return error early.
This early check is very important for snapshot creation, as snapshot
is delayed to transaction commit.
- Drop support for btrfs_qgroup_inherit::num_ref_copies and
num_excl_copies
Those two members are used to specify to copy refr/excl numbers from
other qgroups.
This would definitely mark qgroup inconsistent, and btrfs-progs has
dropped the support for them for a long time.
It's time to drop the support for kernel.
- Verify the supported btrfs_qgroup_inherit::flags
Just in case we want to add extra flags for btrfs_qgroup_inherit.
Now above subvolume creation would fail with -ENOENT other than silently
ignore the non-existing qgroup.
CC: stable@vger.kernel.org # 6.7+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
To better debug issues surrounding device scans, include the device's
major and minor numbers in the device scan notice for btrfs.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_put_caching_control() is only used in block-group.c, so mark it
static.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Lijuan Li <lilijuan@iscas.ac.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The SLAB_MEM_SPREAD flag used to be implemented in SLAB, which was
removed as of v6.8-rc1, so it became a dead flag since the commit
16a1d96835 ("mm/slab: remove mm/slab.c and slab_def.h"). And the
series[1] went on to mark it obsolete to avoid confusion for users.
Here we can just remove all its users, which has no functional change.
[1] https://lore.kernel.org/all/20240223-slab-cleanup-flags-v2-1-02f1753e8303@suse.cz/
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
If qgroup is marked inconsistent (e.g. caused by operations needing full
subtree rescan, like creating a snapshot and assign to a higher level
qgroup), btrfs would immediately start leaking its data reserved space.
The following script can easily reproduce it:
mkfs.btrfs -O quota -f $dev
mount $dev $mnt
btrfs subvolume create $mnt/subv1
btrfs qgroup create 1/0 $mnt
# This snapshot creation would mark qgroup inconsistent,
# as the ownership involves different higher level qgroup, thus
# we have to rescan both source and snapshot, which can be very
# time consuming, thus here btrfs just choose to mark qgroup
# inconsistent, and let users to determine when to do the rescan.
btrfs subv snapshot -i 1/0 $mnt/subv1 $mnt/snap1
# Now this write would lead to qgroup rsv leak.
xfs_io -f -c "pwrite 0 64k" $mnt/file1
# And at unmount time, btrfs would report 64K DATA rsv space leaked.
umount $mnt
And we would have the following dmesg output for the unmount:
BTRFS info (device dm-1): last unmount of filesystem 14a3d84e-f47b-4f72-b053-a8a36eef74d3
BTRFS warning (device dm-1): qgroup 0/5 has unreleased space, type 0 rsv 65536
[CAUSE]
Since commit e15e9f43c7 ("btrfs: introduce
BTRFS_QGROUP_RUNTIME_FLAG_NO_ACCOUNTING to skip qgroup accounting"),
we introduce a mode for btrfs qgroup to skip the timing consuming
backref walk, if the qgroup is already inconsistent.
But this skip also covered the data reserved freeing, thus the qgroup
reserved space for each newly created data extent would not be freed,
thus cause the leakage.
[FIX]
Make the data extent reserved space freeing mandatory.
The qgroup reserved space handling is way cheaper compared to the
backref walking part, and we always have the super sensitive leak
detector, thus it's definitely worth to always free the qgroup
reserved data space.
Reported-by: Fabian Vogt <fvogt@suse.com>
Fixes: e15e9f43c7 ("btrfs: introduce BTRFS_QGROUP_RUNTIME_FLAG_NO_ACCOUNTING to skip qgroup accounting")
CC: stable@vger.kernel.org # 6.1+
Link: https://bugzilla.suse.com/show_bug.cgi?id=1216196
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
There is a bug report about very suspicious tree-checker got triggered:
BTRFS critical (device dm-0): corrupted node, root=256
block=8550954455682405139 owner mismatch, have 11858205567642294356
expect [256, 18446744073709551360]
BTRFS critical (device dm-0): corrupted node, root=256
block=8550954455682405139 owner mismatch, have 11858205567642294356
expect [256, 18446744073709551360]
BTRFS critical (device dm-0): corrupted node, root=256
block=8550954455682405139 owner mismatch, have 11858205567642294356
expect [256, 18446744073709551360]
SELinux: inode_doinit_use_xattr: getxattr returned 117 for dev=dm-0
ino=5737268
[ANALYZE]
The root cause is still unclear, but there are some clues already:
- Unaligned eb bytenr
The block bytenr is 8550954455682405139, which is not even aligned to
2.
This bytenr is fetched from extent buffer header, not from eb->start.
This means, at the initial time of read, eb header bytenr is still
correct (the very basis check to continue read), but later something
wrong happened, got at least the first page corrupted.
Thus we got such obviously incorrect value.
- Invalid extent buffer header owner
The read itself is triggered for subvolume 256, but the eb header
owner is 11858205567642294356, which is not really possible.
The problem here is, subvolume id is limited to (1 << 48 - 1),
and this one definitely goes beyond that limit.
So this value is another garbage.
We already got two garbage from an extent buffer, which passed the
initial bytenr and csum checks, but later the contents become garbage at
some point.
This looks like a page lifespan problem (e.g. we didn't properly hold the
page).
[ENHANCEMENT]
The current tree-checker only outputs things from the extent buffer,
nothing with the page status.
So this patch would enhance the tree-checker output by also dumping the
first page, which would look like this:
page:00000000aa9f3ce8 refcount:4 mapcount:0 mapping:00000000169aa6b6 index:0x1d0c pfn:0x1022e5
memcg:ffff888103456000
aops:btree_aops [btrfs] ino:1
flags: 0x2ffff0000008000(private|node=0|zone=2|lastcpupid=0xffff)
page_type: 0xffffffff()
raw: 02ffff0000008000 0000000000000000 dead000000000122 ffff88811e06e220
raw: 0000000000001d0c ffff888102fdb1d8 00000004ffffffff ffff888103456000
page dumped because: eb page dump
BTRFS critical (device dm-3): corrupt leaf: root=5 block=30457856 slot=6 ino=257 file_offset=0, invalid disk_bytenr for file extent, have 10617606235235216665, should be aligned to 4096
BTRFS error (device dm-3): read time tree block corruption detected on logical 30457856 mirror 1
From the dump we can see some extra info, something can help us to do
extra cross-checks:
- Page refcount
if it's too low, it definitely means something bad.
- Page aops
Any mapped eb page should have btree_aops with inode number 1.
- Page index
Since a mapped eb page should has its bytenr matching the page
position, (index << PAGE_SHIFT) should match the bytenr of the
bytenr from the critical line.
- Page Private flags
A mapped eb page should have Private flag set to indicate it's managed
by btrfs.
Link: https://lore.kernel.org/linux-btrfs/CAHk-=whNdMaN9ntZ47XRKP6DBes2E5w7fi-0U3H2+PS18p+Pzw@mail.gmail.com/
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit a440d48c7f ("Btrfs: heuristic: implement sampling
logic"), btrfs_compress_heuristic() is no longer a simple "return true",
but more complex to determine if we should compress.
Thus the comment is dead and can be confusing, just remove it.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For the writer counter, it's pretty much the same as the reader counter,
and they are exclusive. So move them to the new locked bitmap.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs_subpage utilizes its atomic member @reader to manage the
reader counter. However it is only utilized to prevent the page to be
released/unlocked when we still have reads underway.
In that use case, we don't really allow multiple readers on the same
subpage sector. So here we can introduce a new locked bitmap to
represent exactly which subpage range is locked for read.
In theory we can remove btrfs_subpage::reader as it's just the set bits
of the new locked bitmap. But unfortunately bitmap doesn't provide such
handy API yet, so we still keep the reader counter.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Both functions were introduced in commit 1e1de38792 ("btrfs: make
process_one_page() to handle subpage locking"), but they have never
been utilized out of subpage code. So just unexport them.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can pass a valid em cache pointer down to __get_extent_map() and
drop the validity check. This avoids the special case, the call stacks
are simple:
btrfs_read_folio
btrfs_do_readpage
__get_extent_map
extent_readahead
contiguous_readpages
btrfs_do_readpage
__get_extent_map
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The helpers btrfs_del_delalloc_inode() and __btrfs_del_delalloc_inode()
don't follow the pattern when the "__" helper does a special case and
are in fact reversed regarding the naming. We can merge them into one as
there's only one place that needs to be open coded.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Replace the two parameters bdev and name by one that can be used to get
them both.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Other errors in flush_reservations() are handled and also in the caller.
Ignoring commit might make some sense as it's called right after join so
it's to poke the whole commit machinery to free space.
However for consistency return the error. The caller
btrfs_quota_disable() would try to start the transaction which would
in turn fail too so there's no effective change.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use the KMEM_CACHE() macro instead of kmem_cache_create() to simplify
the creation of SLAB caches when the default values are used.
Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use the KMEM_CACHE() macro instead of kmem_cache_create() to simplify
the creation of SLAB caches related to delayed refs when the default
values are used.
Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use the KMEM_CACHE() macro instead of kmem_cache_create() to simplify
the creation of SLAB caches when the default values are used.
Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use the KMEM_CACHE() macro instead of kmem_cache_create() to simplify
the creation of SLAB caches when the default values are used.
Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use the KMEM_CACHE() macro instead of kmem_cache_create() to simplify
the creation of SLAB caches when the default values are used.
Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use the KMEM_CACHE() macro instead of kmem_cache_create() to simplify
the creation of SLAB caches when the default values are used.
Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The helpers are doing an initialization or release work, none of which
is performance critical that it would require a static inline, so move
them to the .c file.
Signed-off-by: David Sterba <dsterba@suse.com>
The helpers are doing an initialization or release work, none of which
is performance critical that it would require a static inline, so move
them to the .c file.
Signed-off-by: David Sterba <dsterba@suse.com>
Using static inline in a .c file should be justified, e.g. when
functions are on a hot path but none of the affected functions seem to
be. As it's all in one compilation unit let the compiler decide.
Signed-off-by: David Sterba <dsterba@suse.com>
There are many helpers doing simple things but not simple enough to
justify the static inline. None of them seems to be on a hot path so
move them to .c.
Signed-off-by: David Sterba <dsterba@suse.com>
The helper is trivial, we can inline it. It's safe to remove the 'if' as
the iterator is always valid when used, the potential NULL was never
checked anyway.
Signed-off-by: David Sterba <dsterba@suse.com>
The helper is trivial and used only once, open code it. It's safe to
remove the 'if', the pointer is validated in build_backref_tree().
Signed-off-by: David Sterba <dsterba@suse.com>
The from/to CPU/disk helpers for balance args are used only in volumes,
no need to define them in accessors.h.
Signed-off-by: David Sterba <dsterba@suse.com>
We disable offloading checksum to workqueues and do it synchronously when
the checksum algorithm is fast. However, as reported in the link below,
RAID0 with multiple devices may suffer from the sync checksum, because
"fast checksum" is still not fast enough to catch up with RAID0 writing.
We don't have an effective way to determine whether to offload or not,
for now add a sysfs knob so this can be debugged. This is intentionally
under CONFIG_BTRFS_DEBUG so ti's not exposed to users as it may be
removed in the future agin.
Introduce fs_devices->offload_csum_mode, so that a btrfs developer can
change the behavior by writing to /sys/fs/btrfs/<uuid>/offload_csum. The
default is "auto" which is the same as the previous behavior. Or, you
can set "on" or "off" (or "y" or "n" whatever kstrtobool() accepts) to
always/never offload checksum.
More benchmark need to be collected with this knob to implement a proper
criteria to enable/disable checksum offloading.
Link: https://lore.kernel.org/linux-btrfs/20230731152223.4EFB.409509F4@e16-tech.com/
Link: https://lore.kernel.org/linux-btrfs/p3vo3g7pqn664mhmdhlotu5dzcna6vjtcoc2hb2lsgo2fwct7k@xzaxclba5tae/
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
I have got at least two crash report for RAID6 syndrome generation, no
matter if it's AVX2 or SSE2, they all seems to have a similar
calltrace with corrupted RAX:
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP PTI
Workqueue: btrfs-rmw rmw_rbio_work [btrfs]
RIP: 0010:raid6_sse21_gen_syndrome+0x9e/0x130 [raid6_pq]
RAX: 0000000000000000 RBX: 0000000000001000 RCX: ffffa0ff4cfa3248
RDX: 0000000000000000 RSI: ffffa0f74cfa3238 RDI: 0000000000000000
Call Trace:
<TASK>
rmw_rbio+0x5c8/0xa80 [btrfs]
process_one_work+0x1c7/0x3d0
worker_thread+0x4d/0x380
kthread+0xf3/0x120
ret_from_fork+0x2c/0x50
</TASK>
[CAUSE]
The cause is not known. Recently I also hit this in AVX512 path, and
that's even in v5.15 backport, which doesn't have any of my RAID56
rework.
Furthermore according to the registers:
RAX: 0000000000000000 RBX: 0000000000001000 RCX: ffffa0ff4cfa3248
The RAX register is showing the number of stripes (including PQ), which
is not correct (0). But the remaining two registers are all sane.
- RBX is the sectorsize
For x86_64 it should always be 4K and matches the output.
- RCX is the pointers array
Which is from rbio->finish_pointers, and it looks like a sane
kernel address.
[WORKAROUND]
For now, I can only add extra debug ASSERT()s before we call raid6
gen_syndrome() helper and hopes to catch the problem.
The debug requires both CONFIG_BTRFS_DEBUG and CONFIG_BTRFS_ASSERT
enabled.
My current guess is some use-after-free, but every report is only having
corrupted RAX but seemingly valid pointers doesn't make much sense.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_free_tree_block(), we are always initializing a delayed reference
to drop the given extent buffer but we only use if it does not belong to a
log root tree. So we are doing unnecessary work here and increasing the
duration of a critical section as this is normally called while holding a
lock on the parent tree block (if any) and while holding a log transaction
open.
So initialize the delayed reference only if the extent buffer is not from
a log tree, avoiding unnecessary work and making the code also a bit
easier to follow.
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During an incremental send, before determining if we need to send a hole
(write operations full of zeroes) we will search for the last extent's
end offset if we are at the first slot of a leaf and the last processed
extent's end offset is smaller then the current extent's start offset.
However we are repeating this search in case we had the last extent's end
offset undefined (set to the (u64)-1 value) when we entered
maybe_send_hole(), wasting time.
So avoid this duplicated search by combining the two conditions that
trigger a search for the last extent's end offset into a single if
statement.
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The validation of vol args v2 name in snapshot and device remove ioctls
is not done properly. A terminating NUL is written to the end of the
buffer unconditionally, assuming that this would be the last place in
case the buffer is used completely. This does not communicate back the
actual error (either an invalid or too long path).
Factor out all such cases and use a helper to do the verification,
simply look for NUL in the buffer. There's no expected practical
change, the size of buffer is 4088, this is enough for most paths or
names.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
The validation of vol args name in several ioctls is not done properly.
a terminating NUL is written to the end of the buffer unconditionally,
assuming that this would be the last place in case the buffer is used
completely. This does not communicate back the actual error (either an
invalid or too long path).
Factor out all such cases and use a helper to do the verification,
simply look for NUL in the buffer. There's no expected practical change,
the size of buffer is 4088, this is enough for most paths or names.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfs_transaction_in_commit() is no longer used, its last
use was removed in commit 11aeb97b45 ("btrfs: don't arbitrarily slow
down delalloc if we're committing"), so just remove it.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The IS_ENABLED() macro already guarantees the result will be a
suitable boolean return value ("1" for enabled, and "0" for disabled).
Thus, it seems that the "!!" used right before is unnecessary to force
the 0/1 values.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Neal Gompa <neal@gompa.dev>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The purpose of the BUG_ON is not clear. The helper btrfs_grab_root()
could return a NULL in case args->root would be a NULL or if there are
zero references. Then we check if the root pointer stored in the inode
still exists.
The whole call chain is for iget:
btrfs_iget
btrfs_iget_path
btrfs_iget_locked
iget5_locked
btrfs_init_locked_inode
which is called from many contexts where we the root pointer is used and
we can safely assume has enough references.
Signed-off-by: David Sterba <dsterba@suse.com>
Checking extent item size in add_inline_refs() is redundant, we do that
already in tree-checker after reading the extent buffer and it won't
change under normal circumstances. It was added long ago in
8da6d5815c ("Btrfs: added btrfs_find_all_roots()") and does not seem
to have a clear purpose.
Similar case in extent_from_logical(), added in a542ad1baf ("btrfs:
added helper functions to iterate backrefs").
Signed-off-by: David Sterba <dsterba@suse.com>
The BUG_ON is deep in the qgroup code where we can expect that it
exists. A NULL pointer would cause a crash.
It was added long ago in 550d7a2ed5 ("btrfs: qgroup: Add new qgroup
calculation function btrfs_qgroup_account_extents()."). It maybe made
sense back then as the quota enable/disable state machine was not that
robust as it is nowadays, so we can just delete it.
Signed-off-by: David Sterba <dsterba@suse.com>
The only caller do_walk_down() of btrfs_qgroup_trace_subtree() validates
the value of level and uses it several times before it's passed as an
argument. Same for root_eb that's called 'next' in the caller.
Change both BUG_ONs to assertions as this is to assure proper interface
use rather than real errors.
Signed-off-by: David Sterba <dsterba@suse.com>
There's only one caller of tree_move_down() that does not pass level 0
so the assertion is better suited here.
Signed-off-by: David Sterba <dsterba@suse.com>
Change BUG_ON to proper error handling if building the path buffer
fails. The pointers are not printed so we don't accidentally leak kernel
addresses.
Signed-off-by: David Sterba <dsterba@suse.com>
Change BUG_ON to proper error handling when an unexpected inode number
is encountered. As the comment says this should never happen.
Signed-off-by: David Sterba <dsterba@suse.com>
Change BUG_ON to a proper error handling in the unlikely case of seeing
data when the command is started. This is supposed to be reset when the
command is finished (send_cmd, send_encoded_extent).
Signed-off-by: David Sterba <dsterba@suse.com>
The may_destroy_subvol() looks up a root by a key, allowing to do an
inexact search when key->offset is -1. It's never expected to find such
item, as it would break the allowed range of a root id.
Signed-off-by: David Sterba <dsterba@suse.com>
The find_first_extent_item() helper looks up an extent item by a key,
allowing to do an inexact search when key->offset is -1. It's never
expected to find such item, as it would break the allowed range of a
extent item offset.
Signed-off-by: David Sterba <dsterba@suse.com>
The extent_from_logical() helper looks up an extent item by a key,
allowing to do an inexact search when key->offset is -1. It's never
expected to find such item, as it would break the allowed range of a
extent item offset.
The same error is already handled in btrfs_backref_iter_start() so add a
comment for consistency.
Signed-off-by: David Sterba <dsterba@suse.com>
Same comment was added to this type of error, unify that and drop the
assertion as we'd find out quickly that something is wrong after
returning -EUCLEAN.
Signed-off-by: David Sterba <dsterba@suse.com>
The memory allocation error in add_async_extent() is not handled
properly, return an error and push the BUG_ON to the caller. Handling it
there is not trivial so at least make it visible.
Signed-off-by: David Sterba <dsterba@suse.com>
The "do_list" variable has a rather confusing name, so remove it and
directly use btrfs_is_free_space_inode() instead.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The "do_list" variable is only used once, plus its name/meaning is a bit
confusing, so remove it and directory use btrfs_is_free_space_inode().
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When adding or removing and inode to/from the root's delalloc list,
instead of using a BUG_ON() to validate list emptiness, use ASSERT()
since this is to check logic errors rather than real errors.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The merge and split callbacks for an inode's io tree are supposed to be
called while the io tree's spinlock is being held, so that the given
extent_state records are stable, not modified or freed while the callbacks
are using them. So add lockdep assertions in the callbacks.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When setting and clearing a delalloc range, at btrfs_set_delalloc_extent()
and btrfs_clear_delalloc_extent(), we are adding/removing the inode
to/from the root's list of delalloc inodes while under the protection of
the inode's lock. This however is not needed, we can add and remove the
inode to the root's list without holding the inode's lock because here
we are under the protection of the io tree's lock, reducing the size of
the critical section delimited by the inode's lock. The inode's lock is
used in many other places such as when finishing an ordered extent (when
calling btrfs_update_inode_bytes() or btrfs_delalloc_release_metadata(),
or decreasing the number of outstanding extents) or when reserving space
when doing a buffered or direct IO write (calls to functions from
delalloc-space.c).
So move the inode add/remove operations to the root's list of delalloc
inodes to outside the critical section delimited by the inode's lock.
This also allows us to get rid of the BTRFS_INODE_IN_DELALLOC_LIST flag
since we can rely on the inode's delalloc bytes counter to determine if
the inode is or is not in the list.
The following fio based test, that exercises IO to multiple files in the
same subvolume, was used to test:
$ cat test.sh
#!/bin/bash
DEV=/dev/nullb0
MNT=/mnt/nullb0
MOUNT_OPTIONS="-o ssd"
mkfs.btrfs -f $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
fio --direct=0 --ioengine=sync --thread --directory=$MNT \
--invalidate=1 --group_reporting=1 \
--new_group --rw=randwrite --size=50m --numjobs=200 \
--bs=4k --fsync_on_close=0 --fallocate=none --end_fsync=0 \
--name=foo --filename_format=FioWorkloads.\$jobnum
umount $MNT
The test was run on a non-debug kernel (Debian's default kernel config)
against a 16G null block device.
Result before this patch:
WRITE: bw=81.9MiB/s (85.9MB/s), 81.9MiB/s-81.9MiB/s (85.9MB/s-85.9MB/s), io=9.77GiB (10.5GB), run=122136-122136msec
Result after this patch:
WRITE: bw=86.8MiB/s (91.0MB/s), 86.8MiB/s-86.8MiB/s (91.0MB/s-91.0MB/s), io=9.77GiB (10.5GB), run=115180-115180msec
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfs_add_delalloc_inodes() adds a single inode its root's
list of delalloc inodes, so it doesn't make any sense at all for the
function's name to be plural. Rename it to the singular form
btrfs_add_delalloc_inode() to avoid any confusion.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function requires the delalloc lock of the inode's root to be held,
so assert it's held.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no need to pass a root argument to __btrfs_del_delalloc_inode()
and btrfs_del_delalloc_inode(), we can just pass the inode since the root
is always the root associated to that inode. Some remove the root argument
from these functions.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no need to pass a root argument to btrfs_add_delalloc_inodes(), we
can just pass the inode since the root is always the root associated to
the inode in the context it's called. So remove it and have the single
caller pass only the inode.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Do a cleanup in the rest of the headers:
- add forward declarations for types referenced by pointers
- add includes when types need them
This fixes potential compilation problems if the headers are reordered
or the missing includes are not provided indirectly.
Signed-off-by: David Sterba <dsterba@suse.com>
Do a cleanup in more headers:
- add forward declarations for types referenced by pointers
- add includes when types need them
This fixes potential compilation problems if the headers are reordered
or the missing includes are not provided indirectly.
Signed-off-by: David Sterba <dsterba@suse.com>
Do a cleanup in the short headers:
- add forward declarations for types referenced by pointers
- add includes when types need them
This fixes potential compilation problems if the headers are reordered
or the missing includes are not provided indirectly.
Signed-off-by: David Sterba <dsterba@suse.com>