Several functions currently populate an array of page pointers one
allocated page at a time. Factor out the common code so as to allow
improvements to all of the sites at once.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Explicit type casts are not necessary when it's void* to another pointer
type.
Signed-off-by: Yu Zhe <yuzhe@nfschina.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The reason why we only support 64K page size for subpage is, for 64K
page size we can ensure no matter what the nodesize is, we can fit it
into one page.
When other page size come, especially like 16K, the limitation is a bit
limiting.
To remove such limitation, we allow nodesize >= PAGE_SIZE case to go the
non-subpage routine. By this, we can allow 4K sectorsize on 16K page
size.
Although this introduces another smaller limitation, the metadata can
not cross page boundary, which is already met by most recent mkfs.
Another small improvement is, we can avoid the overhead for metadata if
nodesize >= PAGE_SIZE.
For 4K sector size and 64K page size/node size, or 4K sector size and
16K page size/node size, we don't need to allocate extra memory for the
metadata pages.
Please note that, this patch will not yet enable other page size support
yet.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The original code resets the page to 0x1 for not apparent reason, it's
been like that since the initial 2007 code added in commit 07157aacb1
("Btrfs: Add file data csums back in via hooks in the extent map code").
It could mean that a failed buffer can be detected from the data but
that's just a guess and any value is good.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
When doing a NOWAIT direct IO write, if we can NOCOW then it means we can
proceed with the non-blocking, NOWAIT path. However reserving the metadata
space and qgroup meta space can often result in blocking - flushing
delalloc, wait for ordered extents to complete, trigger transaction
commits, etc, going against the semantics of a NOWAIT write.
So make the NOWAIT write path to try to reserve all the metadata it needs
without resulting in a blocking behaviour - if we get -ENOSPC or -EDQUOT
then return -EAGAIN to make the caller fallback to a blocking direct IO
write.
This is part of a patchset comprised of the following patches:
btrfs: avoid blocking on page locks with nowait dio on compressed range
btrfs: avoid blocking nowait dio when locking file range
btrfs: avoid double nocow check when doing nowait dio writes
btrfs: stop allocating a path when checking if cross reference exists
btrfs: free path at can_nocow_extent() before checking for checksum items
btrfs: release path earlier at can_nocow_extent()
btrfs: avoid blocking when allocating context for nowait dio read/write
btrfs: avoid blocking on space revervation when doing nowait dio writes
The following test was run before and after applying this patchset:
$ cat io-uring-nodatacow-test.sh
#!/bin/bash
DEV=/dev/sdc
MNT=/mnt/sdc
MOUNT_OPTIONS="-o ssd -o nodatacow"
MKFS_OPTIONS="-R free-space-tree -O no-holes"
NUM_JOBS=4
FILE_SIZE=8G
RUN_TIME=300
cat <<EOF > /tmp/fio-job.ini
[io_uring_rw]
rw=randrw
fsync=0
fallocate=posix
group_reporting=1
direct=1
ioengine=io_uring
iodepth=64
bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5
filesize=$FILE_SIZE
runtime=$RUN_TIME
time_based
filename=foobar
directory=$MNT
numjobs=$NUM_JOBS
thread
EOF
echo performance | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
umount $MNT &> /dev/null
mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
fio /tmp/fio-job.ini
umount $MNT
The test was run a 12 cores box with 64G of ram, using a non-debug kernel
config (Debian's default config) and a spinning disk.
Result before the patchset:
READ: bw=407MiB/s (427MB/s), 407MiB/s-407MiB/s (427MB/s-427MB/s), io=119GiB (128GB), run=300175-300175msec
WRITE: bw=407MiB/s (427MB/s), 407MiB/s-407MiB/s (427MB/s-427MB/s), io=119GiB (128GB), run=300175-300175msec
Result after the patchset:
READ: bw=436MiB/s (457MB/s), 436MiB/s-436MiB/s (457MB/s-457MB/s), io=128GiB (137GB), run=300044-300044msec
WRITE: bw=435MiB/s (456MB/s), 435MiB/s-435MiB/s (456MB/s-456MB/s), io=128GiB (137GB), run=300044-300044msec
That's about +7.2% throughput for reads and +6.9% for writes.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing a NOWAIT direct IO read/write, we allocate a context object
(struct btrfs_dio_data) with GFP_NOFS, which can result in blocking
waiting for memory allocation (GFP_NOFS is __GFP_RECLAIM | __GFP_IO).
This is undesirable for the NOWAIT semantics, so do the allocation with
GFP_NOWAIT if we are serving a NOWAIT request and if the allocation fails
return -EAGAIN, so that the caller can fallback to a blocking context and
retry with a non-blocking write.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At can_nocow_extent(), we are releasing the path only after checking if
the block group that has the target extent is read only, and after
checking if there's delalloc in the range in case our extent is a
preallocated extent. The read only extent check can be expensive if we
have a very large filesystem with many block groups, as well as the
check for delalloc in the inode's io_tree in case the io_tree is big
due to IO on other file ranges.
Our path is holding a read lock on a leaf and there's no need to keep
the lock while doing those two checks, so release the path before doing
them, immediately after the last use of the leaf.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we look for checksum items, through csum_exist_in_range(), at
can_nocow_extent(), we no longer need the path that we have previously
allocated. Through csum_exist_in_range() -> btrfs_lookup_csums_range(),
we also end up allocating a path, so we are adding unnecessary extra
memory usage. So free the path before calling csum_exist_in_range().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_cross_ref_exist() we always allocate a path, but we really don't
need to because all its callers (only 2) already have an allocated path
that is not being used when they call btrfs_cross_ref_exist(). So change
btrfs_cross_ref_exist() to take a path as an argument and update both
its callers to pass in the unused path they have when they call
btrfs_cross_ref_exist().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing a NOWAIT direct IO write we are checking twice if we can COW
into the target file range using can_nocow_extent() - once at the very
beginning of the write path, at btrfs_write_check() via
check_nocow_nolock(), and later again at btrfs_get_blocks_direct_write().
The can_nocow_extent() function does a lot of expensive things - searching
for the file extent item in the inode's subvolume tree, searching for the
extent item in the extent tree, checking delayed references, etc, so it
isn't a very cheap call.
We can remove the first check at btrfs_write_check(), and add there a
quick check to verify if the inode has the NODATACOW or PREALLOC flags,
and quickly bail out if it doesn't have neither of those flags, as that
means we have to COW and therefore can't comply with the NOWAIT semantics.
After this we do only one call to can_nocow_extent(), while we are at
btrfs_get_blocks_direct_write(), where we have already locked the file
range and we did a try lock on the range before, at
btrfs_dio_iomap_begin() (since the previous patch in the series).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we are doing a NOWAIT direct IO read/write, we can block when locking
the file range at btrfs_dio_iomap_begin(), as it's possible the range (or
a part of it) is already locked by another task (mmap writes, another
direct IO read/write racing with us, fiemap, etc). We are also waiting for
completion of any ordered extent we find in the range, which also can
block us for a significant amount of time.
There's also the incorrect fallback to buffered IO (returning -ENOTBLK)
when we are dealing with a NOWAIT request and we can't proceed. In this
case we should be returning -EAGAIN, as falling back to buffered IO can
result in blocking for many different reasons, so that the caller can
delegate a retry to a context where blocking is more acceptable.
Fix these cases by:
1) Doing a try lock on the file range and failing with -EAGAIN if we
can not lock right away;
2) Fail with -EAGAIN if we find an ordered extent;
3) Return -EAGAIN instead of -ENOTBLK when we need to fallback to
buffered IO and we have a NOWAIT request.
This will also allow us to avoid a duplicated check that verifies if we
are able to do a NOCOW write for NOWAIT direct IO writes, done in the
next patch.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we are doing NOWAIT direct IO read/write and our inode has compressed
extents, we call filemap_fdatawrite_range() against the range in order
to wait for compressed writeback to complete, since the generic code at
iomap_dio_rw() calls filemap_write_and_wait_range() once, which is not
enough to wait for compressed writeback to complete.
This call to filemap_fdatawrite_range() can block on page locks, since
the first writepages() on a range that we will try to compress results
only in queuing a work to compress the data while holding the pages
locked.
Even though the generic code at iomap_dio_rw() will do the right thing
and return -EAGAIN for NOWAIT requests in case there are pages in the
range, we can still end up at btrfs_dio_iomap_begin() with pages in the
range because either of the following can happen:
1) Memory mapped writes, as we haven't locked the range yet;
2) Buffered reads might have started, which lock the pages, and we do
the filemap_fdatawrite_range() call before locking the file range.
So don't call filemap_fdatawrite_range() at btrfs_dio_iomap_begin() if we
are doing a NOWAIT read/write. Instead call filemap_range_needs_writeback()
to check if there are any locked, dirty, or under writeback pages, and
return -EAGAIN if that's the case.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have four different scenarios where we don't expect to find ordered
extents after locking a file range:
1) During plain fallocate;
2) During hole punching;
3) During zero range;
4) During reflinks (both cloning and deduplication).
This is because in all these cases we follow the pattern:
1) Lock the inode's VFS lock in exclusive mode;
2) Lock the inode's i_mmap_lock in exclusive node, to serialize with
mmap writes;
3) Flush delalloc in a file range and wait for all ordered extents
to complete - both done through btrfs_wait_ordered_range();
4) Lock the file range in the inode's io_tree.
So add a helper that asserts that we don't have ordered extents for a
given range. Make the four scenarios listed above use this helper after
locking the respective file range.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
According to the tree checker, "all xattrs with a given objectid follow
the inode with that objectid in the tree" is an invariant. This was
broken by the recent change "btrfs: move common inode creation code into
btrfs_create_new_inode()", which moved acl creation and property
inheritance (stored in xattrs) to before inode insertion into the tree.
As a result, under certain timings, the xattrs could be written to the
tree before the inode, causing the tree checker to report violation of
the invariant.
Move property inheritance and acl creation back to their old ordering
after the inode insertion.
Suggested-by: Omar Sandoval <osandov@osandov.com>
Reported-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: David Sterba <dsterba@suse.com>
All of our inode creation code paths duplicate the calls to
btrfs_init_inode_security() and btrfs_add_link(). Subvolume creation
additionally duplicates property inheritance and the call to
btrfs_set_inode_index(). Fix this by moving the common code into
btrfs_create_new_inode(). This accomplishes a few things at once:
1. It reduces code duplication.
2. It allows us to set up the inode completely before inserting the
inode item, removing calls to btrfs_update_inode().
3. It fixes a leak of an inode on disk in some error cases. For example,
in btrfs_create(), if btrfs_new_inode() succeeds, then we have
inserted an inode item and its inode ref. However, if something after
that fails (e.g., btrfs_init_inode_security()), then we end the
transaction and then decrement the link count on the inode. If the
transaction is committed and the system crashes before the failed
inode is deleted, then we leak that inode on disk. Instead, this
refactoring aborts the transaction when we can't recover more
gracefully.
4. It exposes various ways that subvolume creation diverges from mkdir
in terms of inheriting flags, properties, permissions, and POSIX
ACLs, a lot of which appears to be accidental. This patch explicitly
does _not_ change the existing non-standard behavior, but it makes
those differences more clear in the code and documents them so that
we can discuss whether they should be changed.
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The various inode creation code paths do not account for the compression
property, POSIX ACLs, or the parent inode item when starting a
transaction. Fix it by refactoring all of these code paths to use a new
function, btrfs_new_inode_prepare(), which computes the correct number
of items. To do so, it needs to know whether POSIX ACLs will be created,
so move the ACL creation into that function. To reduce the number of
arguments that need to be passed around for inode creation, define
struct btrfs_new_inode_args containing all of the relevant information.
btrfs_new_inode_prepare() will also be a good place to set up the
fscrypt context and encrypted filename in the future.
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_{mknod,create,mkdir}() are now identical other than the inode
initialization and some inconsequential function call order differences.
Factor out the common code to reduce code duplication.
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of calling new_inode() and inode_init_owner() inside of
btrfs_new_inode(), do it in the callers. This allows us to pass in just
the inode instead of the mnt_userns and mode and removes the need for
memalloc_nofs_{save,restores}() since we do it before starting a
transaction. In create_subvol(), it also means we no longer have to look
up the inode again to instantiate it. This also paves the way for some
more cleanups in later patches.
This also removes the comments about Smack checking i_op, which are no
longer true since commit 5d6c31910b ("xattr: Add
__vfs_{get,set,remove}xattr helpers"). Now it checks inode->i_opflags &
IOP_XATTR, which is set based on sb->s_xattr.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function can be simplified by refactoring to use the new iterator
macro. No functional changes.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_new_inode() inherits the inode flags from the parent directory and
the mount options _after_ we fill the inode item. This works because all
of the callers of btrfs_new_inode() make further changes to the inode
and then call btrfs_update_inode(). It'd be better to fully initialize
the inode once to avoid the extra update, so as a first step, set the
inode flags _before_ filling the inode item.
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Every call of btrfs_new_inode() is immediately preceded by a call to
btrfs_get_free_objectid(). Since getting an inode number is part of
creating a new inode, this is better off being moved into
btrfs_new_inode(). While we're here, get rid of the comment about
reclaiming inode numbers, since we only did that when using the ino
cache, which was removed by commit 5297199a8b ("btrfs: remove inode
number cache feature").
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For everything other than a subvolume root inode, we get the parent
objectid from the parent directory. For the subvolume root inode, the
parent objectid is the same as the inode's objectid. We can find this
within btrfs_new_inode() instead of passing it.
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_new_inode() already returns an inode with nlink set to 1 (via
inode_init_always()). Get rid of the unnecessary set.
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
new_inode() always returns an inode with i_blocks and i_bytes set to 0
(via inode_init_always()). Remove the unnecessary call to
inode_set_bytes() in btrfs_new_inode().
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_new_inode() always returns an inode with i_size and disk_i_size
set to 0 (via inode_init_always() and btrfs_alloc_inode(),
respectively). Remove the unnecessary calls to btrfs_i_size_write() in
btrfs_mkdir() and btrfs_create_subvol_root().
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is a trivial wrapper around btrfs_add_link(). The only thing it
does other than moving arguments around is translating a > 0 return
value to -EEXIST. As far as I can tell, btrfs_add_link() won't return >
0 (and if it did, the existing callsites in, e.g., btrfs_mkdir() would
be broken). The check itself dates back to commit 2c90e5d658 ("Btrfs:
still corruption hunting"), so it's probably left over from debugging.
Let's just get rid of btrfs_add_nondir().
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_rename() and btrfs_rename_exchange() don't account for enough
items. Replace the incorrect explanations with a specific breakdown of
the number of items and account them accurately.
Note that this glosses over RENAME_WHITEOUT because the next commit is
going to rework that, too.
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
__btrfs_unlink_inode() calls btrfs_update_inode() on the parent
directory in order to update its size and sequence number. Make sure we
account for it.
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmJwBoQACgkQxWXV+ddt
WDu/FQ/+L2LpN5Zu1NkjOAh2Lcvz5RYZjcVext4bbPUW2yhXYH3e6836R/feLOCG
RxRICOAQhJ7I6ct/N1aToI2AbjWnMSJK+IgageA1UdIS8McbcSP/qJOYwJ/+2Xhl
AvK5psoj+qwGbTI9e0luNe6b+UWTGIMyYRjmN0SlkBOdg9/xqQpBMQxfKJumMvEc
3ZwLpcNjhUUwdFKHvHZNCOQhZiZwloKFeq9MLaEL5LO30wKSY6ShiCA3pafFoVN0
mvEEVtIGgZUsgeQTzSzD8UhGDvZtZ1+aaX0dcNMRzwI2h2pkBmPkd5QtFM9Qs0xP
hGibSN9bC/SzQyE9v4cKohwS+g4dE4r+dUWFNpdZLIOpBt5PJBDA0tjcjxquFtMr
6JX77coAl9kt0jspBmHVPb3qmIc1Xo3Iw2PrVgTK14QUo46XwF5Rga68wKOfNt0u
LbD9+KCLnwxoOhvXh/LJX6nvPT8tuMrT/5AOXULI2oMCnpCY6Vl+kX8zFlAfbDk1
d4/jy42bgHCso60j2vAcdQmZB+/snpboOhKJwkE2FqOTs+hBR8PBln6BqEt5xkHZ
q2mBfYDujnmZWaiAU6+ETjOcCooWQioi3335tp/C9TdOrIkFDij1ztYRxhgP0g0X
w6d6wlbkcol1Zubb/zEiBiwe+6GR/KtNYc4PBEfVe5Vw0npFreU=
=7k9f
-----END PGP SIGNATURE-----
Merge tag 'for-5.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more fixes mostly around how some file attributes could be set.
- fix handling of compression property:
- don't allow setting it on anything else than regular file or
directory
- do not allow setting it on nodatacow files via properties
- improved error handling when setting xattr
- make sure symlinks are always properly logged"
* tag 'for-5.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: skip compression property for anything other than files and dirs
btrfs: do not BUG_ON() on failure to update inode when setting xattr
btrfs: always log symlinks in full mode
btrfs: do not allow compression on nodatacow files
btrfs: export a helper for compression hard check
inode_can_compress will be used outside of inode.c to check the
availability of setting compression flag by xattr. This patch moves
this function as an internal helper and renames it to
btrfs_inode_can_compress.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Chung-Chiang Cheng <cccheng@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmJnGGIACgkQxWXV+ddt
WDumDw//cE1NcawdnVkEaKr20PetHfzPyFSIIr17nedtnVvWYyOFF/0uJlHNhv8Z
CZIfJ7fmH/pO5oWPXN84wKNfumDWNwc36QrvoXC67TrKUSiBN8BzL83HvAjGwYFH
G+LfZXGnVbqq8F1iYkIsuH0Oo1x/N/LPM3s6iZy3O4l8s96u+J4GRnc8Tr0AH4MA
zgz3fab8Ec378HTG9fvdAQNLxFEe0VatD6WrzILnmM8UgeQK7g73dqH9Ni9gz2DW
2GDlO6aevQ1G6dm2AJ0ItExnbHH7TfOThkG56Gdqrzb/d39GzrVpeob7QiorETus
EWS1rXaeikUiD4Bzt/RszUNL80yMN1DjcN3QBkiDf3ShSDFteoHMPw3e6jcQCy1m
Dxf5oditQqltuFNLeSiVbZEMw2kXqBP7RoPiirF9rdvrDNLHhAE9wu0kpSGSSvT7
Tyu9JyLw2axU6wGTi1GHAXurlW2ItRRyFAewWWul1lLkuz/6YXI4F/EHm3Mbh6Nh
pMIFMNr4Oafdx+3Ful8ZA4PynirXub/xVDefcFBibz/PTGEnHG4ZVzRudmVnowh7
GP2pql1+Y/TFkXdD98V8GWD+E10JAmNCkQSoiggJooNWR28whukmDVX/HY8lGmWg
DjxwGkte3SltUBWNOTGnO7546hMwOxOPZHENPh+gffYkeMeIxPI=
=xDWz
-----END PGP SIGNATURE-----
Merge tag 'for-5.18-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- direct IO fixes:
- restore passing file offset to correctly calculate checksums
when repairing on read and bio split happens
- use correct bio when sumitting IO on zoned filesystem
- zoned mode fixes:
- fix selection of device to correctly calculate device
capabilities when allocating a new bio
- use a dedicated lock for exclusion during relocation
- fix leaked plug after failure syncing log
- fix assertion during scrub and relocation
* tag 'for-5.18-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: zoned: use dedicated lock for data relocation
btrfs: fix assertion failure during scrub due to block group reallocation
btrfs: fix direct I/O writes for split bios on zoned devices
btrfs: fix direct I/O read repair for split bios
btrfs: fix and document the zoned device choice in alloc_new_bio
btrfs: fix leaked plug after failure syncing log on zoned filesystems
When a bio is split in btrfs_submit_direct, dip->file_offset contains
the file offset for the first bio. But this means the start value used
in btrfs_end_dio_bio to record the write location for zone devices is
incorrect for subsequent bios.
CC: stable@vger.kernel.org # 5.16+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
When a bio is split in btrfs_submit_direct, dip->file_offset contains
the file offset for the first bio. But this means the start value used
in btrfs_check_read_dio_bio is incorrect for subsequent bios. Add
a file_offset field to struct btrfs_bio to pass along the correct offset.
Given that check_data_csum only uses start of an error message this
means problems with this miscalculation will only show up when I/O fails
or checksums mismatch.
The logic was removed in f4f39fc5dc ("btrfs: remove btrfs_bio::logical
member") but we need it due to the bio splitting.
CC: stable@vger.kernel.org # 5.16+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmJUfvwACgkQxWXV+ddt
WDtiuA//csj0CrJq7wyRvgDkbPtCCMyDtL4zfmjP4s++NWaMDOKTxBU8msuGUJgR
Xribel2zWqlFiOzptd9sGEfxfKfz1p5Rm/gtFj57WVSGV7YtiAGyFGuzn/vpCrgq
NP5LFY2z5N36VxDXUPKHzvdqczO5W2n9KdaysJCr6FpCz+vVrplFiT5U+X175Sgg
zWS/XrPIHYbtaEFdb3rUKol6riu7vXW3MxEA9di8K4Xo9TJrp0NtBoGZDsWFQxf7
vfVwtYsQPsACJxw+MjBcVQ5fNXO5iATL1JfBb9ltN589xouja7bDCb40Fm2gEwWB
IUatnCq/4MN2S2NdPtEcXQ52W9svT/87z6ZblefSEiQqvQBBJHN131RTM/s8LBG4
fkHcGV6PsiutSIFycrID0bpXr1Mhvg2aMjxdriLPBtYopaMPh+ivK6LPYoE5MggQ
/rshWfjiWJhPKXPsng+H7UGbViOiYeG0kUBuaqFx4ARnESpN1gF2dJRYvXYFL/8a
Q0wmLr1tf3M82VAaAFNOl/BVk8blutCSHcJDLDKxcl3DhVVlY5J8onEPXjneJRkk
rRB3zoxLlptgfW75CPNyFrpicPdAXzGCoccienIUoEdHSX/W5rNA4L6XpVE5Hv94
dWR1aVAbCUcuhY1QfBU7H2iYw0RHzqDO3+IRfqJuYsLGciijLbs=
=APMY
-----END PGP SIGNATURE-----
Merge tag 'for-5.18-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more code and warning fixes.
There's one feature ioctl removal patch slated for 5.18 that did not
make it to the main pull request. It's just a one-liner and the ioctl
has a v2 that's in use for a long time, no point to postpone it to
5.19.
Late update:
- remove balance v1 ioctl, superseded by v2 in 2012
Fixes:
- add back cgroup attribution for compressed writes
- add super block write start/end annotations to asynchronous balance
- fix root reference count on an error handling path
- in zoned mode, activate zone at the chunk allocation time to avoid
ENOSPC due to timing issues
- fix delayed allocation accounting for direct IO
Warning fixes:
- simplify assertion condition in zoned check
- remove an unused variable"
* tag 'for-5.18-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix btrfs_submit_compressed_write cgroup attribution
btrfs: fix root ref counts in error handling in btrfs_get_root_ref
btrfs: zoned: activate block group only for extent allocation
btrfs: return allocated block group from do_chunk_alloc()
btrfs: mark resumed async balance as writing
btrfs: remove support of balance v1 ioctl
btrfs: release correct delalloc amount in direct IO write path
btrfs: remove unused variable in btrfs_{start,write}_dirty_block_groups()
btrfs: zoned: remove redundant condition in btrfs_run_delalloc_range
Running generic/406 causes the following WARNING in btrfs_destroy_inode()
which tells there are outstanding extents left.
In btrfs_get_blocks_direct_write(), we reserve a temporary outstanding
extents with btrfs_delalloc_reserve_metadata() (or indirectly from
btrfs_delalloc_reserve_space(()). We then release the outstanding extents
with btrfs_delalloc_release_extents(). However, the "len" can be modified
in the COW case, which releases fewer outstanding extents than expected.
Fix it by calling btrfs_delalloc_release_extents() for the original length.
To reproduce the warning, the filesystem should be 1 GiB. It's
triggering a short-write, due to not being able to allocate a large
extent and instead allocating a smaller one.
WARNING: CPU: 0 PID: 757 at fs/btrfs/inode.c:8848 btrfs_destroy_inode+0x1e6/0x210 [btrfs]
Modules linked in: btrfs blake2b_generic xor lzo_compress
lzo_decompress raid6_pq zstd zstd_decompress zstd_compress xxhash zram
zsmalloc
CPU: 0 PID: 757 Comm: umount Not tainted 5.17.0-rc8+ #101
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS d55cb5a 04/01/2014
RIP: 0010:btrfs_destroy_inode+0x1e6/0x210 [btrfs]
RSP: 0018:ffffc9000327bda8 EFLAGS: 00010206
RAX: 0000000000000000 RBX: ffff888100548b78 RCX: 0000000000000000
RDX: 0000000000026900 RSI: 0000000000000000 RDI: ffff888100548b78
RBP: ffff888100548940 R08: 0000000000000000 R09: ffff88810b48aba8
R10: 0000000000000001 R11: ffff8881004eb240 R12: ffff88810b48a800
R13: ffff88810b48ec08 R14: ffff88810b48ed00 R15: ffff888100490c68
FS: 00007f8549ea0b80(0000) GS:ffff888237c00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f854a09e733 CR3: 000000010a2e9003 CR4: 0000000000370eb0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
destroy_inode+0x33/0x70
dispose_list+0x43/0x60
evict_inodes+0x161/0x1b0
generic_shutdown_super+0x2d/0x110
kill_anon_super+0xf/0x20
btrfs_kill_super+0xd/0x20 [btrfs]
deactivate_locked_super+0x27/0x90
cleanup_mnt+0x12c/0x180
task_work_run+0x54/0x80
exit_to_user_mode_prepare+0x152/0x160
syscall_exit_to_user_mode+0x12/0x30
do_syscall_64+0x42/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f854a000fb7
Fixes: f0bfa76a11 ("btrfs: fix ENOSPC failure when attempting direct IO write into NOCOW range")
CC: stable@vger.kernel.org # 5.17
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The logic !A || A && B is equivalent to !A || B. so we can
make code clear.
Note: though it's preferred to be in the more human readable form, there
have been repeated reports and patches as the expression is detected by
tools so apply it to reduce the load.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Haowen Bai <baihaowen@meizu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmJLaJoACgkQxWXV+ddt
WDvd3g/+KXpWfPOg4h7rCKiZJoE0/XnAaqHacGmIy/8+TqiFm7VEdG2uA4wEjqYq
yeCr+2NhsWcGKOWARUPmP8sBmla98tE0+1abxQFYhGdsLMcgbI6EyW+j53E7++gY
0j4ZY4s7c1c4lsyStNrS7kauDcbZ3MHWhG1VNdTyWCf6wal/5jU93SwdDSz81Int
iD8y2nVQoHcWdyBbPA7t9dYL5cCGR425gRrSb3Yhvc3cI6KJLbPLDpt1AAnR6XVx
W/ELKYMVBF70nTqp3ptBTfbmm83/AcrA1/Epoe2sEV+3gaejx1vkIYwcWlvgW//+
e980zd0k/QSse8O8s66Z7c9QnLML+L/6xxK2vJObw85NFzEGkrmoZdCa7HyyI3MS
pkI0ox9z4I4OEgzBaH8ZpUYgRxQlRnbDv58GrTs/aEBuNaHGQJfiCmh4sx75jGvR
eXMnqmL/EIKqZqTsLfWVuMLC7ZgTG8V76VOJ3gDE4v5Yxi306bCLcdTS+0HYz5GA
fkCisFy69jSbkMbOClTegCkYiRHxV6GjI19yPqj29OsnFpk+htnubOvgs3Q1link
odpBavejmbVxffrYw92Qm7NRMEldZoSaMa39zFJ9Wgtwui7Oe66K3JFlrSnth5mD
YowfkApQCzsSiXzitwLEHmfs7F1MVh2a0jY4hTz1XrizxZPWKHE=
=XZaw
-----END PGP SIGNATURE-----
Merge tag 'for-5.18-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- prevent deleting subvolume with active swapfile
- fix qgroup reserve limit calculation overflow
- remove device count in superblock and its item in one transaction so
they cant't get out of sync
- skip defragmenting an isolated sector, this could cause some extra IO
- unify handling of mtime/permissions in hole punch with fallocate
- zoned mode fixes:
- remove assert checking for only single mode, we have the
DUP mode implemented
- fix potential lockdep warning while traversing devices
when checking for zone activation
* tag 'for-5.18-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: prevent subvol with swapfile from being deleted
btrfs: do not warn for free space inode in cow_file_range
btrfs: avoid defragging extents whose next extents are not targets
btrfs: fix fallocate to use file_modified to update permissions consistently
btrfs: remove device item and update super block in the same transaction
btrfs: fix qgroup reserve overflow the qgroup limit
btrfs: zoned: remove left over ASSERT checking for single profile
btrfs: zoned: traverse devices under chunk_mutex in btrfs_can_activate_zone
While btrfs doesn't use large folios yet, this should have been changed
as part of the conversion from invalidatepage to invalidate_folio.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
A subvolume with an active swapfile must not be deleted otherwise it
would not be possible to deactivate it.
After the subvolume is deleted, we cannot swapoff the swapfile in this
deleted subvolume because the path is unreachable. The swapfile is
still active and holding references, the filesystem cannot be unmounted.
The test looks like this:
mkfs.btrfs -f $dev > /dev/null
mount $dev $mnt
btrfs sub create $mnt/subvol
touch $mnt/subvol/swapfile
chmod 600 $mnt/subvol/swapfile
chattr +C $mnt/subvol/swapfile
dd if=/dev/zero of=$mnt/subvol/swapfile bs=1K count=4096
mkswap $mnt/subvol/swapfile
swapon $mnt/subvol/swapfile
btrfs sub delete $mnt/subvol
swapoff $mnt/subvol/swapfile # failed: No such file or directory
swapoff --all
unmount $mnt # target is busy.
To prevent above issue, we simply check that whether the subvolume
contains any active swapfile, and stop the deleting process. This
behavior is like snapshot ioctl dealing with a swapfile.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Kaiwen Hu <kevinhu@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is a long time leftover from when I originally added the free space
inode, the point was to catch cases where we weren't honoring the NOCOW
flag. However there exists a race with relocation, if we allocate our
free space inode in a block group that is about to be relocated, we
could trigger the COW path before the relocation has the opportunity to
find the extents and delete the free space cache. In production where
we have auto-relocation enabled we're seeing this WARN_ON_ONCE() around
5k times in a 2 week period, so not super common but enough that it's at
the top of our metrics.
We're properly handling the error here, and with us phasing out v1 space
cache anyway just drop the WARN_ON_ONCE.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Primarily this series converts some of the address_space operations
to take a folio instead of a page.
->is_partially_uptodate() takes a folio instead of a page and changes the
type of the 'from' and 'count' arguments to make it obvious they're bytes.
->invalidatepage() becomes ->invalidate_folio() and has a similar type change.
->launder_page() becomes ->launder_folio()
->set_page_dirty() becomes ->dirty_folio() and adds the address_space as
an argument.
There are a couple of other misc changes up front that weren't worth
separating into their own pull request.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmI4hqMACgkQDpNsjXcp
gj7r7Af/fVJ7m8kKqjP/IayX3HiJRuIDQw+vM++BlRNXdjz+IyED6whdmFGxJeOY
BMyT+8ApOAz7ErS4G+7fAv4ScJK/aEgFUsnSeAiCp0PliiEJ5NNJzElp6sVmQ7H5
SX7+Ek444FZUGsQuy0qL7/ELpR3ditnD7x+5U2g0p5TeaHGUQn84crRyfR4xuhNG
EBD9D71BOb7OxUcOHe93pTkK51QsQ0aCrcIsB1tkK5KR0BAthn1HqF7ehL90Rvrr
omx5M7aDWGY4oj7IKrhlAs+55Ah2WaOzrZBp0FXNbr4UENDBKWKyUxErwa4xPkf6
Gm1iQG/CspOHnxN3YWsd5WjtlL3A+A==
=cOiq
-----END PGP SIGNATURE-----
Merge tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecache
Pull filesystem folio updates from Matthew Wilcox:
"Primarily this series converts some of the address_space operations to
take a folio instead of a page.
Notably:
- a_ops->is_partially_uptodate() takes a folio instead of a page and
changes the type of the 'from' and 'count' arguments to make it
obvious they're bytes.
- a_ops->invalidatepage() becomes ->invalidate_folio() and has a
similar type change.
- a_ops->launder_page() becomes ->launder_folio()
- a_ops->set_page_dirty() becomes ->dirty_folio() and adds the
address_space as an argument.
There are a couple of other misc changes up front that weren't worth
separating into their own pull request"
* tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecache: (53 commits)
fs: Remove aops ->set_page_dirty
fb_defio: Use noop_dirty_folio()
fs: Convert __set_page_dirty_no_writeback to noop_dirty_folio
fs: Convert __set_page_dirty_buffers to block_dirty_folio
nilfs: Convert nilfs_set_page_dirty() to nilfs_dirty_folio()
mm: Convert swap_set_page_dirty() to swap_dirty_folio()
ubifs: Convert ubifs_set_page_dirty to ubifs_dirty_folio
f2fs: Convert f2fs_set_node_page_dirty to f2fs_dirty_node_folio
f2fs: Convert f2fs_set_data_page_dirty to f2fs_dirty_data_folio
f2fs: Convert f2fs_set_meta_page_dirty to f2fs_dirty_meta_folio
afs: Convert afs_dir_set_page_dirty() to afs_dir_dirty_folio()
btrfs: Convert extent_range_redirty_for_io() to use folios
fs: Convert trivial uses of __set_page_dirty_nobuffers to filemap_dirty_folio
btrfs: Convert from set_page_dirty to dirty_folio
fscache: Convert fscache_set_page_dirty() to fscache_dirty_folio()
fs: Add aops->dirty_folio
fs: Remove aops->launder_page
orangefs: Convert launder_page to launder_folio
nfs: Convert from launder_page to launder_folio
fuse: Convert from launder_page to launder_folio
...
The inode allocation is supposed to use alloc_inode_sb(), so convert
kmem_cache_alloc() of all filesystems to alloc_inode_sb().
Link: https://lkml.kernel.org/r/20220228122126.37293-5-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Theodore Ts'o <tytso@mit.edu> [ext4]
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Alex Shi <alexs@kernel.org>
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Fam Zheng <fam.zheng@bytedance.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kari Argillander <kari.argillander@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These filesystems use __set_page_dirty_nobuffers() either directly or
with a very thin wrapper; convert them en masse.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
A lot of the underlying infrastructure in btrfs needs to be switched
over to folios, but this at least documents that invalidatepage can't
be passed a tail page.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
When an inode has a last_reflink_trans matching the current transaction,
we have to take special care when logging its checksums in order to
avoid getting checksum items with overlapping ranges in a log tree,
which could result in missing checksums after log replay (more on that
in the changelogs of commit 40e046acbd ("Btrfs: fix missing data
checksums after replaying a log tree") and commit e289f03ea7 ("btrfs:
fix corrupt log due to concurrent fsync of inodes with shared extents")).
We also need to make sure a full fsync will copy all old file extent
items it finds in modified leaves, because they might have been copied
from some other inode.
However once we fsync an inode, we don't need to keep paying the price of
that extra special care in future fsyncs done in the same transaction,
unless the inode is used for another reflink operation or the full sync
flag is set on it (truncate, failure to allocate extent maps for holes,
and other exceptional and infrequent cases).
So after we fsync an inode reset its last_unlink_trans to zero. In case
another reflink happens, we continue to update the last_reflink_trans of
the inode, just as before. Also set last_reflink_trans to the generation
of the last transaction that modified the inode whenever we need to set
the full sync flag on the inode, just like when we need to load an inode
from disk after eviction.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
I hit some weird panics while fixing up the error handling from
btrfs_lookup_bio_sums(). Turns out the compression path will complete
the bio we use if we set up any of the compression bios and then return
an error, and then btrfs_submit_data_bio() will also call bio_endio() on
the bio.
Fix this by making btrfs_submit_compressed_read() responsible for
calling bio_endio() on the bio if there are any errors. Currently it
was only doing it if we created the compression bios, otherwise it was
depending on btrfs_submit_data_bio() to do the right thing. This
creates the above problem, so fix up btrfs_submit_compressed_read() to
always call bio_endio() in case of an error, and then simply return from
btrfs_submit_data_bio() if we had to call
btrfs_submit_compressed_read().
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The implementation resembles direct I/O: we have to flush any ordered
extents, invalidate the page cache, and do the io tree/delalloc/extent
map/ordered extent dance. From there, we can reuse the compression code
with a minor modification to distinguish the write from writeback. This
also creates inline extents when possible.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are 4 main cases:
1. Inline extents: we copy the data straight out of the extent buffer.
2. Hole/preallocated extents: we fill in zeroes.
3. Regular, uncompressed extents: we read the sectors we need directly
from disk.
4. Regular, compressed extents: we read the entire compressed extent
from disk and indicate what subset of the decompressed extent is in
the file.
This initial implementation simplifies a few things that can be improved
in the future:
- Cases 1, 3, and 4 allocate temporary memory to read into before
copying out to userspace.
- We don't do read repair, because it turns out that read repair is
currently broken for compressed data.
- We hold the inode lock during the operation.
Note that we don't need to hold the mmap lock. We may race with
btrfs_page_mkwrite() and read the old data from before the page was
dirtied:
btrfs_page_mkwrite btrfs_encoded_read
---------------------------------------------------
(enter) (enter)
btrfs_wait_ordered_range
lock_extent_bits
btrfs_page_set_dirty
unlock_extent_cached
(exit)
lock_extent_bits
read extent (dirty page hasn't been flushed,
so this is the old data)
unlock_extent_cached
(exit)
we read the old data from before the page was dirtied. But, that's true
even if we were to hold the mmap lock:
btrfs_page_mkwrite btrfs_encoded_read
-------------------------------------------------------------------
(enter) (enter)
btrfs_inode_lock(BTRFS_ILOCK_MMAP)
down_read(i_mmap_lock) (blocked)
btrfs_wait_ordered_range
lock_extent_bits
read extent (page hasn't been dirtied,
so this is the old data)
unlock_extent_cached
btrfs_inode_unlock(BTRFS_ILOCK_MMAP)
down_read(i_mmap_lock) returns
lock_extent_bits
btrfs_page_set_dirty
unlock_extent_cached
In other words, this is inherently racy, so it's fine that we return the
old data in this tiny window.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, an inline extent is always created after i_size is extended
from btrfs_dirty_pages(). However, for encoded writes, we only want to
update i_size after we successfully created the inline extent. Add an
update_i_size parameter to cow_file_range_inline() and
insert_inline_extent() and pass in the size of the extent rather than
determining it from i_size.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ reformat comment ]
Signed-off-by: David Sterba <dsterba@suse.com>