Commit Graph

2193 Commits

Author SHA1 Message Date
Christoph Hellwig
36e8c62273 btrfs: add a btrfs_dio_rw wrapper
Add a wrapper around iomap_dio_rw that keeps the direct I/O internals
isolated in inode.c.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:17:32 +02:00
David Sterba
cb3a12d988 btrfs: rename bio_flags in parameters and switch type
Several functions take parameter bio_flags that was simplified to just
compress type, unify it and change the type accordingly.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:17:31 +02:00
David Sterba
2a5232a8ce btrfs: simplify handling of bio_ctrl::bio_flags
The bio_flags are used only to encode the compression and there are no
other EXTENT_BIO_* flags, so the compress type can be stored directly.
The struct member name is left unchanged and will be cleaned in later
patches.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:17:31 +02:00
David Sterba
a6f5e39ee7 btrfs: remove unused parameter bio_flags from btrfs_wq_submit_bio
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:17:31 +02:00
Filipe Manana
f5585f4f0e btrfs: fix deadlock between concurrent dio writes when low on free data space
When reserving data space for a direct IO write we can end up deadlocking
if we have multiple tasks attempting a write to the same file range, there
are multiple extents covered by that file range, we are low on available
space for data and the writes don't expand the inode's i_size.

The deadlock can happen like this:

1) We have a file with an i_size of 1M, at offset 0 it has an extent with
   a size of 128K and at offset 128K it has another extent also with a
   size of 128K;

2) Task A does a direct IO write against file range [0, 256K), and because
   the write is within the i_size boundary, it takes the inode's lock (VFS
   level) in shared mode;

3) Task A locks the file range [0, 256K) at btrfs_dio_iomap_begin(), and
   then gets the extent map for the extent covering the range [0, 128K).
   At btrfs_get_blocks_direct_write(), it creates an ordered extent for
   that file range ([0, 128K));

4) Before returning from btrfs_dio_iomap_begin(), it unlocks the file
   range [0, 256K);

5) Task A executes btrfs_dio_iomap_begin() again, this time for the file
   range [128K, 256K), and locks the file range [128K, 256K);

6) Task B starts a direct IO write against file range [0, 256K) as well.
   It also locks the inode in shared mode, as it's within the i_size limit,
   and then tries to lock file range [0, 256K). It is able to lock the
   subrange [0, 128K) but then blocks waiting for the range [128K, 256K),
   as it is currently locked by task A;

7) Task A enters btrfs_get_blocks_direct_write() and tries to reserve data
   space. Because we are low on available free space, it triggers the
   async data reclaim task, and waits for it to reserve data space;

8) The async reclaim task decides to wait for all existing ordered extents
   to complete (through btrfs_wait_ordered_roots()).
   It finds the ordered extent previously created by task A for the file
   range [0, 128K) and waits for it to complete;

9) The ordered extent for the file range [0, 128K) can not complete
   because it blocks at btrfs_finish_ordered_io() when trying to lock the
   file range [0, 128K).

   This results in a deadlock, because:

   - task B is holding the file range [0, 128K) locked, waiting for the
     range [128K, 256K) to be unlocked by task A;

   - task A is holding the file range [128K, 256K) locked and it's waiting
     for the async data reclaim task to satisfy its space reservation
     request;

   - the async data reclaim task is waiting for ordered extent [0, 128K)
     to complete, but the ordered extent can not complete because the
     file range [0, 128K) is currently locked by task B, which is waiting
     on task A to unlock file range [128K, 256K) and task A waiting
     on the async data reclaim task.

   This results in a deadlock between 4 task: task A, task B, the async
   data reclaim task and the task doing ordered extent completion (a work
   queue task).

This type of deadlock can sporadically be triggered by the test case
generic/300 from fstests, and results in a stack trace like the following:

[12084.033689] INFO: task kworker/u16:7:123749 blocked for more than 241 seconds.
[12084.034877]       Not tainted 5.18.0-rc2-btrfs-next-115 #1
[12084.035562] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[12084.036548] task:kworker/u16:7   state:D stack:    0 pid:123749 ppid:     2 flags:0x00004000
[12084.036554] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
[12084.036599] Call Trace:
[12084.036601]  <TASK>
[12084.036606]  __schedule+0x3cb/0xed0
[12084.036616]  schedule+0x4e/0xb0
[12084.036620]  btrfs_start_ordered_extent+0x109/0x1c0 [btrfs]
[12084.036651]  ? prepare_to_wait_exclusive+0xc0/0xc0
[12084.036659]  btrfs_run_ordered_extent_work+0x1a/0x30 [btrfs]
[12084.036688]  btrfs_work_helper+0xf8/0x400 [btrfs]
[12084.036719]  ? lock_is_held_type+0xe8/0x140
[12084.036727]  process_one_work+0x252/0x5a0
[12084.036736]  ? process_one_work+0x5a0/0x5a0
[12084.036738]  worker_thread+0x52/0x3b0
[12084.036743]  ? process_one_work+0x5a0/0x5a0
[12084.036745]  kthread+0xf2/0x120
[12084.036747]  ? kthread_complete_and_exit+0x20/0x20
[12084.036751]  ret_from_fork+0x22/0x30
[12084.036765]  </TASK>
[12084.036769] INFO: task kworker/u16:11:153787 blocked for more than 241 seconds.
[12084.037702]       Not tainted 5.18.0-rc2-btrfs-next-115 #1
[12084.038540] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[12084.039506] task:kworker/u16:11  state:D stack:    0 pid:153787 ppid:     2 flags:0x00004000
[12084.039511] Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
[12084.039551] Call Trace:
[12084.039553]  <TASK>
[12084.039557]  __schedule+0x3cb/0xed0
[12084.039566]  schedule+0x4e/0xb0
[12084.039569]  schedule_timeout+0xed/0x130
[12084.039573]  ? mark_held_locks+0x50/0x80
[12084.039578]  ? _raw_spin_unlock_irq+0x24/0x50
[12084.039580]  ? lockdep_hardirqs_on+0x7d/0x100
[12084.039585]  __wait_for_common+0xaf/0x1f0
[12084.039587]  ? usleep_range_state+0xb0/0xb0
[12084.039596]  btrfs_wait_ordered_extents+0x3d6/0x470 [btrfs]
[12084.039636]  btrfs_wait_ordered_roots+0x175/0x240 [btrfs]
[12084.039670]  flush_space+0x25b/0x630 [btrfs]
[12084.039712]  btrfs_async_reclaim_data_space+0x108/0x1b0 [btrfs]
[12084.039747]  process_one_work+0x252/0x5a0
[12084.039756]  ? process_one_work+0x5a0/0x5a0
[12084.039758]  worker_thread+0x52/0x3b0
[12084.039762]  ? process_one_work+0x5a0/0x5a0
[12084.039765]  kthread+0xf2/0x120
[12084.039766]  ? kthread_complete_and_exit+0x20/0x20
[12084.039770]  ret_from_fork+0x22/0x30
[12084.039783]  </TASK>
[12084.039800] INFO: task kworker/u16:17:217907 blocked for more than 241 seconds.
[12084.040709]       Not tainted 5.18.0-rc2-btrfs-next-115 #1
[12084.041398] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[12084.042404] task:kworker/u16:17  state:D stack:    0 pid:217907 ppid:     2 flags:0x00004000
[12084.042411] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
[12084.042461] Call Trace:
[12084.042463]  <TASK>
[12084.042471]  __schedule+0x3cb/0xed0
[12084.042485]  schedule+0x4e/0xb0
[12084.042490]  wait_extent_bit.constprop.0+0x1eb/0x260 [btrfs]
[12084.042539]  ? prepare_to_wait_exclusive+0xc0/0xc0
[12084.042551]  lock_extent_bits+0x37/0x90 [btrfs]
[12084.042601]  btrfs_finish_ordered_io.isra.0+0x3fd/0x960 [btrfs]
[12084.042656]  ? lock_is_held_type+0xe8/0x140
[12084.042667]  btrfs_work_helper+0xf8/0x400 [btrfs]
[12084.042716]  ? lock_is_held_type+0xe8/0x140
[12084.042727]  process_one_work+0x252/0x5a0
[12084.042742]  worker_thread+0x52/0x3b0
[12084.042750]  ? process_one_work+0x5a0/0x5a0
[12084.042754]  kthread+0xf2/0x120
[12084.042757]  ? kthread_complete_and_exit+0x20/0x20
[12084.042763]  ret_from_fork+0x22/0x30
[12084.042783]  </TASK>
[12084.042798] INFO: task fio:234517 blocked for more than 241 seconds.
[12084.043598]       Not tainted 5.18.0-rc2-btrfs-next-115 #1
[12084.044282] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[12084.045244] task:fio             state:D stack:    0 pid:234517 ppid:234515 flags:0x00004000
[12084.045248] Call Trace:
[12084.045250]  <TASK>
[12084.045254]  __schedule+0x3cb/0xed0
[12084.045263]  schedule+0x4e/0xb0
[12084.045266]  wait_extent_bit.constprop.0+0x1eb/0x260 [btrfs]
[12084.045298]  ? prepare_to_wait_exclusive+0xc0/0xc0
[12084.045306]  lock_extent_bits+0x37/0x90 [btrfs]
[12084.045336]  btrfs_dio_iomap_begin+0x336/0xc60 [btrfs]
[12084.045370]  ? lock_is_held_type+0xe8/0x140
[12084.045378]  iomap_iter+0x184/0x4c0
[12084.045383]  __iomap_dio_rw+0x2c6/0x8a0
[12084.045406]  iomap_dio_rw+0xa/0x30
[12084.045408]  btrfs_do_write_iter+0x370/0x5e0 [btrfs]
[12084.045440]  aio_write+0xfa/0x2c0
[12084.045448]  ? __might_fault+0x2a/0x70
[12084.045451]  ? kvm_sched_clock_read+0x14/0x40
[12084.045455]  ? lock_release+0x153/0x4a0
[12084.045463]  io_submit_one+0x615/0x9f0
[12084.045467]  ? __might_fault+0x2a/0x70
[12084.045469]  ? kvm_sched_clock_read+0x14/0x40
[12084.045478]  __x64_sys_io_submit+0x83/0x160
[12084.045483]  ? syscall_enter_from_user_mode+0x1d/0x50
[12084.045489]  do_syscall_64+0x3b/0x90
[12084.045517]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[12084.045521] RIP: 0033:0x7fa76511af79
[12084.045525] RSP: 002b:00007ffd6d6b9058 EFLAGS: 00000246 ORIG_RAX: 00000000000000d1
[12084.045530] RAX: ffffffffffffffda RBX: 00007fa75ba6e760 RCX: 00007fa76511af79
[12084.045532] RDX: 0000557b304ff3f0 RSI: 0000000000000001 RDI: 00007fa75ba4c000
[12084.045535] RBP: 00007fa75ba4c000 R08: 00007fa751b76000 R09: 0000000000000330
[12084.045537] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
[12084.045540] R13: 0000000000000000 R14: 0000557b304ff3f0 R15: 0000557b30521eb0
[12084.045561]  </TASK>

Fix this issue by always reserving data space before locking a file range
at btrfs_dio_iomap_begin(). If we can't reserve the space, then we don't
error out immediately - instead after locking the file range, check if we
can do a NOCOW write, and if we can we don't error out since we don't need
to allocate a data extent, however if we can't NOCOW then error out with
-ENOSPC. This also implies that we may end up reserving space when it's
not needed because the write will end up being done in NOCOW mode - in that
case we just release the space after we noticed we did a NOCOW write - this
is the same type of logic that is done in the path for buffered IO writes.

Fixes: f0bfa76a11 ("btrfs: fix ENOSPC failure when attempting direct IO write into NOCOW range")
CC: stable@vger.kernel.org # 5.17+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:17:31 +02:00
Goldwyn Rodrigues
1d8fa2e29b btrfs: derive compression type from extent map during reads
Derive the compression type from extent map as opposed to the bio flags
passed. This makes it more precise and not reliant on function
parameters.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:17:31 +02:00
Gabriel Niebler
48b36a602a btrfs: turn fs_roots_radix in btrfs_fs_info into an XArray
… rename it to simply fs_roots and adjust all usages of this object to use
the XArray API, because it is notionally easier to use and understand, as
it provides array semantics, and also takes care of locking for us,
further simplifying the code.

Also do some refactoring, esp. where the API change requires largely
rewriting some functions, anyway.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:15:57 +02:00
Gabriel Niebler
253bf57555 btrfs: turn delayed_nodes_tree into an XArray
… in the btrfs_root struct and adjust all usages of this object to use
the XArray API, because it is notionally easier to use and understand,
as it provides array semantics, and also takes care of locking for us,
further simplifying the code.

Also use the opportunity to do some light refactoring.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:16 +02:00
Christoph Hellwig
ad357938c6 btrfs: do not return errors from submit_bio_hook_t instances
Both btrfs_repair_one_sector and submit_bio_one as the direct caller of
one of the instances ignore errors as they expect the methods themselves
to call ->bi_end_io on error.  Remove the unused and dangerous return
value.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:14 +02:00
Christoph Hellwig
cb4411dd57 btrfs: do not return errors from btrfs_submit_compressed_read
btrfs_submit_compressed_read already calls ->bi_end_io on error and
the caller must ignore the return value, so remove it.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:14 +02:00
Christoph Hellwig
7aab8b3282 btrfs: move btrfs_readpage to extent_io.c
Keep btrfs_readpage next to btrfs_do_readpage and the other address
space operations.  This allows to keep submit_one_bio and
struct btrfs_bio_ctrl file local in extent_io.c.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:14 +02:00
Filipe Manana
2306e83e73 btrfs: avoid double search for block group during NOCOW writes
When doing a NOCOW write, either through direct IO or buffered IO, we do
two lookups for the block group that contains the target extent: once
when we call btrfs_inc_nocow_writers() and then later again when we call
btrfs_dec_nocow_writers() after creating the ordered extent.

The lookups require taking a lock and navigating the red black tree used
to track all block groups, which can take a non-negligible amount of time
for a large filesystem with thousands of block groups, as well as lock
contention and cache line bouncing.

Improve on this by having a single block group search: making
btrfs_inc_nocow_writers() return the block group to its caller and then
have the caller pass that block group to btrfs_dec_nocow_writers().

This is part of a patchset comprised of the following patches:

  btrfs: remove search start argument from first_logical_byte()
  btrfs: use rbtree with leftmost node cached for tracking lowest block group
  btrfs: use a read/write lock for protecting the block groups tree
  btrfs: return block group directly at btrfs_next_block_group()
  btrfs: avoid double search for block group during NOCOW writes

The following test was used to test these changes from a performance
perspective:

   $ cat test.sh
   #!/bin/bash

   modprobe null_blk nr_devices=0

   NULL_DEV_PATH=/sys/kernel/config/nullb/nullb0
   mkdir $NULL_DEV_PATH
   if [ $? -ne 0 ]; then
       echo "Failed to create nullb0 directory."
       exit 1
   fi
   echo 2 > $NULL_DEV_PATH/submit_queues
   echo 16384 > $NULL_DEV_PATH/size # 16G
   echo 1 > $NULL_DEV_PATH/memory_backed
   echo 1 > $NULL_DEV_PATH/power

   DEV=/dev/nullb0
   MNT=/mnt/nullb0
   LOOP_MNT="$MNT/loop"
   MOUNT_OPTIONS="-o ssd -o nodatacow"
   MKFS_OPTIONS="-R free-space-tree -O no-holes"

   cat <<EOF > /tmp/fio-job.ini
   [io_uring_writes]
   rw=randwrite
   fsync=0
   fallocate=posix
   group_reporting=1
   direct=1
   ioengine=io_uring
   iodepth=64
   bs=64k
   filesize=1g
   runtime=300
   time_based
   directory=$LOOP_MNT
   numjobs=8
   thread
   EOF

   echo performance | \
       tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

   echo
   echo "Using config:"
   echo
   cat /tmp/fio-job.ini
   echo

   umount $MNT &> /dev/null
   mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null
   mount $MOUNT_OPTIONS $DEV $MNT

   mkdir $LOOP_MNT

   truncate -s 4T $MNT/loopfile
   mkfs.btrfs -f $MKFS_OPTIONS $MNT/loopfile &> /dev/null
   mount $MOUNT_OPTIONS $MNT/loopfile $LOOP_MNT

   # Trigger the allocation of about 3500 data block groups, without
   # actually consuming space on underlying filesystem, just to make
   # the tree of block group large.
   fallocate -l 3500G $LOOP_MNT/filler

   fio /tmp/fio-job.ini

   umount $LOOP_MNT
   umount $MNT

   echo 0 > $NULL_DEV_PATH/power
   rmdir $NULL_DEV_PATH

The test was run on a non-debug kernel (Debian's default kernel config),
the result were the following.

Before patchset:

  WRITE: bw=1455MiB/s (1526MB/s), 1455MiB/s-1455MiB/s (1526MB/s-1526MB/s), io=426GiB (458GB), run=300006-300006msec

After patchset:

  WRITE: bw=1503MiB/s (1577MB/s), 1503MiB/s-1503MiB/s (1577MB/s-1577MB/s), io=440GiB (473GB), run=300006-300006msec

  +3.3% write throughput and +3.3% IO done in the same time period.

The test has somewhat limited coverage scope, as with only NOCOW writes
we get less contention on the red black tree of block groups, since we
don't have the extra contention caused by COW writes, namely when
allocating data extents, pinning and unpinning data extents, but on the
hand there's access to tree in the NOCOW path, when incrementing a block
group's number of NOCOW writers.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:13 +02:00
Qu Wenruo
c9583ada8c btrfs: avoid double clean up when submit_one_bio() failed
[BUG]
When running generic/475 with 64K page size and 4K sector size, it has a
very high chance (almost 100%) to hang, with mostly data page locked but
no one is going to unlock it.

[CAUSE]
With commit 1784b7d502 ("btrfs: handle csum lookup errors properly on
reads"), if we failed to lookup checksum due to metadata IO error, we
will return error for btrfs_submit_data_bio().

This will cause the page to be unlocked twice in btrfs_do_readpage():

 btrfs_do_readpage()
 |- submit_extent_page()
 |  |- submit_one_bio()
 |     |- btrfs_submit_data_bio()
 |        |- if (ret) {
 |        |-     bio->bi_status = ret;
 |        |-     bio_endio(bio); }
 |               In the endio function, we will call end_page_read()
 |               and unlock_extent() to cleanup the subpage range.
 |
 |- if (ret) {
 |-        unlock_extent(); end_page_read() }
           Here we unlock the extent and cleanup the subpage range
           again.

For unlock_extent(), it's mostly double unlock safe.

But for end_page_read(), it's not, especially for subpage case,
as for subpage case we will call btrfs_subpage_end_reader() to reduce
the reader number, and use that to number to determine if we need to
unlock the full page.

If double accounted, it can underflow the number and leave the page
locked without anyone to unlock it.

[FIX]
The commit 1784b7d502 ("btrfs: handle csum lookup errors properly on
reads") itself is completely fine, it's our existing code not properly
handling the error from bio submission hook properly.

This patch will make submit_one_bio() to return void so that the callers
will never be able to do cleanup when bio submission hook fails.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:13 +02:00
Filipe Manana
490243884e btrfs: use BTRFS_DIR_START_INDEX at btrfs_create_new_inode()
We are still using the magic value of 2 at btrfs_create_new_inode(), but
there's now a constant for that, named BTRFS_DIR_START_INDEX, which was
introduced in commit 528ee69712 ("btrfs: put initial index value of a
directory in a constant"). So change that to use the constant.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:13 +02:00
Filipe Manana
a7bb6bd4bd btrfs: do not test for free space inode during NOCOW check against file extent
When checking if we can do a NOCOW write against a range covered by a file
extent item, we do a quick a check to determine if the inode's root was
snapshotted in a generation older than the generation of the file extent
item or not. This is to quickly determine if the extent is likely shared
and avoid the expensive check for cross references (this was added in
commit 78d4295b1e ("btrfs: lift some btrfs_cross_ref_exist checks in
nocow path").

We restrict that check to the case where the inode is not a free space
inode (since commit 27a7ff554e ("btrfs: skip file_extent generation
check for free_space_inode in run_delalloc_nocow")). That is because when
we had the inode cache feature, inode caches were backed by a free space
inode that belonged to the inode's root.

However we don't have support for the inode cache feature since kernel
5.11, so we don't need this check anymore since free space inodes are
now always related to free space caches, which are always associated to
the root tree (which can't be snapshotted, and its last_snapshot field
is always 0).

So remove that condition.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:11 +02:00
Filipe Manana
619104ba45 btrfs: move common NOCOW checks against a file extent into a helper
Verifying if we can do a NOCOW write against a range fully or partially
covered by a file extent item requires verifying several constraints, and
these are currently duplicated at two different places: can_nocow_extent()
and run_delalloc_nocow().

This change moves those checks into a common helper function to avoid
duplication. It adds some comments and also preserves all existing
behaviour like for example can_nocow_extent() treating errors from the
calls to btrfs_cross_ref_exist() and csum_exist_in_range() as meaning
we can not NOCOW, instead of propagating the error back to the caller.
That specific behaviour is questionable but also reasonable to some
degree.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:11 +02:00
Sweet Tea Dorminy
dd137dd1f2 btrfs: factor out allocating an array of pages
Several functions currently populate an array of page pointers one
allocated page at a time. Factor out the common code so as to allow
improvements to all of the sites at once.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:11 +02:00
Yu Zhe
0d031dc4aa btrfs: remove unnecessary type casts
Explicit type casts are not necessary when it's void* to another pointer
type.

Signed-off-by: Yu Zhe <yuzhe@nfschina.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:11 +02:00
Qu Wenruo
fbca46eb46 btrfs: make nodesize >= PAGE_SIZE case to reuse the non-subpage routine
The reason why we only support 64K page size for subpage is, for 64K
page size we can ensure no matter what the nodesize is, we can fit it
into one page.

When other page size come, especially like 16K, the limitation is a bit
limiting.

To remove such limitation, we allow nodesize >= PAGE_SIZE case to go the
non-subpage routine.  By this, we can allow 4K sectorsize on 16K page
size.

Although this introduces another smaller limitation, the metadata can
not cross page boundary, which is already met by most recent mkfs.

Another small improvement is, we can avoid the overhead for metadata if
nodesize >= PAGE_SIZE.
For 4K sector size and 64K page size/node size, or 4K sector size and
16K page size/node size, we don't need to allocate extra memory for the
metadata pages.

Please note that, this patch will not yet enable other page size support
yet.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:11 +02:00
Qu Wenruo
b06660b595 btrfs: replace memset with memzero_page in data checksum verification
The original code resets the page to 0x1 for not apparent reason, it's
been like that since the initial 2007 code added in commit 07157aacb1
("Btrfs: Add file data csums back in via hooks in the extent map code").

It could mean that a failed buffer can be detected from the data but
that's just a guess and any value is good.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:10 +02:00
Filipe Manana
d4135134ab btrfs: avoid blocking on space revervation when doing nowait dio writes
When doing a NOWAIT direct IO write, if we can NOCOW then it means we can
proceed with the non-blocking, NOWAIT path. However reserving the metadata
space and qgroup meta space can often result in blocking - flushing
delalloc, wait for ordered extents to complete, trigger transaction
commits, etc, going against the semantics of a NOWAIT write.

So make the NOWAIT write path to try to reserve all the metadata it needs
without resulting in a blocking behaviour - if we get -ENOSPC or -EDQUOT
then return -EAGAIN to make the caller fallback to a blocking direct IO
write.

This is part of a patchset comprised of the following patches:

  btrfs: avoid blocking on page locks with nowait dio on compressed range
  btrfs: avoid blocking nowait dio when locking file range
  btrfs: avoid double nocow check when doing nowait dio writes
  btrfs: stop allocating a path when checking if cross reference exists
  btrfs: free path at can_nocow_extent() before checking for checksum items
  btrfs: release path earlier at can_nocow_extent()
  btrfs: avoid blocking when allocating context for nowait dio read/write
  btrfs: avoid blocking on space revervation when doing nowait dio writes

The following test was run before and after applying this patchset:

  $ cat io-uring-nodatacow-test.sh
  #!/bin/bash

  DEV=/dev/sdc
  MNT=/mnt/sdc

  MOUNT_OPTIONS="-o ssd -o nodatacow"
  MKFS_OPTIONS="-R free-space-tree -O no-holes"

  NUM_JOBS=4
  FILE_SIZE=8G
  RUN_TIME=300

  cat <<EOF > /tmp/fio-job.ini
  [io_uring_rw]
  rw=randrw
  fsync=0
  fallocate=posix
  group_reporting=1
  direct=1
  ioengine=io_uring
  iodepth=64
  bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5
  filesize=$FILE_SIZE
  runtime=$RUN_TIME
  time_based
  filename=foobar
  directory=$MNT
  numjobs=$NUM_JOBS
  thread
  EOF

  echo performance | \
     tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

  umount $MNT &> /dev/null
  mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null
  mount $MOUNT_OPTIONS $DEV $MNT

  fio /tmp/fio-job.ini

  umount $MNT

The test was run a 12 cores box with 64G of ram, using a non-debug kernel
config (Debian's default config) and a spinning disk.

Result before the patchset:

 READ: bw=407MiB/s (427MB/s), 407MiB/s-407MiB/s (427MB/s-427MB/s), io=119GiB (128GB), run=300175-300175msec
WRITE: bw=407MiB/s (427MB/s), 407MiB/s-407MiB/s (427MB/s-427MB/s), io=119GiB (128GB), run=300175-300175msec

Result after the patchset:

 READ: bw=436MiB/s (457MB/s), 436MiB/s-436MiB/s (457MB/s-457MB/s), io=128GiB (137GB), run=300044-300044msec
WRITE: bw=435MiB/s (456MB/s), 435MiB/s-435MiB/s (456MB/s-456MB/s), io=128GiB (137GB), run=300044-300044msec

That's about +7.2% throughput for reads and +6.9% for writes.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:10 +02:00
Filipe Manana
4f208dcc6b btrfs: avoid blocking when allocating context for nowait dio read/write
When doing a NOWAIT direct IO read/write, we allocate a context object
(struct btrfs_dio_data) with GFP_NOFS, which can result in blocking
waiting for memory allocation (GFP_NOFS is __GFP_RECLAIM | __GFP_IO).
This is undesirable for the NOWAIT semantics, so do the allocation with
GFP_NOWAIT if we are serving a NOWAIT request and if the allocation fails
return -EAGAIN, so that the caller can fallback to a blocking context and
retry with a non-blocking write.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:10 +02:00
Filipe Manana
59d35c5171 btrfs: release path earlier at can_nocow_extent()
At can_nocow_extent(), we are releasing the path only after checking if
the block group that has the target extent is read only, and after
checking if there's delalloc in the range in case our extent is a
preallocated extent. The read only extent check can be expensive if we
have a very large filesystem with many block groups, as well as the
check for delalloc in the inode's io_tree in case the io_tree is big
due to IO on other file ranges.

Our path is holding a read lock on a leaf and there's no need to keep
the lock while doing those two checks, so release the path before doing
them, immediately after the last use of the leaf.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:10 +02:00
Filipe Manana
c1a548db25 btrfs: free path at can_nocow_extent() before checking for checksum items
When we look for checksum items, through csum_exist_in_range(), at
can_nocow_extent(), we no longer need the path that we have previously
allocated. Through csum_exist_in_range() -> btrfs_lookup_csums_range(),
we also end up allocating a path, so we are adding unnecessary extra
memory usage. So free the path before calling csum_exist_in_range().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:10 +02:00
Filipe Manana
1a89f17386 btrfs: stop allocating a path when checking if cross reference exists
At btrfs_cross_ref_exist() we always allocate a path, but we really don't
need to because all its callers (only 2) already have an allocated path
that is not being used when they call btrfs_cross_ref_exist(). So change
btrfs_cross_ref_exist() to take a path as an argument and update both
its callers to pass in the unused path they have when they call
btrfs_cross_ref_exist().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:10 +02:00
Filipe Manana
d7a8ab4e9b btrfs: avoid double nocow check when doing nowait dio writes
When doing a NOWAIT direct IO write we are checking twice if we can COW
into the target file range using can_nocow_extent() - once at the very
beginning of the write path, at btrfs_write_check() via
check_nocow_nolock(), and later again at btrfs_get_blocks_direct_write().

The can_nocow_extent() function does a lot of expensive things - searching
for the file extent item in the inode's subvolume tree, searching for the
extent item in the extent tree, checking delayed references, etc, so it
isn't a very cheap call.

We can remove the first check at btrfs_write_check(), and add there a
quick check to verify if the inode has the NODATACOW or PREALLOC flags,
and quickly bail out if it doesn't have neither of those flags, as that
means we have to COW and therefore can't comply with the NOWAIT semantics.

After this we do only one call to can_nocow_extent(), while we are at
btrfs_get_blocks_direct_write(), where we have already locked the file
range and we did a try lock on the range before, at
btrfs_dio_iomap_begin() (since the previous patch in the series).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:10 +02:00
Filipe Manana
5909440344 btrfs: avoid blocking nowait dio when locking file range
If we are doing a NOWAIT direct IO read/write, we can block when locking
the file range at btrfs_dio_iomap_begin(), as it's possible the range (or
a part of it) is already locked by another task (mmap writes, another
direct IO read/write racing with us, fiemap, etc). We are also waiting for
completion of any ordered extent we find in the range, which also can
block us for a significant amount of time.

There's also the incorrect fallback to buffered IO (returning -ENOTBLK)
when we are dealing with a NOWAIT request and we can't proceed. In this
case we should be returning -EAGAIN, as falling back to buffered IO can
result in blocking for many different reasons, so that the caller can
delegate a retry to a context where blocking is more acceptable.

Fix these cases by:

1) Doing a try lock on the file range and failing with -EAGAIN if we
   can not lock right away;

2) Fail with -EAGAIN if we find an ordered extent;

3) Return -EAGAIN instead of -ENOTBLK when we need to fallback to
   buffered IO and we have a NOWAIT request.

This will also allow us to avoid a duplicated check that verifies if we
are able to do a NOCOW write for NOWAIT direct IO writes, done in the
next patch.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:09 +02:00
Filipe Manana
b023e67512 btrfs: avoid blocking on page locks with nowait dio on compressed range
If we are doing NOWAIT direct IO read/write and our inode has compressed
extents, we call filemap_fdatawrite_range() against the range in order
to wait for compressed writeback to complete, since the generic code at
iomap_dio_rw() calls filemap_write_and_wait_range() once, which is not
enough to wait for compressed writeback to complete.

This call to filemap_fdatawrite_range() can block on page locks, since
the first writepages() on a range that we will try to compress results
only in queuing a work to compress the data while holding the pages
locked.

Even though the generic code at iomap_dio_rw() will do the right thing
and return -EAGAIN for NOWAIT requests in case there are pages in the
range, we can still end up at btrfs_dio_iomap_begin() with pages in the
range because either of the following can happen:

1) Memory mapped writes, as we haven't locked the range yet;

2) Buffered reads might have started, which lock the pages, and we do
   the filemap_fdatawrite_range() call before locking the file range.

So don't call filemap_fdatawrite_range() at btrfs_dio_iomap_begin() if we
are doing a NOWAIT read/write. Instead call filemap_range_needs_writeback()
to check if there are any locked, dirty, or under writeback pages, and
return -EAGAIN if that's the case.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:09 +02:00
Filipe Manana
63c34cb4c6 btrfs: add and use helper to assert an inode range is clean
We have four different scenarios where we don't expect to find ordered
extents after locking a file range:

1) During plain fallocate;
2) During hole punching;
3) During zero range;
4) During reflinks (both cloning and deduplication).

This is because in all these cases we follow the pattern:

1) Lock the inode's VFS lock in exclusive mode;

2) Lock the inode's i_mmap_lock in exclusive node, to serialize with
   mmap writes;

3) Flush delalloc in a file range and wait for all ordered extents
   to complete - both done through btrfs_wait_ordered_range();

4) Lock the file range in the inode's io_tree.

So add a helper that asserts that we don't have ordered extents for a
given range. Make the four scenarios listed above use this helper after
locking the respective file range.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:09 +02:00
Sweet Tea Dorminy
6c3636ebe3 btrfs: restore inode creation before xattr setting
According to the tree checker, "all xattrs with a given objectid follow
the inode with that objectid in the tree" is an invariant. This was
broken by the recent change "btrfs: move common inode creation code into
btrfs_create_new_inode()", which moved acl creation and property
inheritance (stored in xattrs) to before inode insertion into the tree.
As a result, under certain timings, the xattrs could be written to the
tree before the inode, causing the tree checker to report violation of
the invariant.

Move property inheritance and acl creation back to their old ordering
after the inode insertion.

Suggested-by: Omar Sandoval <osandov@osandov.com>
Reported-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:09 +02:00
Omar Sandoval
caae78e032 btrfs: move common inode creation code into btrfs_create_new_inode()
All of our inode creation code paths duplicate the calls to
btrfs_init_inode_security() and btrfs_add_link(). Subvolume creation
additionally duplicates property inheritance and the call to
btrfs_set_inode_index(). Fix this by moving the common code into
btrfs_create_new_inode(). This accomplishes a few things at once:

1. It reduces code duplication.

2. It allows us to set up the inode completely before inserting the
   inode item, removing calls to btrfs_update_inode().

3. It fixes a leak of an inode on disk in some error cases. For example,
   in btrfs_create(), if btrfs_new_inode() succeeds, then we have
   inserted an inode item and its inode ref. However, if something after
   that fails (e.g., btrfs_init_inode_security()), then we end the
   transaction and then decrement the link count on the inode. If the
   transaction is committed and the system crashes before the failed
   inode is deleted, then we leak that inode on disk. Instead, this
   refactoring aborts the transaction when we can't recover more
   gracefully.

4. It exposes various ways that subvolume creation diverges from mkdir
   in terms of inheriting flags, properties, permissions, and POSIX
   ACLs, a lot of which appears to be accidental. This patch explicitly
   does _not_ change the existing non-standard behavior, but it makes
   those differences more clear in the code and documents them so that
   we can discuss whether they should be changed.

Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:08 +02:00
Omar Sandoval
3538d68dbd btrfs: reserve correct number of items for inode creation
The various inode creation code paths do not account for the compression
property, POSIX ACLs, or the parent inode item when starting a
transaction. Fix it by refactoring all of these code paths to use a new
function, btrfs_new_inode_prepare(), which computes the correct number
of items. To do so, it needs to know whether POSIX ACLs will be created,
so move the ACL creation into that function. To reduce the number of
arguments that need to be passed around for inode creation, define
struct btrfs_new_inode_args containing all of the relevant information.

btrfs_new_inode_prepare() will also be a good place to set up the
fscrypt context and encrypted filename in the future.

Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:08 +02:00
Omar Sandoval
5f465bf1f1 btrfs: factor out common part of btrfs_{mknod,create,mkdir}()
btrfs_{mknod,create,mkdir}() are now identical other than the inode
initialization and some inconsequential function call order differences.
Factor out the common code to reduce code duplication.

Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:08 +02:00
Omar Sandoval
a1fd0c35ff btrfs: allocate inode outside of btrfs_new_inode()
Instead of calling new_inode() and inode_init_owner() inside of
btrfs_new_inode(), do it in the callers. This allows us to pass in just
the inode instead of the mnt_userns and mode and removes the need for
memalloc_nofs_{save,restores}() since we do it before starting a
transaction. In create_subvol(), it also means we no longer have to look
up the inode again to instantiate it. This also paves the way for some
more cleanups in later patches.

This also removes the comments about Smack checking i_op, which are no
longer true since commit 5d6c31910b ("xattr: Add
__vfs_{get,set,remove}xattr helpers"). Now it checks inode->i_opflags &
IOP_XATTR, which is set based on sb->s_xattr.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:08 +02:00
Gabriel Niebler
a8ce68fd04 btrfs: use btrfs_for_each_slot in btrfs_real_readdir
This function can be simplified by refactoring to use the new iterator
macro.  No functional changes.

Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:07 +02:00
Omar Sandoval
305eaac009 btrfs: set inode flags earlier in btrfs_new_inode()
btrfs_new_inode() inherits the inode flags from the parent directory and
the mount options _after_ we fill the inode item. This works because all
of the callers of btrfs_new_inode() make further changes to the inode
and then call btrfs_update_inode(). It'd be better to fully initialize
the inode once to avoid the extra update, so as a first step, set the
inode flags _before_ filling the inode item.

Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:06 +02:00
Omar Sandoval
6437d45835 btrfs: move btrfs_get_free_objectid() call into btrfs_new_inode()
Every call of btrfs_new_inode() is immediately preceded by a call to
btrfs_get_free_objectid(). Since getting an inode number is part of
creating a new inode, this is better off being moved into
btrfs_new_inode(). While we're here, get rid of the comment about
reclaiming inode numbers, since we only did that when using the ino
cache, which was removed by commit 5297199a8b ("btrfs: remove inode
number cache feature").

Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:06 +02:00
Omar Sandoval
23c24ef8e4 btrfs: don't pass parent objectid to btrfs_new_inode() explicitly
For everything other than a subvolume root inode, we get the parent
objectid from the parent directory. For the subvolume root inode, the
parent objectid is the same as the inode's objectid. We can find this
within btrfs_new_inode() instead of passing it.

Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:06 +02:00
Omar Sandoval
c51fa51190 btrfs: remove unnecessary set_nlink() in btrfs_create_subvol_root()
btrfs_new_inode() already returns an inode with nlink set to 1 (via
inode_init_always()). Get rid of the unnecessary set.

Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:06 +02:00
Omar Sandoval
6d831f7ef9 btrfs: remove unnecessary inode_set_bytes(0) call
new_inode() always returns an inode with i_blocks and i_bytes set to 0
(via inode_init_always()). Remove the unnecessary call to
inode_set_bytes() in btrfs_new_inode().

Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:06 +02:00
Omar Sandoval
9124e15f27 btrfs: remove unnecessary btrfs_i_size_write(0) calls
btrfs_new_inode() always returns an inode with i_size and disk_i_size
set to 0 (via inode_init_always() and btrfs_alloc_inode(),
respectively). Remove the unnecessary calls to btrfs_i_size_write() in
btrfs_mkdir() and btrfs_create_subvol_root().

Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:06 +02:00
Omar Sandoval
81512e89f2 btrfs: get rid of btrfs_add_nondir()
This is a trivial wrapper around btrfs_add_link(). The only thing it
does other than moving arguments around is translating a > 0 return
value to -EEXIST. As far as I can tell, btrfs_add_link() won't return >
0 (and if it did, the existing callsites in, e.g., btrfs_mkdir() would
be broken). The check itself dates back to commit 2c90e5d658 ("Btrfs:
still corruption hunting"), so it's probably left over from debugging.
Let's just get rid of btrfs_add_nondir().

Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:06 +02:00
Omar Sandoval
c162187143 btrfs: reserve correct number of items for rename
btrfs_rename() and btrfs_rename_exchange() don't account for enough
items. Replace the incorrect explanations with a specific breakdown of
the number of items and account them accurately.

Note that this glosses over RENAME_WHITEOUT because the next commit is
going to rework that, too.

Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:06 +02:00
Omar Sandoval
bca4ad7c0b btrfs: reserve correct number of items for unlink and rmdir
__btrfs_unlink_inode() calls btrfs_update_inode() on the parent
directory in order to update its size and sequence number. Make sure we
account for it.

Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:05 +02:00
Linus Torvalds
9050ba3a61 for-5.18-rc5-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmJwBoQACgkQxWXV+ddt
 WDu/FQ/+L2LpN5Zu1NkjOAh2Lcvz5RYZjcVext4bbPUW2yhXYH3e6836R/feLOCG
 RxRICOAQhJ7I6ct/N1aToI2AbjWnMSJK+IgageA1UdIS8McbcSP/qJOYwJ/+2Xhl
 AvK5psoj+qwGbTI9e0luNe6b+UWTGIMyYRjmN0SlkBOdg9/xqQpBMQxfKJumMvEc
 3ZwLpcNjhUUwdFKHvHZNCOQhZiZwloKFeq9MLaEL5LO30wKSY6ShiCA3pafFoVN0
 mvEEVtIGgZUsgeQTzSzD8UhGDvZtZ1+aaX0dcNMRzwI2h2pkBmPkd5QtFM9Qs0xP
 hGibSN9bC/SzQyE9v4cKohwS+g4dE4r+dUWFNpdZLIOpBt5PJBDA0tjcjxquFtMr
 6JX77coAl9kt0jspBmHVPb3qmIc1Xo3Iw2PrVgTK14QUo46XwF5Rga68wKOfNt0u
 LbD9+KCLnwxoOhvXh/LJX6nvPT8tuMrT/5AOXULI2oMCnpCY6Vl+kX8zFlAfbDk1
 d4/jy42bgHCso60j2vAcdQmZB+/snpboOhKJwkE2FqOTs+hBR8PBln6BqEt5xkHZ
 q2mBfYDujnmZWaiAU6+ETjOcCooWQioi3335tp/C9TdOrIkFDij1ztYRxhgP0g0X
 w6d6wlbkcol1Zubb/zEiBiwe+6GR/KtNYc4PBEfVe5Vw0npFreU=
 =7k9f
 -----END PGP SIGNATURE-----

Merge tag 'for-5.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A few more fixes mostly around how some file attributes could be set.

   - fix handling of compression property:
      - don't allow setting it on anything else than regular file or
        directory
      - do not allow setting it on nodatacow files via properties

   - improved error handling when setting xattr

   - make sure symlinks are always properly logged"

* tag 'for-5.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: skip compression property for anything other than files and dirs
  btrfs: do not BUG_ON() on failure to update inode when setting xattr
  btrfs: always log symlinks in full mode
  btrfs: do not allow compression on nodatacow files
  btrfs: export a helper for compression hard check
2022-05-02 10:09:02 -07:00
Chung-Chiang Cheng
e6f9d69648 btrfs: export a helper for compression hard check
inode_can_compress will be used outside of inode.c to check the
availability of setting compression flag by xattr. This patch moves
this function as an internal helper and renames it to
btrfs_inode_can_compress.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Chung-Chiang Cheng <cccheng@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-27 22:15:40 +02:00
Linus Torvalds
fd574a2f84 for-5.18-rc4-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmJnGGIACgkQxWXV+ddt
 WDumDw//cE1NcawdnVkEaKr20PetHfzPyFSIIr17nedtnVvWYyOFF/0uJlHNhv8Z
 CZIfJ7fmH/pO5oWPXN84wKNfumDWNwc36QrvoXC67TrKUSiBN8BzL83HvAjGwYFH
 G+LfZXGnVbqq8F1iYkIsuH0Oo1x/N/LPM3s6iZy3O4l8s96u+J4GRnc8Tr0AH4MA
 zgz3fab8Ec378HTG9fvdAQNLxFEe0VatD6WrzILnmM8UgeQK7g73dqH9Ni9gz2DW
 2GDlO6aevQ1G6dm2AJ0ItExnbHH7TfOThkG56Gdqrzb/d39GzrVpeob7QiorETus
 EWS1rXaeikUiD4Bzt/RszUNL80yMN1DjcN3QBkiDf3ShSDFteoHMPw3e6jcQCy1m
 Dxf5oditQqltuFNLeSiVbZEMw2kXqBP7RoPiirF9rdvrDNLHhAE9wu0kpSGSSvT7
 Tyu9JyLw2axU6wGTi1GHAXurlW2ItRRyFAewWWul1lLkuz/6YXI4F/EHm3Mbh6Nh
 pMIFMNr4Oafdx+3Ful8ZA4PynirXub/xVDefcFBibz/PTGEnHG4ZVzRudmVnowh7
 GP2pql1+Y/TFkXdD98V8GWD+E10JAmNCkQSoiggJooNWR28whukmDVX/HY8lGmWg
 DjxwGkte3SltUBWNOTGnO7546hMwOxOPZHENPh+gffYkeMeIxPI=
 =xDWz
 -----END PGP SIGNATURE-----

Merge tag 'for-5.18-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - direct IO fixes:

      - restore passing file offset to correctly calculate checksums
        when repairing on read and bio split happens

      - use correct bio when sumitting IO on zoned filesystem

 - zoned mode fixes:

      - fix selection of device to correctly calculate device
        capabilities when allocating a new bio

      - use a dedicated lock for exclusion during relocation

      - fix leaked plug after failure syncing log

 - fix assertion during scrub and relocation

* tag 'for-5.18-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: zoned: use dedicated lock for data relocation
  btrfs: fix assertion failure during scrub due to block group reallocation
  btrfs: fix direct I/O writes for split bios on zoned devices
  btrfs: fix direct I/O read repair for split bios
  btrfs: fix and document the zoned device choice in alloc_new_bio
  btrfs: fix leaked plug after failure syncing log on zoned filesystems
2022-04-26 11:10:42 -07:00
Christoph Hellwig
0fdf977d45 btrfs: fix direct I/O writes for split bios on zoned devices
When a bio is split in btrfs_submit_direct, dip->file_offset contains
the file offset for the first bio.  But this means the start value used
in btrfs_end_dio_bio to record the write location for zone devices is
incorrect for subsequent bios.

CC: stable@vger.kernel.org # 5.16+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-19 15:45:04 +02:00
Christoph Hellwig
00d825258b btrfs: fix direct I/O read repair for split bios
When a bio is split in btrfs_submit_direct, dip->file_offset contains
the file offset for the first bio.  But this means the start value used
in btrfs_check_read_dio_bio is incorrect for subsequent bios.  Add
a file_offset field to struct btrfs_bio to pass along the correct offset.

Given that check_data_csum only uses start of an error message this
means problems with this miscalculation will only show up when I/O fails
or checksums mismatch.

The logic was removed in f4f39fc5dc ("btrfs: remove btrfs_bio::logical
member") but we need it due to the bio splitting.

CC: stable@vger.kernel.org # 5.16+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-19 15:44:56 +02:00
Linus Torvalds
722985e2f6 for-5.18-rc2-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmJUfvwACgkQxWXV+ddt
 WDtiuA//csj0CrJq7wyRvgDkbPtCCMyDtL4zfmjP4s++NWaMDOKTxBU8msuGUJgR
 Xribel2zWqlFiOzptd9sGEfxfKfz1p5Rm/gtFj57WVSGV7YtiAGyFGuzn/vpCrgq
 NP5LFY2z5N36VxDXUPKHzvdqczO5W2n9KdaysJCr6FpCz+vVrplFiT5U+X175Sgg
 zWS/XrPIHYbtaEFdb3rUKol6riu7vXW3MxEA9di8K4Xo9TJrp0NtBoGZDsWFQxf7
 vfVwtYsQPsACJxw+MjBcVQ5fNXO5iATL1JfBb9ltN589xouja7bDCb40Fm2gEwWB
 IUatnCq/4MN2S2NdPtEcXQ52W9svT/87z6ZblefSEiQqvQBBJHN131RTM/s8LBG4
 fkHcGV6PsiutSIFycrID0bpXr1Mhvg2aMjxdriLPBtYopaMPh+ivK6LPYoE5MggQ
 /rshWfjiWJhPKXPsng+H7UGbViOiYeG0kUBuaqFx4ARnESpN1gF2dJRYvXYFL/8a
 Q0wmLr1tf3M82VAaAFNOl/BVk8blutCSHcJDLDKxcl3DhVVlY5J8onEPXjneJRkk
 rRB3zoxLlptgfW75CPNyFrpicPdAXzGCoccienIUoEdHSX/W5rNA4L6XpVE5Hv94
 dWR1aVAbCUcuhY1QfBU7H2iYw0RHzqDO3+IRfqJuYsLGciijLbs=
 =APMY
 -----END PGP SIGNATURE-----

Merge tag 'for-5.18-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A few more code and warning fixes.

  There's one feature ioctl removal patch slated for 5.18 that did not
  make it to the main pull request. It's just a one-liner and the ioctl
  has a v2 that's in use for a long time, no point to postpone it to
  5.19.

  Late update:

   - remove balance v1 ioctl, superseded by v2 in 2012

  Fixes:

   - add back cgroup attribution for compressed writes

   - add super block write start/end annotations to asynchronous balance

   - fix root reference count on an error handling path

   - in zoned mode, activate zone at the chunk allocation time to avoid
     ENOSPC due to timing issues

   - fix delayed allocation accounting for direct IO

  Warning fixes:

   - simplify assertion condition in zoned check

   - remove an unused variable"

* tag 'for-5.18-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix btrfs_submit_compressed_write cgroup attribution
  btrfs: fix root ref counts in error handling in btrfs_get_root_ref
  btrfs: zoned: activate block group only for extent allocation
  btrfs: return allocated block group from do_chunk_alloc()
  btrfs: mark resumed async balance as writing
  btrfs: remove support of balance v1 ioctl
  btrfs: release correct delalloc amount in direct IO write path
  btrfs: remove unused variable in btrfs_{start,write}_dirty_block_groups()
  btrfs: zoned: remove redundant condition in btrfs_run_delalloc_range
2022-04-14 10:58:27 -07:00