Commit Graph

949 Commits

Author SHA1 Message Date
Filipe Manana
7f458a3873 btrfs: fix race when defragmenting leads to unnecessary IO
When defragmenting we skip ranges that have holes or inline extents, so that
we don't do unnecessary IO and waste space. We do this check when calling
should_defrag_range() at btrfs_defrag_file(). However we do it without
holding the inode's lock. The reason we do it like this is to avoid
blocking other tasks for too long, that possibly want to operate on other
file ranges, since after the call to should_defrag_range() and before
locking the inode, we trigger a synchronous page cache readahead. However
before we were able to lock the inode, some other task might have punched
a hole in our range, or we may now have an inline extent there, in which
case we should not set the range for defrag anymore since that would cause
unnecessary IO and make us waste space (i.e. allocating extents to contain
zeros for a hole).

So after we locked the inode and the range in the iotree, check again if
we have holes or an inline extent, and if we do, just skip the range.

I hit this while testing my next patch that fixes races when updating an
inode's number of bytes (subject "btrfs: update the number of bytes used
by an inode atomically"), and it depends on this change in order to work
correctly. Alternatively I could rework that other patch to detect holes
and flag their range with the 'new delalloc' bit, but this itself fixes
an efficiency problem due a race that from a functional point of view is
not harmful (it could be triggered with btrfs/062 from fstests).

CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:08 +01:00
Josef Bacik
b9729ce014 btrfs: locking: rip out path->leave_spinning
We no longer distinguish between blocking and spinning, so rip out all
this code.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:02 +01:00
Matthew Wilcox (Oracle)
a1fbc6750e btrfs: fix potential overflow in cluster_pages_for_defrag on 32bit arch
On 32-bit systems, this shift will overflow for files larger than 4GB as
start_index is unsigned long while the calls to btrfs_delalloc_*_space
expect u64.

CC: stable@vger.kernel.org # 4.4+
Fixes: df480633b8 ("btrfs: extent-tree: Switch to new delalloc space reserve and release")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Sterba <dsterba@suse.com>
[ define the variable instead of repeating the shift ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-05 13:01:42 +01:00
Nikolay Borisov
c0a4360305 btrfs: remove inode argument from btrfs_start_ordered_extent
The passed in ordered_extent struct is always well-formed and contains
the inode making the explicit argument redundant.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:22 +02:00
Josef Bacik
7280490500 btrfs: kill the RCU protection for fs_info->space_info
We have this thing wrapped in an RCU lock, but it's really not needed.
We create all the space_info's on mount, and we destroy them on unmount.
The list never changes and we're protected from messing with it by the
normal mount/umount path, so kill the RCU stuff around it.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:19 +02:00
Goldwyn Rodrigues
66a2823c54 btrfs: sysfs: export currently running exclusive operation
/sys/fs/<fsid>/exclusive_operation contains the currently executing
exclusive operation. Add a sysfs_notify() when operation end, so
userspace can be notified of exclusive operation is finished.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:20 +02:00
Goldwyn Rodrigues
c3e1f96c37 btrfs: enumerate the type of exclusive operation in progress
Instead of using a flag bit for exclusive operation, use a variable to
store which exclusive operation is being performed.  Introduce an API
to start and finish an exclusive operation.

This would enable another way for tools to check which operation is
running on why starting an exclusive operation failed. The followup
patch adds a sysfs_notify() to alert userspace when the state changes, so
userspace can perform select() on it to get notified of the change.

This would enable us to enqueue a command which will wait for current
exclusive operation to complete before issuing the next exclusive
operation. This has been done synchronously as opposed to a background
process, or else error collection (if any) will become difficult.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:20 +02:00
Josef Bacik
9631e4cc1a btrfs: introduce BTRFS_NESTING_COW for cow'ing blocks
When we COW a block we are holding a lock on the original block, and
then we lock the new COW block.  Because our lockdep maps are based on
root + level, this will make lockdep complain.  We need a way to
indicate a subclass for locking the COW'ed block, so plumb through our
btrfs_lock_nesting from btrfs_cow_block down to the btrfs_init_buffer,
and then introduce BTRFS_NESTING_COW to be used for cow'ing blocks.

The reason I've added all this extra infrastructure is because there
will be need of different nesting classes in follow up patches.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:16 +02:00
Qu Wenruo
e85fde5162 btrfs: qgroup: fix qgroup meta rsv leak for subvolume operations
[BUG]
When quota is enabled for TEST_DEV, generic/013 sometimes fails like this:

  generic/013 14s ... _check_dmesg: something found in dmesg (see xfstests-dev/results//generic/013.dmesg)

And with the following metadata leak:

  BTRFS warning (device dm-3): qgroup 0/1370 has unreleased space, type 2 rsv 49152
  ------------[ cut here ]------------
  WARNING: CPU: 2 PID: 47912 at fs/btrfs/disk-io.c:4078 close_ctree+0x1dc/0x323 [btrfs]
  Call Trace:
   btrfs_put_super+0x15/0x17 [btrfs]
   generic_shutdown_super+0x72/0x110
   kill_anon_super+0x18/0x30
   btrfs_kill_super+0x17/0x30 [btrfs]
   deactivate_locked_super+0x3b/0xa0
   deactivate_super+0x40/0x50
   cleanup_mnt+0x135/0x190
   __cleanup_mnt+0x12/0x20
   task_work_run+0x64/0xb0
   __prepare_exit_to_usermode+0x1bc/0x1c0
   __syscall_return_slowpath+0x47/0x230
   do_syscall_64+0x64/0xb0
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  ---[ end trace a6cfd45ba80e4e06 ]---
  BTRFS error (device dm-3): qgroup reserved space leaked
  BTRFS info (device dm-3): disk space caching is enabled
  BTRFS info (device dm-3): has skinny extents

[CAUSE]
The qgroup preallocated meta rsv operations of that offending root are:

  btrfs_delayed_inode_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=131072
  btrfs_delayed_inode_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=131072
  btrfs_subvolume_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=49152
  btrfs_delayed_inode_release_metadata: convert_meta_prealloc root=1370 num_bytes=-131072
  btrfs_delayed_inode_release_metadata: convert_meta_prealloc root=1370 num_bytes=-131072

It's pretty obvious that, we reserve qgroup meta rsv in
btrfs_subvolume_reserve_metadata(), but doesn't have corresponding
release/convert calls in btrfs_subvolume_release_metadata().

This leads to the leakage.

[FIX]
To fix this bug, we should follow what we're doing in
btrfs_delalloc_reserve_metadata(), where we reserve qgroup space, and
add it to block_rsv->qgroup_rsv_reserved.

And free the qgroup reserved metadata space when releasing the
block_rsv.

To do this, we need to change the btrfs_subvolume_release_metadata() to
accept btrfs_root, and record the qgroup_to_release number, and call
btrfs_qgroup_convert_reserved_meta() for it.

Fixes: 733e03a0b2 ("btrfs: qgroup: Split meta rsv type into meta_prealloc and meta_pertrans")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:13 +02:00
Josef Bacik
b49121393f btrfs: change nr to u64 in btrfs_start_delalloc_roots
We have btrfs_wait_ordered_roots() which takes a u64 for nr, but
btrfs_start_delalloc_roots() that takes an int for nr, which makes using
them in conjunction, especially for something like (u64)-1, annoying and
inconsistent.  Fix btrfs_start_delalloc_roots() to take a u64 for nr and
adjust start_delalloc_inodes() and it's callers appropriately.

This means we've adjusted start_delalloc_inodes() to take a pointer of
nr since we want to preserve the ability for start-delalloc_inodes() to
return an error, so simply make it do the nr adjusting as necessary.

Part of adjusting the callers to this means changing
btrfs_writeback_inodes_sb_nr() to take a u64 for items.  This may be
confusing because it seems unrelated, but the caller of
btrfs_writeback_inodes_sb_nr() already passes in a u64, it's just the
function variable that needs to be changed.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:50 +02:00
Filipe Manana
1c78544eaa btrfs: fix wrong address when faulting in pages in the search ioctl
When faulting in the pages for the user supplied buffer for the search
ioctl, we are passing only the base address of the buffer to the function
fault_in_pages_writeable(). This means that after the first iteration of
the while loop that searches for leaves, when we have a non-zero offset,
stored in 'sk_offset', we try to fault in a wrong page range.

So fix this by adding the offset in 'sk_offset' to the base address of the
user supplied buffer when calling fault_in_pages_writeable().

Several users have reported that the applications compsize and bees have
started to operate incorrectly since commit a48b73eca4 ("btrfs: fix
potential deadlock in the search ioctl") was added to stable trees, and
these applications make heavy use of the search ioctls. This fixes their
issues.

Link: https://lore.kernel.org/linux-btrfs/632b888d-a3c3-b085-cdf5-f9bb61017d92@lechevalier.se/
Link: https://github.com/kilobyte/compsize/issues/34
Fixes: a48b73eca4 ("btrfs: fix potential deadlock in the search ioctl")
CC: stable@vger.kernel.org # 4.4+
Tested-by: A L <mail@lechevalier.se>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-09-14 17:27:16 +02:00
Josef Bacik
a48b73eca4 btrfs: fix potential deadlock in the search ioctl
With the conversion of the tree locks to rwsem I got the following
lockdep splat:

  ======================================================
  WARNING: possible circular locking dependency detected
  5.8.0-rc7-00165-g04ec4da5f45f-dirty #922 Not tainted
  ------------------------------------------------------
  compsize/11122 is trying to acquire lock:
  ffff889fabca8768 (&mm->mmap_lock#2){++++}-{3:3}, at: __might_fault+0x3e/0x90

  but task is already holding lock:
  ffff889fe720fe40 (btrfs-fs-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #2 (btrfs-fs-00){++++}-{3:3}:
	 down_write_nested+0x3b/0x70
	 __btrfs_tree_lock+0x24/0x120
	 btrfs_search_slot+0x756/0x990
	 btrfs_lookup_inode+0x3a/0xb4
	 __btrfs_update_delayed_inode+0x93/0x270
	 btrfs_async_run_delayed_root+0x168/0x230
	 btrfs_work_helper+0xd4/0x570
	 process_one_work+0x2ad/0x5f0
	 worker_thread+0x3a/0x3d0
	 kthread+0x133/0x150
	 ret_from_fork+0x1f/0x30

  -> #1 (&delayed_node->mutex){+.+.}-{3:3}:
	 __mutex_lock+0x9f/0x930
	 btrfs_delayed_update_inode+0x50/0x440
	 btrfs_update_inode+0x8a/0xf0
	 btrfs_dirty_inode+0x5b/0xd0
	 touch_atime+0xa1/0xd0
	 btrfs_file_mmap+0x3f/0x60
	 mmap_region+0x3a4/0x640
	 do_mmap+0x376/0x580
	 vm_mmap_pgoff+0xd5/0x120
	 ksys_mmap_pgoff+0x193/0x230
	 do_syscall_64+0x50/0x90
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #0 (&mm->mmap_lock#2){++++}-{3:3}:
	 __lock_acquire+0x1272/0x2310
	 lock_acquire+0x9e/0x360
	 __might_fault+0x68/0x90
	 _copy_to_user+0x1e/0x80
	 copy_to_sk.isra.32+0x121/0x300
	 search_ioctl+0x106/0x200
	 btrfs_ioctl_tree_search_v2+0x7b/0xf0
	 btrfs_ioctl+0x106f/0x30a0
	 ksys_ioctl+0x83/0xc0
	 __x64_sys_ioctl+0x16/0x20
	 do_syscall_64+0x50/0x90
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  other info that might help us debug this:

  Chain exists of:
    &mm->mmap_lock#2 --> &delayed_node->mutex --> btrfs-fs-00

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(btrfs-fs-00);
				 lock(&delayed_node->mutex);
				 lock(btrfs-fs-00);
    lock(&mm->mmap_lock#2);

   *** DEADLOCK ***

  1 lock held by compsize/11122:
   #0: ffff889fe720fe40 (btrfs-fs-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180

  stack backtrace:
  CPU: 17 PID: 11122 Comm: compsize Kdump: loaded Not tainted 5.8.0-rc7-00165-g04ec4da5f45f-dirty #922
  Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
  Call Trace:
   dump_stack+0x78/0xa0
   check_noncircular+0x165/0x180
   __lock_acquire+0x1272/0x2310
   lock_acquire+0x9e/0x360
   ? __might_fault+0x3e/0x90
   ? find_held_lock+0x72/0x90
   __might_fault+0x68/0x90
   ? __might_fault+0x3e/0x90
   _copy_to_user+0x1e/0x80
   copy_to_sk.isra.32+0x121/0x300
   ? btrfs_search_forward+0x2a6/0x360
   search_ioctl+0x106/0x200
   btrfs_ioctl_tree_search_v2+0x7b/0xf0
   btrfs_ioctl+0x106f/0x30a0
   ? __do_sys_newfstat+0x5a/0x70
   ? ksys_ioctl+0x83/0xc0
   ksys_ioctl+0x83/0xc0
   __x64_sys_ioctl+0x16/0x20
   do_syscall_64+0x50/0x90
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

The problem is we're doing a copy_to_user() while holding tree locks,
which can deadlock if we have to do a page fault for the copy_to_user().
This exists even without my locking changes, so it needs to be fixed.
Rework the search ioctl to do the pre-fault and then
copy_to_user_nofault for the copying.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-27 13:56:27 +02:00
David Sterba
f37c563bab btrfs: add missing check for nocow and compression inode flags
User Forza reported on IRC that some invalid combinations of file
attributes are accepted by chattr.

The NODATACOW and compression file flags/attributes are mutually
exclusive, but they could be set by 'chattr +c +C' on an empty file. The
nodatacow will be in effect because it's checked first in
btrfs_run_delalloc_range.

Extend the flag validation to catch the following cases:

  - input flags are conflicting
  - old and new flags are conflicting
  - initialize the local variable with inode flags after inode ls locked

Inode attributes take precedence over mount options and are an
independent setting.

Nocompress would be a no-op with nodatacow, but we don't want to mix
any compression-related options with nodatacow.

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:44 +02:00
Johannes Thumshirn
49bac89768 btrfs: add metadata_uuid to FS_INFO ioctl
Add retrieval of the filesystem's metadata UUID to the fsinfo ioctl.
This is driven by setting the BTRFS_FS_INFO_FLAG_METADATA_UUID flag in
btrfs_ioctl_fs_info_args::flags.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:43 +02:00
Johannes Thumshirn
0fb408a558 btrfs: add filesystem generation to FS_INFO ioctl
Add retrieval of the filesystem's generation to the fsinfo ioctl. This is
driven by setting the BTRFS_FS_INFO_FLAG_GENERATION flag in
btrfs_ioctl_fs_info_args::flags.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:43 +02:00
Johannes Thumshirn
137c541821 btrfs: pass checksum type via BTRFS_IOC_FS_INFO ioctl
With the recent addition of filesystem checksum types other than CRC32c,
it is not anymore hard-coded which checksum type a btrfs filesystem uses.

Up to now there is no good way to read the filesystem checksum, apart from
reading the filesystem UUID and then query sysfs for the checksum type.

Add a new csum_type and csum_size fields to the BTRFS_IOC_FS_INFO ioctl
command which usually is used to query filesystem features. Also add a
flags member indicating that the kernel responded with a set csum_type and
csum_size field.

For compatibility reasons, only return the csum_type and csum_size if
the BTRFS_FS_INFO_FLAG_CSUM_INFO flag was passed to the kernel. Also
clear any unknown flags so we don't pass false positives to user-space
newer than the kernel.

To simplify further additions to the ioctl, also switch the padding to a
u8 array. Pahole was used to verify the result of this switch:

The csum members are added before flags, which might look odd, but this
is to keep the alignment requirements and not to introduce holes in the
structure.

  $ pahole -C btrfs_ioctl_fs_info_args fs/btrfs/btrfs.ko
  struct btrfs_ioctl_fs_info_args {
	  __u64                      max_id;               /*     0     8 */
	  __u64                      num_devices;          /*     8     8 */
	  __u8                       fsid[16];             /*    16    16 */
	  __u32                      nodesize;             /*    32     4 */
	  __u32                      sectorsize;           /*    36     4 */
	  __u32                      clone_alignment;      /*    40     4 */
	  __u16                      csum_type;            /*    44     2 */
	  __u16                      csum_size;            /*    46     2 */
	  __u64                      flags;                /*    48     8 */
	  __u8                       reserved[968];        /*    56   968 */

	  /* size: 1024, cachelines: 16, members: 10 */
  };

Fixes: 3951e7f050 ("btrfs: add xxhash64 to checksumming algorithms")
Fixes: 3831bf0094 ("btrfs: add sha256 to checksumming algorithm")
CC: stable@vger.kernel.org # 5.5+
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:43 +02:00
Qu Wenruo
2dfb1e43f5 btrfs: preallocate anon block device at first phase of snapshot creation
[BUG]
When the anonymous block device pool is exhausted, subvolume/snapshot
creation fails with EMFILE (Too many files open). This has been reported
by a user. The allocation happens in the second phase during transaction
commit where it's only way out is to abort the transaction

  BTRFS: Transaction aborted (error -24)
  WARNING: CPU: 17 PID: 17041 at fs/btrfs/transaction.c:1576 create_pending_snapshot+0xbc4/0xd10 [btrfs]
  RIP: 0010:create_pending_snapshot+0xbc4/0xd10 [btrfs]
  Call Trace:
   create_pending_snapshots+0x82/0xa0 [btrfs]
   btrfs_commit_transaction+0x275/0x8c0 [btrfs]
   btrfs_mksubvol+0x4b9/0x500 [btrfs]
   btrfs_ioctl_snap_create_transid+0x174/0x180 [btrfs]
   btrfs_ioctl_snap_create_v2+0x11c/0x180 [btrfs]
   btrfs_ioctl+0x11a4/0x2da0 [btrfs]
   do_vfs_ioctl+0xa9/0x640
   ksys_ioctl+0x67/0x90
   __x64_sys_ioctl+0x1a/0x20
   do_syscall_64+0x5a/0x110
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  ---[ end trace 33f2f83f3d5250e9 ]---
  BTRFS: error (device sda1) in create_pending_snapshot:1576: errno=-24 unknown
  BTRFS info (device sda1): forced readonly
  BTRFS warning (device sda1): Skipping commit of aborted transaction.
  BTRFS: error (device sda1) in cleanup_transaction:1831: errno=-24 unknown

[CAUSE]
When the global anonymous block device pool is exhausted, the following
call chain will fail, and lead to transaction abort:

 btrfs_ioctl_snap_create_v2()
 |- btrfs_ioctl_snap_create_transid()
    |- btrfs_mksubvol()
       |- btrfs_commit_transaction()
          |- create_pending_snapshot()
             |- btrfs_get_fs_root()
                |- btrfs_init_fs_root()
                   |- get_anon_bdev()

[FIX]
Although we can't enlarge the anonymous block device pool, at least we
can preallocate anon_dev for subvolume/snapshot in the first phase,
outside of transaction context and exactly at the moment the user calls
the creation ioctl.

Reported-by: Greed Rong <greedrong@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CA+UqX+NTrZ6boGnWHhSeZmEY5J76CTqmYjO2S+=tHJX7nb9DPw@mail.gmail.com/
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:38 +02:00
Nikolay Borisov
e5b7231e20 btrfs: make btrfs_delalloc_reserve_space take btrfs_inode
All of its children take btrfs_inode so bubble up this requirement to
btrfs_delalloc_reserve_space's interface and stop calling BTRFS_I
internally.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:36 +02:00
Nikolay Borisov
86d52921a2 btrfs: make btrfs_delalloc_release_space take btrfs_inode
It needs btrfs_inode so take it as a parameter directly.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:36 +02:00
Nikolay Borisov
c350437269 btrfs: make btrfs_lookup_ordered_extent take btrfs_inode
It doesn't use the generic vfs inode for anything use btrfs_inode
directly.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:25 +02:00
Waiman Long
b091f7fede btrfs: use kfree() in btrfs_ioctl_get_subvol_info()
In btrfs_ioctl_get_subvol_info(), there is a classic case where kzalloc()
was incorrectly paired with kzfree(). According to David Sterba, there
isn't any sensitive information in the subvol_info that needs to be
cleared before freeing. So kzfree() isn't really needed, use kfree()
instead.

Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-06-16 19:24:03 +02:00
David Sterba
0202e83fda btrfs: simplify iget helpers
The inode lookup starting at btrfs_iget takes the full location key,
while only the objectid is used to match the inode, because the lookup
happens inside the given root thus the inode number is unique.
The entire location key is properly set up in btrfs_init_locked_inode.

Simplify the helpers and pass only inode number, renaming it to 'ino'
instead of 'objectid'. This allows to remove temporary variables key,
saving some stack space.

Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:37 +02:00
David Sterba
56e9357a1e btrfs: simplify root lookup by id
The main function to lookup a root by its id btrfs_get_fs_root takes the
whole key, while only using the objectid. The value of offset is preset
to (u64)-1 but not actually used until btrfs_find_root that does the
actual search.

Switch btrfs_get_fs_root to use only objectid and remove all local
variables that existed just for the lookup. The actual key for search is
set up in btrfs_get_fs_root, reusing another key variable.

Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:36 +02:00
Robbie Ko
c11fbb6ed0 btrfs: reduce lock contention when creating snapshot
When creating a snapshot, ordered extents need to be flushed and this
can take a long time.

In create_snapshot there are two locks held when this happens:

  1. Destination directory inode lock
  2. Global subvolume semaphore

This will unnecessarily block other operations like subvolume destroy,
create, or setflag until the snapshot is created.

We can fix that by moving the flush outside the locked section as this
does not depend on the aforementioned locks.  The code factors out the
snapshot related work from create_snapshot to btrfs_mksnapshot.

__btrfs_ioctl_snap_create
  btrfs_mksubvol
    create_subvol
  btrfs_mksnapshot
    <flush>
    btrfs_mksubvol
      create_snapshot

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:36 +02:00
Qu Wenruo
92a7cc4252 btrfs: rename BTRFS_ROOT_REF_COWS to BTRFS_ROOT_SHAREABLE
The name BTRFS_ROOT_REF_COWS is not very clear about the meaning.

In fact, that bit can only be set to those trees:

- Subvolume roots
- Data reloc root
- Reloc roots for above roots

All other trees won't get this bit set.  So just by the result, it is
obvious that, roots with this bit set can have tree blocks shared with
other trees.  Either shared by snapshots, or by reloc roots (an special
snapshot created by relocation).

This patch will rename BTRFS_ROOT_REF_COWS to BTRFS_ROOT_SHAREABLE to
make it easier to understand, and update all comment mentioning
"reference counted" to follow the rename.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:35 +02:00
Nikolay Borisov
9babda9f33 btrfs: Remove async_transid from btrfs_mksubvol/create_subvol/create_snapshot
With BTRFS_SUBVOL_CREATE_ASYNC support remove it's no longer required to
pass the async_transid parameter so remove it and any code using it.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:02:00 +01:00
Nikolay Borisov
5d54c67ecc btrfs: Remove transid argument from btrfs_ioctl_snap_create_transid
btrfs_ioctl_snap_create_transid no longer takes a transid argument, so
remove it and rename the function to __btrfs_ioctl_snap_create to
reflect it's an internal, worker function.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:02:00 +01:00
Nikolay Borisov
9c1036fdb1 btrfs: Remove BTRFS_SUBVOL_CREATE_ASYNC support
This functionality was deprecated in kernel 5.4. Since no one has
complained of the impending removal it's time we did so.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:02:00 +01:00
Filipe Manana
6a17738100 Btrfs: move all reflink implementation code into its own file
The reflink code is quite large and has been living in ioctl.c since ever.
It has grown over the years after many bug fixes and improvements, and
since I'm planning on making some further improvements on it, it's time
to get it better organized by moving into its own file, reflink.c
(similar to what xfs does for example).

This change only moves the code out of ioctl.c into the new file, it
doesn't do any other change.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:54 +01:00
Andy Shevchenko
807fc790aa btrfs: switch to use new generic UUID API
There are new types and helpers that are supposed to be used in new code.

As a preparation to get rid of legacy types and API functions do
the conversion here.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:47 +01:00
Marcos Paulo de Souza
faf8f7b957 btrfs: ioctl: resize: only show message if size is changed
There is no point to inform the user about size change if there's none.
Update the message to conform to a commonly used format where the path
and devid are printed and also print old and new sizes.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Marcos Paulo de Souza <marcos@mpdesouza.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ enhance message ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:46 +01:00
Nikolay Borisov
dcc3eb9638 btrfs: convert snapshot/nocow exlcusion to drew lock
This patch removes all haphazard code implementing nocow writers
exclusion from pending snapshot creation and switches to using the drew
lock to ensure this invariant still holds.

'Readers' are snapshot creators from create_snapshot and 'writers' are
nocow writers from buffered write path or btrfs_setsize. This locking
scheme allows for multiple snapshots to happen while any nocow writers
are blocked, since writes to page cache in the nocow path will make
snapshots inconsistent.

So for performance reasons we'd like to have the ability to run multiple
concurrent snapshots and also favors readers in this case. And in case
there aren't pending snapshots (which will be the majority of the cases)
we rely on the percpu's writers counter to avoid cacheline contention.

The main gain from using the drew lock is it's now a lot easier to
reason about the guarantees of the locking scheme and whether there is
some silent breakage lurking.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:44 +01:00
Marcos Paulo de Souza
949964c928 btrfs: add new BTRFS_IOC_SNAP_DESTROY_V2 ioctl
This ioctl will be responsible for deleting a subvolume using its id.
This can be used when a system has a file system mounted from a
subvolume, rather than the root file system, like below:

/
@subvol1/
@subvol2/
@subvol_default/

If only @subvol_default is mounted, we have no path to reach @subvol1
and @subvol2, thus no way to delete them. Current subvolume delete ioctl
takes a file handle point as argument, and if @subvol_default is
mounted, we can't reach @subvol1 and @subvol2 from the same mount point.

This patch introduces a new ioctl BTRFS_IOC_SNAP_DESTROY_V2 that takes
the extended structure with flags to allow to delete subvolume using
subvolid.

Now, we can use this new ioctl specifying the subvolume id and refer to
the same mount point. It doesn't matter which subvolume was mounted,
since we can reach to the desired one using the subvolume id, and then
delete it.

The full path to the subvolume id is resolved internally and access is
verified as if the subvolume was accessed by path.

The volume args v2 structure is extended to use the existing union for
subvolume id specification, that's valid in case the
BTRFS_SUBVOL_SPEC_BY_ID is set.

Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:42 +01:00
David Sterba
748449cdbe btrfs: use ioctl args support mask for device delete
When the device remove v2 ioctl was added, the full support mask was
added to sanity check the flags. However this would allow to let the
subvolume related flags to be accepted. This is not supposed to happen.

Use the correct support mask, which means that now any of
BTRFS_SUBVOL_CREATE_ASYNC, BTRFS_SUBVOL_RDONLY or
BTRFS_SUBVOL_QGROUP_INHERIT will be rejected as ENOTSUPP. Though this is
a user-visible change, specifying subvolume flags for device deletion
does not make sense and there are hopefully no applications doing that.

Reviewed-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:42 +01:00
David Sterba
673990dba3 btrfs: use ioctl args support mask for subvolume create/delete
Using the defined mask instead of flag enumeration in the ioctl handler
is preferred. No functional changes.

Reviewed-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:42 +01:00
Josef Bacik
0024652895 btrfs: rename btrfs_put_fs_root and btrfs_grab_fs_root
We are now using these for all roots, rename them to btrfs_put_root()
and btrfs_grab_root();

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:33 +01:00
Josef Bacik
bc44d7c4b2 btrfs: push btrfs_grab_fs_root into btrfs_get_fs_root
Now that all callers of btrfs_get_fs_root are subsequently calling
btrfs_grab_fs_root and handling dropping the ref when they are done
appropriately, go ahead and push btrfs_grab_fs_root up into
btrfs_get_fs_root.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:32 +01:00
Josef Bacik
5119cfc36f btrfs: hold a ref on the root in create_pending_snapshot
We create the snapshot and then use it for a bunch of things, we need to
hold a ref on it while we're messing with it.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:31 +01:00
Josef Bacik
2a2b5d6202 btrfs: hold ref on root in btrfs_ioctl_default_subvol
We look up an arbitrary fs root here, we need to hold a ref on the root
for the duration.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:29 +01:00
Josef Bacik
04734e8448 btrfs: hold a ref on the root in btrfs_ioctl_get_subvol_info
We look up whatever root userspace has given us, we need to hold a ref
throughout this operation. Use 'root' only for the on fs root and not as
a temporary variable elsewhere.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:28 +01:00
Josef Bacik
b8a49ae191 btrfs: hold a ref on the root in btrfs_search_path_in_tree_user
We can wander into a different root, so grab a ref on the root we look
up.  Later on we make root = fs_info->tree_root so we need this separate
out label to make sure we do the right cleanup only in the case we're
looking up a different root.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:28 +01:00
Josef Bacik
88234012be btrfs: hold a ref on the root in btrfs_search_path_in_tree
We look up an arbitrary fs root, we need to hold a ref on it while we're
doing our search.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:28 +01:00
Josef Bacik
3ca35e839e btrfs: hold a ref on the root in search_ioctl
We lookup a arbitrary fs root, we need to hold a ref on that root.  If
we're using our own inodes root then grab a ref on that as well to make
the cleanup easier.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:28 +01:00
Josef Bacik
fc92f79856 btrfs: hold a ref on the root in create_subvol
We're creating the new root here, but we should hold the ref until after
we've initialized the inode for it.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:28 +01:00
Josef Bacik
3619c94f07 btrfs: open code btrfs_read_fs_root_no_name
All this does is call btrfs_get_fs_root() with check_ref == true.  Just
use btrfs_get_fs_root() so we don't have a bunch of different helpers
that do the same thing.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:26 +01:00
Josef Bacik
d923afe96d btrfs: replace all uses of btrfs_ordered_update_i_size
Now that we have a safe way to update the i_size, replace all uses of
btrfs_ordered_update_i_size with btrfs_inode_safe_disk_i_size_write.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:24 +01:00
Josef Bacik
790a1d44f9 btrfs: use btrfs_ordered_update_i_size in clone_finish_inode_update
We were using btrfs_i_size_write(), which unconditionally jacks up
inode->disk_i_size.  However since clone can operate on ranges we could
have pending ordered extents for a range prior to the start of our clone
operation and thus increase disk_i_size too far and have a hole with no
file extent.

Fix this by using the btrfs_ordered_update_i_size helper which will do
the right thing in the face of pending ordered extents outside of our
clone range.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:24 +01:00
Linus Torvalds
b5f7ab6b1c fs-dedupe-last-block-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl4vEM0ACgkQxWXV+ddt
 WDvWFA//eHQZjWQ4FNby9AZwkDS4NIq2v6XGg7Xx9Ioxf1NenuBdyt1uNda2I3Mw
 4/0tEnrd3XL3POldx4TTZXlg0itfXeY2un8pGREPwbqGL67296wmENUT8kwXU8aX
 4zMuqO8veO7jzqeibX7T0VntnAC4Q+yWUd8Tb07KXaNH6Jeowo2rbU7n2CdI8T91
 IZdc/AxnmomVQ3Y0lmQe5fL5z9KRkwud7toXek8yVXyheX1JgnqkzVjCgLyk3ai7
 0VP0jfOPtC1qoMPR+buD2VuxaDTrLGkmsaG9l4XZghmmSnAvTMOxIosFA/SLqHwp
 LPDZ5RABbxMj1MHTAr2Bq+9uokqSIIdbSWFiMazeJsL/JWOiOQk0BekNJudeL7Ow
 PtGIZkKC+c3qKIjLymnd+u56klCe/O4vhx7wOHjoGxC0j+qRN473LK8dmGmgM9Yv
 XH5YfWqDpjt5h7SSei94S7JrqaJ29+gY8bMtMXJhx46jHh6AdXAkszU83kLr8Biz
 mvV49ECib12RuI4EVB4I3XedCftzMRH5ocN/Y3B2qGn/bU3g9nuExAc24FRBt6QK
 e/ucc4jQyRpbvpDwsmdE+kSxUvENuIRnlyke9XdOsN9Mth0X85gvYOS6aFS+4qjq
 QiXborlp+fZY00INYjG6Ck9UyfpdSEUJg452xVkGBhd/emAAIDo=
 =FGXQ
 -----END PGP SIGNATURE-----

Merge tag 'fs-dedupe-last-block-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull fs deduplication fix from David Sterba:
 "This is a fix for deduplication bug: the last block of two files is
  allowed to deduplicated. This got broken in 5.1 by lifting some
  generic checks to VFS layer. The affected filesystems are btrfs and
  xfs.

  The patches are marked for stable as the bug decreases deduplication
  effectivity"

* tag 'fs-dedupe-last-block-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Btrfs: make deduplication with range including the last block work
  fs: allow deduplication of eof block into the end of the destination file
2020-01-28 15:18:23 -08:00
Filipe Manana
831d2fa25a Btrfs: make deduplication with range including the last block work
Since btrfs was migrated to use the generic VFS helpers for clone and
deduplication, it stopped allowing for the last block of a file to be
deduplicated when the source file size is not sector size aligned (when
eof is somewhere in the middle of the last block). There are two reasons
for that:

1) The generic code always rounds down, to a multiple of the block size,
   the range's length for deduplications. This means we end up never
   deduplicating the last block when the eof is not block size aligned,
   even for the safe case where the destination range's end offset matches
   the destination file's size. That rounding down operation is done at
   generic_remap_check_len();

2) Because of that, the btrfs specific code does not expect anymore any
   non-aligned range length's for deduplication and therefore does not
   work if such nona-aligned length is given.

This patch addresses that second part, and it depends on a patch that
fixes generic_remap_check_len(), in the VFS, which was submitted ealier
and has the following subject:

  "fs: allow deduplication of eof block into the end of the destination file"

These two patches address reports from users that started seeing lower
deduplication rates due to the last block never being deduplicated when
the file size is not aligned to the filesystem's block size.

Link: https://lore.kernel.org/linux-btrfs/2019-1576167349.500456@svIo.N5dq.dFFD/
CC: stable@vger.kernel.org # 5.1+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-23 18:24:07 +01:00
Omar Sandoval
39b07b5d70 btrfs: drop create parameter to btrfs_get_extent()
We only pass this as 1 from __extent_writepage_io(). The parameter
basically means "pretend I didn't pass in a page". This is silly since
we can simply not pass in the page. Get rid of the parameter from
btrfs_get_extent(), and since it's used as a get_extent_t callback,
remove it from get_extent_t and btree_get_extent(), neither of which
need it.

While we're here, let's document btrfs_get_extent().

Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20 16:40:55 +01:00
Filipe Manana
5afe6ce748 Btrfs: always copy scrub arguments back to user space
If scrub returns an error we are not copying back the scrub arguments
structure to user space. This prevents user space to know how much
progress scrub has done if an error happened - this includes -ECANCELED
which is returned when users ask for scrub to stop. A particular use
case, which is used in btrfs-progs, is to resume scrub after it is
canceled, in that case it relies on checking the progress from the scrub
arguments structure and then use that progress in a call to resume
scrub.

So fix this by always copying the scrub arguments structure to user
space, overwriting the value returned to user space with -EFAULT only if
copying the structure failed to let user space know that either that
copying did not happen, and therefore the structure is stale, or it
happened partially and the structure is probably not valid and corrupt
due to the partial copy.

Reported-by: Graham Cobb <g.btrfs@cobb.uk.net>
Link: https://lore.kernel.org/linux-btrfs/d0a97688-78be-08de-ca7d-bcb4c7fb397e@cobb.uk.net/
Fixes: 06fe39ab15 ("Btrfs: do not overwrite scrub error with fault error in scrub ioctl")
CC: stable@vger.kernel.org # 5.1+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Tested-by: Graham Cobb <g.btrfs@cobb.uk.net>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-17 15:28:52 +01:00
Josef Bacik
c7e54b5102 btrfs: abort transaction after failed inode updates in create_subvol
We can just abort the transaction here, and in fact do that for every
other failure in this function except these two cases.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13 14:10:44 +01:00
Filipe Manana
147271e35b Btrfs: fix hole extent items with a zero size after range cloning
Normally when cloning a file range if we find an implicit hole at the end
of the range we assume it is because the NO_HOLES feature is enabled.
However that is not always the case. One well known case [1] is when we
have a power failure after mixing buffered and direct IO writes against
the same file.

In such cases we need to punch a hole in the destination file, and if
the NO_HOLES feature is not enabled, we need to insert explicit file
extent items to represent the hole. After commit 690a5dbfc5
("Btrfs: fix ENOSPC errors, leading to transaction aborts, when cloning
extents"), we started to insert file extent items representing the hole
with an item size of 0, which is invalid and should be 53 bytes (the size
of a btrfs_file_extent_item structure), resulting in all sorts of
corruptions and invalid memory accesses. This is detected by the tree
checker when we attempt to write a leaf to disk.

The problem can be sporadically triggered by test case generic/561 from
fstests. That test case does not exercise power failure and creates a new
filesystem when it starts, so it does not use a filesystem created by any
previous test that tests power failure. However the test does both
buffered and direct IO writes (through fsstress) and it's precisely that
which is creating the implicit holes in files. That happens even before
the commit mentioned earlier. I need to investigate why we get those
implicit holes to check if there is a real problem or not. For now this
change fixes the regression of introducing file extent items with an item
size of 0 bytes.

Fix the issue by calling btrfs_punch_hole_range() without passing a
btrfs_clone_extent_info structure, which ensures file extent items are
inserted to represent the hole with a correct item size. We were passing
a btrfs_clone_extent_info with a value of 0 for its 'item_size' field,
which was causing the insertion of file extent items with an item size
of 0.

[1] https://www.spinics.net/lists/linux-btrfs/msg75350.html

Reported-by: David Sterba <dsterba@suse.com>
Fixes: 690a5dbfc5 ("Btrfs: fix ENOSPC errors, leading to transaction aborts, when cloning extents")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13 14:10:28 +01:00
David Sterba
32da5386d9 btrfs: rename btrfs_block_group_cache
The type name is misleading, a single entry is named 'cache' while this
normally means a collection of objects. Rename that everywhere. Also the
identifier was quite long, making function prototypes harder to format.

Suggested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 17:51:51 +01:00
David Sterba
b3470b5dbe btrfs: add dedicated members for start and length of a block group
The on-disk format of block group item makes use of the key that stores
the offset and length. This is further used in the code, although this
makes thing harder to understand. The key is also packed so the
offset/length is not properly aligned as u64.

Add start (key.objectid) and length (key.offset) members to block group
and remove the embedded key.  When the item is searched or written, a
local variable for key is used.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 17:51:45 +01:00
David Sterba
bf38be65f3 btrfs: move block_group_item::used to block group
For unknown reasons, the member 'used' in the block group struct is
stored in the b-tree item and accessed everywhere using the special
accessor helper. Let's unify it and make it a regular member and only
update the item before writing it to the tree.

The item is still being used for flags and chunk_objectid, there's some
duplication until the item is removed in following patches.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 17:51:44 +01:00
Marcos Paulo de Souza
b929c1d831 btrfs: ioctl: Try to use btrfs_fs_info instead of *file
Some functions are doing some unnecessary indirection to reach the
btrfs_fs_info struct. Change these functions to receive a btrfs_fs_info
struct instead of a *file.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:55 +01:00
Chengguang Xu
ce96b7ffd1 btrfs: use better definition of number of compression type
The compression type upper limit constant is the same as the last value
and this is confusing.  In order to keep coding style consistent, use
BTRFS_NR_COMPRESS_TYPES as the total number that follows the idom of
'NR' being one more than the last value.

Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:55 +01:00
David Sterba
e1f60a6580 btrfs: add __pure attribute to functions
The attribute is more relaxed than const and the functions could
dereference pointers, as long as the observable state is not changed. We
do have such functions, based on -Wsuggest-attribute=pure .

The visible effects of this patch are negligible, there are differences
in the assembly but hard to summarize.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:52 +01:00
David Sterba
4c66e0d424 btrfs: drop unused parameter is_new from btrfs_iget
The parameter is now always set to NULL and could be dropped. The last
user was get_default_root but that got reworked in 05dbe6837b ("Btrfs:
unify subvol= and subvolid= mounting") and the parameter became unused.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:52 +01:00
David Sterba
a5009d3a31 btrfs: un-deprecate ioctls START_SYNC and WAIT_SYNC
The two ioctls START_SYNC and WAIT_SYNC were mistakenly marked as
deprecated and scheduled for removal but we actualy do use them for
'btrfs subvolume delete -C/-c'. The deprecated thing in ebc87351e5
should have been just the async flag for subvolume creation.

The deprecation has been added in this development cycle, remove it
until it's time.

Fixes: ebc87351e5 ("btrfs: Deprecate BTRFS_SUBVOL_CREATE_ASYNC flag")
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-04 21:42:01 +01:00
Qu Wenruo
8702ba9396 btrfs: qgroup: Always free PREALLOC META reserve in btrfs_delalloc_release_extents()
[Background]
Btrfs qgroup uses two types of reserved space for METADATA space,
PERTRANS and PREALLOC.

PERTRANS is metadata space reserved for each transaction started by
btrfs_start_transaction().
While PREALLOC is for delalloc, where we reserve space before joining a
transaction, and finally it will be converted to PERTRANS after the
writeback is done.

[Inconsistency]
However there is inconsistency in how we handle PREALLOC metadata space.

The most obvious one is:
In btrfs_buffered_write():
	btrfs_delalloc_release_extents(BTRFS_I(inode), reserve_bytes, true);

We always free qgroup PREALLOC meta space.

While in btrfs_truncate_block():
	btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize, (ret != 0));

We only free qgroup PREALLOC meta space when something went wrong.

[The Correct Behavior]
The correct behavior should be the one in btrfs_buffered_write(), we
should always free PREALLOC metadata space.

The reason is, the btrfs_delalloc_* mechanism works by:
- Reserve metadata first, even it's not necessary
  In btrfs_delalloc_reserve_metadata()

- Free the unused metadata space
  Normally in:
  btrfs_delalloc_release_extents()
  |- btrfs_inode_rsv_release()
     Here we do calculation on whether we should release or not.

E.g. for 64K buffered write, the metadata rsv works like:

/* The first page */
reserve_meta:	num_bytes=calc_inode_reservations()
free_meta:	num_bytes=0
total:		num_bytes=calc_inode_reservations()
/* The first page caused one outstanding extent, thus needs metadata
   rsv */

/* The 2nd page */
reserve_meta:	num_bytes=calc_inode_reservations()
free_meta:	num_bytes=calc_inode_reservations()
total:		not changed
/* The 2nd page doesn't cause new outstanding extent, needs no new meta
   rsv, so we free what we have reserved */

/* The 3rd~16th pages */
reserve_meta:	num_bytes=calc_inode_reservations()
free_meta:	num_bytes=calc_inode_reservations()
total:		not changed (still space for one outstanding extent)

This means, if btrfs_delalloc_release_extents() determines to free some
space, then those space should be freed NOW.
So for qgroup, we should call btrfs_qgroup_free_meta_prealloc() other
than btrfs_qgroup_convert_reserved_meta().

The good news is:
- The callers are not that hot
  The hottest caller is in btrfs_buffered_write(), which is already
  fixed by commit 336a8bb8e3 ("btrfs: Fix wrong
  btrfs_delalloc_release_extents parameter"). Thus it's not that
  easy to cause false EDQUOT.

- The trans commit in advance for qgroup would hide the bug
  Since commit f5fef45936 ("btrfs: qgroup: Make qgroup async transaction
  commit more aggressive"), when btrfs qgroup metadata free space is slow,
  it will try to commit transaction and free the wrongly converted
  PERTRANS space, so it's not that easy to hit such bug.

[FIX]
So to fix the problem, remove the @qgroup_free parameter for
btrfs_delalloc_release_extents(), and always pass true to
btrfs_inode_rsv_release().

Reported-by: Filipe Manana <fdmanana@suse.com>
Fixes: 43b18595d6 ("btrfs: qgroup: Use separate meta reservation type for delalloc")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-10-15 18:50:07 +02:00
Omar Sandoval
e182163d9c btrfs: stop clearing EXTENT_DIRTY in inode I/O tree
Since commit fee187d9d9 ("Btrfs: do not set EXTENT_DIRTY along with
EXTENT_DELALLOC"), we never set EXTENT_DIRTY in inode->io_tree, so we
can simplify and stop trying to clear it.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09 14:59:17 +02:00
Nikolay Borisov
ebc87351e5 btrfs: Deprecate BTRFS_SUBVOL_CREATE_ASYNC flag
Support for asynchronous snapshot creation was originally added in
72fd032e94 ("Btrfs: add SNAP_CREATE_ASYNC ioctl") to cater for
ceph's backend needs. However, since Ceph has deprecated support for
btrfs there is no longer need for that support in btrfs. Additionally,
this was never supported by btrfs-progs, the official userspace tools.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09 14:59:14 +02:00
David Sterba
f10152bcc9 btrfs: sysfs: replace direct access to feature set names with a helper
In order to unexport the feature type array, add a helper for the
enum-to-string conversion.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09 14:59:07 +02:00
Josef Bacik
aac0023c21 btrfs: move basic block_group definitions to their own header
This is prep work for moving all of the block group cache code into its
own file.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor comment updates ]
Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09 14:59:03 +02:00
Filipe Manana
b64119b5f0 Btrfs: remove unnecessary condition in btrfs_clone() to avoid too much nesting
The bulk of the work done when cloning extents, at ioctl.c:btrfs_clone(),
is done inside an if statement that checks if the found key has the type
BTRFS_EXTENT_DATA_KEY. That if statement is redundant however, because
btrfs_search_slot() always leaves us in a leaf slot that points to a key
that is always greater then or equals to the search key, and our search
key here has that type, and because we bail out before that if statement
if the key at the given leaf slot is greater then BTRFS_EXTENT_DATA_KEY.

Therefore just remove that if statement, not only because it is useless
but mostly because it increases the nesting/indentation level in this
function which is quite big and makes things a bit awkward whenever I need
to fix something that requires changing btrfs_clone() (and it has been
like that for many years already).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09 14:59:02 +02:00
Eric Sandeen
40cf931fa8 btrfs: use common vfs LABEL ioctl definitions
I lifted the btrfs label get/set ioctls to the vfs some time ago, but
never followed up to use those common definitions directly in btrfs.

This patch does that.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09 14:58:59 +02:00
Filipe Manana
690a5dbfc5 Btrfs: fix ENOSPC errors, leading to transaction aborts, when cloning extents
When cloning extents (or deduplicating) we create a transaction with a
space reservation that considers we will drop or update a single file
extent item of the destination inode (that we modify a single leaf). That
is fine for the vast majority of scenarios, however it might happen that
we need to drop many file extent items, and adjust at most two file extent
items, in the destination root, which can span multiple leafs. This will
lead to either the call to btrfs_drop_extents() to fail with ENOSPC or
the subsequent calls to btrfs_insert_empty_item() or btrfs_update_inode()
(called through clone_finish_inode_update()) to fail with ENOSPC. Such
failure results in a transaction abort, leaving the filesystem in a
read-only mode.

In order to fix this we need to follow the same approach as the hole
punching code, where we create a local reservation with 1 unit and keep
ending and starting transactions, after balancing the btree inode,
when __btrfs_drop_extents() returns ENOSPC. So fix this by making the
extent cloning call calls the recently added btrfs_punch_hole_range()
helper, which is what does the mentioned work for hole punching, and
make sure whenever we drop extent items in a transaction, we also add a
replacing file extent item, to avoid corruption (a hole) if after ending
a transaction and before starting a new one, the old transaction gets
committed and a power failure happens before we finish cloning.

A test case for fstests follows soon.

Reported-by: David Goodwin <david@codepoets.co.uk>
Link: https://lore.kernel.org/linux-btrfs/a4a4cf31-9cf4-e52c-1f86-c62d336c9cd1@codepoets.co.uk/
Reported-by: Sam Tygier <sam@tygier.co.uk>
Link: https://lore.kernel.org/linux-btrfs/82aace9f-a1e3-1f0b-055f-3ea75f7a41a0@tygier.co.uk/
Fixes: b6f3409b21 ("Btrfs: reserve sufficient space for ioctl clone")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09 14:58:58 +02:00
Linus Torvalds
a18f877541 for-5.3-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl0sNWYACgkQxWXV+ddt
 WDsyQA/8CGnF68g6hwVuYz4K7f39gOiFlBnRxeN/3RT6vkNSyLZxvRDaDrSTzVIo
 cz2G/9qZLXsIll+3EfZlyzZZiA+4f4hEDAfAd4yVPavRom+uu7dbqzAIpgvFlYdH
 vhAYKOeWSqWElWJ06hzWO3FCwjY9GKFMk4PS0XHHp+STCT0hq1MkaHr44kiHsqdh
 T5nVGDwXz8nGDZ51RO6+mgiSrd5eHbs6kXCd8rW7hmjTx8ClKHa1tdkxN/us+pJm
 hTFT669m5ckHhY2AUKmkREoOwpnt2HcXQJNkz6gO+o03IDvYz73SScbhSYdNTlwi
 j74GLf89FA52qVM+JDg9MaWYqgf1pQI8AHK/rXw2FNbuP/eL9kuZ85ZIbO6CiO0c
 5jAixReSwzSP/V0+MKW3F7k4KtIqbHAV6mkI8zLwrAee4Xj81BOtgL7gYPFQTwSZ
 ma0hEoen7IV5+/z9upUuLA5wr4BT+h1T+EllCWe1+9+9mRYOvowtkRNBL8HZWTDI
 b65oTITfot54xX9ecKtiuG2qoqJEjjkR+YKdRM4nph6wflSNZxEoezBp3iRFpYOL
 Lx+g97RcJ2EEoBVjVMkTqfj93GeiKRifa8yXdRY+A0I2ZXZEcS8DjSJM6rj3AOPy
 4idIl+ABscayZowfqu0FSIULf1La0qiRXmbGNeG4ylhN4L6S/og=
 =eshk
 -----END PGP SIGNATURE-----

Merge tag 'for-5.3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "Highlights:

   - chunks that have been trimmed and unchanged since last mount are
     tracked and skipped on repeated trims

   - use hw assissed crc32c on more arches, speedups if native
     instructions or optimized implementation is available

   - the RAID56 incompat bit is automatically removed when the last
     block group of that type is removed

  Fixes:

   - fsync fix for reflink on NODATACOW files that could lead to ENOSPC

   - fix data loss after inode eviction, renaming it, and fsync it

   - fix fsync not persisting dentry deletions due to inode evictions

   - update ctime/mtime/iversion after hole punching

   - fix compression type validation (reported by KASAN)

   - send won't be allowed to start when relocation is in progress, this
     can cause spurious errors or produce incorrect send stream

  Core:

   - new tracepoints for space update

   - tree-checker: better check for end of extents for some tree items

   - preparatory work for more checksum algorithms

   - run delayed iput at unlink time and don't push the work to cleaner
     thread where it's not properly throttled

   - wrap block mapping to structures and helpers, base for further
     refactoring

   - split large files, part 1:
       - space info handling
       - block group reservations
       - delayed refs
       - delayed allocation

   - other cleanups and refactoring"

* tag 'for-5.3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (103 commits)
  btrfs: fix memory leak of path on error return path
  btrfs: move the subvolume reservation stuff out of extent-tree.c
  btrfs: migrate the delalloc space stuff to it's own home
  btrfs: migrate btrfs_trans_release_chunk_metadata
  btrfs: migrate the delayed refs rsv code
  btrfs: Evaluate io_tree in find_lock_delalloc_range()
  btrfs: migrate the global_block_rsv helpers to block-rsv.c
  btrfs: migrate the block-rsv code to block-rsv.c
  btrfs: stop using block_rsv_release_bytes everywhere
  btrfs: cleanup the target logic in __btrfs_block_rsv_release
  btrfs: export __btrfs_block_rsv_release
  btrfs: export btrfs_block_rsv_add_bytes
  btrfs: move btrfs_block_rsv definitions into it's own header
  btrfs: Simplify update of space_info in __reserve_metadata_bytes()
  btrfs: unexport can_overcommit
  btrfs: move reserve_metadata_bytes and supporting code to space-info.c
  btrfs: move dump_space_info to space-info.c
  btrfs: export block_rsv_use_bytes
  btrfs: move btrfs_space_info_add_*_bytes to space-info.c
  btrfs: move the space info update macro to space-info.h
  ...
2019-07-16 15:12:56 -07:00
Linus Torvalds
5010fe9f09 New for 5.3:
- Standardize parameter checking for the SETFLAGS and FSSETXATTR ioctls
   (which were the file attribute setters for ext4 and xfs and have now
   been hoisted to the vfs)
 - Only allow the DAX flag to be set on files and directories.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl0aJgMACgkQ+H93GTRK
 tOuKkg//SJaxcB63uVPZk9hDraYTmyo9OXRRX6X9WwDKPTWwa88CUwS1ny1QF7Mt
 zMkgzG2/y2Rs9PQ0ARoPbh1hNb2CXnvA+xnzUEev1MW6UN/nTFMZEOPn2ZQ+DxQE
 gg/0U56kKgtjtXzBZVpTgHzSETivdXwHxFW3hiTtyRXg+4ulgDIZLOjN2wRB+Pdb
 X8ZmM6MqKOTbhQEXlw13TlCKBzoMjC1w4UU4rkZPjoSjAaUWiPfrk/XU7qgguf9p
 v1dbSN2dADQ19jzZ1dmggXnlJsRMZjk/ls5rxJlB5DHDbh6YgnA2TE+tYrtH28eB
 uyKfD+RQnMzRVdmH8PsMQRQQFXR2UYyprVP7a6wi6TkB+gytn7sR5uT4sbAhmhcF
 TiTYfYNRXzemHCewyOwOsUE/7oCeiJcdbqiPAHHD/jYLZfRjSXDcGzz3+7ZYZ3GO
 hRxUhpxHPbkmK4T2OxhzReCbRsLN/0BeEcDdLkNWmi2FTh3V1gYzMGkgI9wsVbsd
 pHjoGIHbMPWqktF/obuGq96WVfYBBaWJ6WNzQqKT4dQYAJBW2omxitXQHLpi6cjt
 hG5ncxa3cPpWx4t3Lx2hb0TPS7RyYvuoQIcS/Me2RWioxrwWrgnOqdHFfLEwWpfN
 jRowdWiGgOIsq8hMt7qycmGCXzbgsbaA/7oRqh8TiwM9taPOM4c=
 =uH2E
 -----END PGP SIGNATURE-----

Merge tag 'vfs-fix-ioctl-checking-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull common SETFLAGS/FSSETXATTR parameter checking from Darrick Wong:
 "Here's a patch series that sets up common parameter checking functions
  for the FS_IOC_SETFLAGS and FS_IOC_FSSETXATTR ioctl implementations.

  The goal here is to reduce the amount of behaviorial variance between
  the filesystems where those ioctls originated (ext2 and XFS,
  respectively) and everybody else.

   - Standardize parameter checking for the SETFLAGS and FSSETXATTR
     ioctls (which were the file attribute setters for ext4 and xfs and
     have now been hoisted to the vfs)

   - Only allow the DAX flag to be set on files and directories"

* tag 'vfs-fix-ioctl-checking-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  vfs: only allow FSSETXATTR to set DAX flag on files and dirs
  vfs: teach vfs_ioc_fssetxattr_check to check extent size hints
  vfs: teach vfs_ioc_fssetxattr_check to check project id info
  vfs: create a generic checking function for FS_IOC_FSSETXATTR
  vfs: create a generic checking and prep function for FS_IOC_SETFLAGS
2019-07-12 16:54:37 -07:00
Linus Torvalds
e6983afd92 \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAl0kWgwACgkQnJ2qBz9k
 QNkkdggA7bdy6xZRPdumZMxtASGDs1JJ4diNs+apgyc6wUfsT1lCE2ap20EdzfzK
 drAvlJt1vYEW+6apOzUXJ0qWXMVRzy4XRl+jVMO9GW6BoY4OyJQ86AQZlEv1zZ4n
 vxeYnlbxA7JyfkWgup0ZSb5EKRSO1eSxZKEZou0wu2jRCRr/E5RyjPQHXaiE5ihc
 7ilEtTI3Qg3nnAK30F0Iy0X3lGqgXj+rlJ0TgR8BBEDllct2wV16vvMl/Sy+BXip
 5sSWjSy8zntMnkSN8yH/oJN0D+fqmCsnYafwqTpPek8izvEz4xpjshbWTDnPm0HM
 eiMC1U3ZJoD3Z4/wxRZ91m60VYgJBA==
 =SVKR
 -----END PGP SIGNATURE-----

Merge tag 'fsnotify_for_v5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull fsnotify updates from Jan Kara:
 "This contains cleanups of the fsnotify name removal hook and also a
  patch to disable fanotify permission events for 'proc' filesystem"

* tag 'fsnotify_for_v5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  fsnotify: get rid of fsnotify_nameremove()
  fsnotify: move fsnotify_nameremove() hook out of d_delete()
  configfs: call fsnotify_rmdir() hook
  debugfs: call fsnotify_{unlink,rmdir}() hooks
  debugfs: simplify __debugfs_remove_file()
  devpts: call fsnotify_unlink() hook
  tracefs: call fsnotify_{unlink,rmdir}() hooks
  rpc_pipefs: call fsnotify_{unlink,rmdir}() hooks
  btrfs: call fsnotify_rmdir() hook
  fsnotify: add empty fsnotify_{unlink,rmdir}() hooks
  fanotify: Disallow permission events for proc filesystem
2019-07-10 20:09:17 -07:00
Josef Bacik
867363429d btrfs: migrate the delalloc space stuff to it's own home
We have code for data and metadata reservations for delalloc.  There's
quite a bit of code here, and it's used in a lot of places so I've
separated it out to it's own file.  inode.c and file.c are already
pretty large, and this code is complicated enough to live in its own
space.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-04 17:26:17 +02:00
Josef Bacik
8719aaae8d btrfs: move space_info to space-info.h
Migrate the struct definition and the one helper that's in ctree.h into
space-info.h

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02 12:30:51 +02:00
Darrick J. Wong
7b0e492e6b vfs: create a generic checking function for FS_IOC_FSSETXATTR
Create a generic checking function for the incoming FS_IOC_FSSETXATTR
fsxattr values so that we can standardize some of the implementation
behaviors.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2019-07-01 08:25:35 -07:00
Darrick J. Wong
5aca284210 vfs: create a generic checking and prep function for FS_IOC_SETFLAGS
Create a generic function to check incoming FS_IOC_SETFLAGS flag values
and later prepare the inode for updates so that we can standardize the
implementations that follow ext4's flag values.

Note that the efivarfs implementation no longer fails a no-op SETFLAGS
without CAP_LINUX_IMMUTABLE since that's the behavior in ext*.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
2019-07-01 08:25:34 -07:00
Qu Wenruo
a94d1d0cb3 btrfs: Flush before reflinking any extent to prevent NOCOW write falling back to COW without data reservation
[BUG]
The following script can cause unexpected fsync failure:

  #!/bin/bash

  dev=/dev/test/test
  mnt=/mnt/btrfs

  mkfs.btrfs -f $dev -b 512M > /dev/null
  mount $dev $mnt -o nospace_cache

  # Prealloc one extent
  xfs_io -f -c "falloc 8k 64m" $mnt/file1
  # Fill the remaining data space
  xfs_io -f -c "pwrite 0 -b 4k 512M" $mnt/padding
  sync

  # Write into the prealloc extent
  xfs_io -c "pwrite 1m 16m" $mnt/file1

  # Reflink then fsync, fsync would fail due to ENOSPC
  xfs_io -c "reflink $mnt/file1 8k 0 4k" -c "fsync" $mnt/file1
  umount $dev

The fsync fails with ENOSPC, and the last page of the buffered write is
lost.

[CAUSE]
This is caused by:
- Btrfs' back reference only has extent level granularity
  So write into shared extent must be COWed even only part of the extent
  is shared.

So for above script we have:
- fallocate
  Create a preallocated extent where we can do NOCOW write.

- fill all the remaining data and unallocated space

- buffered write into preallocated space
  As we have not enough space available for data and the extent is not
  shared (yet) we fall into NOCOW mode.

- reflink
  Now part of the large preallocated extent is shared, later write
  into that extent must be COWed.

- fsync triggers writeback
  But now the extent is shared and therefore we must fallback into COW
  mode, which fails with ENOSPC since there's not enough space to
  allocate data extents.

[WORKAROUND]
The workaround is to ensure any buffered write in the related extents
(not just the reflink source range) get flushed before reflink/dedupe,
so that NOCOW writes succeed that happened before reflinking succeed.

The workaround is expensive, we could do it better by only flushing
NOCOW range, but that needs extra accounting for NOCOW range.
For now, fix the possible data loss first.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01 13:35:00 +02:00
Amir Goldstein
46008d9d3f btrfs: call fsnotify_rmdir() hook
This will allow generating fsnotify delete events after the
fsnotify_nameremove() hook is removed from d_delete().

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-06-20 14:45:04 +02:00
Filipe Manana
3763771cf6 Btrfs: fix failure to persist compression property xattr deletion on fsync
After the recent series of cleanups in the properties and xattrs modules
that landed in the 5.2 merge window, we ended up with a regression where
after deleting the compression xattr property through the setflags ioctl,
we don't set the BTRFS_INODE_COPY_EVERYTHING flag in the inode anymore.
As a consequence, if the inode was fsync'ed when it had the compression
property set, after deleting the compression property through the setflags
ioctl and fsync'ing again the inode, the log will still contain the
compression xattr, because the inode did not had that bit set, which
made the fsync not delete all xattrs from the log and copy all xattrs
from the subvolume tree to the log tree.

This regression happens due to the fact that that series of cleanups
made btrfs_set_prop() call the old function do_setxattr() (which is now
named btrfs_setxattr()), and not the old version of btrfs_setxattr(),
which is now called btrfs_setxattr_trans().

Fix this by setting the BTRFS_INODE_COPY_EVERYTHING bit in the current
btrfs_setxattr() function and remove it from everywhere else, including
its setup at btrfs_ioctl_setflags(). This is cleaner, avoids similar
regressions in the future, and centralizes the setup of the bit. After
all, the need to setup this bit should only be in the xattrs module,
since it is an implementation of xattrs.

Fixes: 04e6863b19 ("btrfs: split btrfs_setxattr calls regarding transaction")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-06-17 16:37:17 +02:00
Anand Jain
44e5194b5e btrfs: drop local copy of inode i_mode
There isn't real use of making struct inode::i_mode a local copy, it
saves a dereference one time, not much. Just use it directly.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:53 +02:00
Anand Jain
3c8d8b6357 btrfs: drop old_fsflags in btrfs_ioctl_setflags
btrfs_inode_flags_to_fsflags() is copied into @old_fsflags and used only
once. Instead used it directly.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:53 +02:00
Anand Jain
d2b8fcfe43 btrfs: modify local copy of btrfs_inode flags
Instead of updating the binode::flags directly, update a local copy, and
then at the point of no error, store copy it to the binode::flags.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:53 +02:00
Anand Jain
11d3cd5c62 btrfs: drop useless inode i_flags copy and restore
The patch ("btrfs: start transaction in btrfs_ioctl_setflags()") used
btrfs_set_prop() instead of btrfs_set_prop_trans() by which now the
inode::i_flags update functions such as
btrfs_sync_inode_flags_to_i_flags() and btrfs_update_inode() is called
in btrfs_ioctl_setflags() instead of
btrfs_set_prop_trans()->btrfs_setxattr() as earlier. So the
inode::i_flags remains unmodified until the thread has checked all the
conditions. So drop the saved inode::i_flags in out_i_flags.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:53 +02:00
Anand Jain
ff9fef559b btrfs: start transaction in btrfs_ioctl_setflags()
Inode attribute can be set through the FS_IOC_SETFLAGS ioctl.  This
flags also includes compression attribute for which we would set/reset
the compression extended attribute. While doing this there is a bit of
duplicate code, the following things happens twice:

- start/end_transaction
- inode_inc_iversion()
- current_time update to inode->i_ctime
- and btrfs_update_inode()

These are updated both at btrfs_ioctl_setflags() and btrfs_set_props()
as well.  This patch merges these two duplicate codes at
btrfs_ioctl_setflags().

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:53 +02:00
Anand Jain
f22125e5d8 btrfs: refactor btrfs_set_props to validate externally
In preparation to merge multiple transactions when setting the
compression flags, split btrfs_set_props() validation part outside of
it.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:52 +02:00
Filipe Manana
62d54f3a7f Btrfs: fix race between send and deduplication that lead to failures and crashes
Send operates on read only trees and expects them to never change while it
is using them. This is part of its initial design, and this expection is
due to two different reasons:

1) When it was introduced, no operations were allowed to modifiy read-only
   subvolumes/snapshots (including defrag for example).

2) It keeps send from having an impact on other filesystem operations.
   Namely send does not need to keep locks on the trees nor needs to hold on
   to transaction handles and delay transaction commits. This ends up being
   a consequence of the former reason.

However the deduplication feature was introduced later (on September 2013,
while send was introduced in July 2012) and it allowed for deduplication
with destination files that belong to read-only trees (subvolumes and
snapshots).

That means that having a send operation (either full or incremental) running
in parallel with a deduplication that has the destination inode in one of
the trees used by the send operation, can result in tree nodes and leaves
getting freed and reused while send is using them. This problem is similar
to the problem solved for the root nodes getting freed and reused when a
snapshot is made against one tree that is currenly being used by a send
operation, fixed in commits [1] and [2]. These commits explain in detail
how the problem happens and the explanation is valid for any node or leaf
that is not the root of a tree as well. This problem was also discussed
and explained recently in a thread [3].

The problem is very easy to reproduce when using send with large trees
(snapshots) and just a few concurrent deduplication operations that target
files in the trees used by send. A stress test case is being sent for
fstests that triggers the issue easily. The most common error to hit is
the send ioctl return -EIO with the following messages in dmesg/syslog:

 [1631617.204075] BTRFS error (device sdc): did not find backref in send_root. inode=63292, offset=0, disk_byte=5228134400 found extent=5228134400
 [1631633.251754] BTRFS error (device sdc): parent transid verify failed on 32243712 wanted 24 found 27

The first one is very easy to hit while the second one happens much less
frequently, except for very large trees (in that test case, snapshots
with 100000 files having large xattrs to get deep and wide trees).
Less frequently, at least one BUG_ON can be hit:

 [1631742.130080] ------------[ cut here ]------------
 [1631742.130625] kernel BUG at fs/btrfs/ctree.c:1806!
 [1631742.131188] invalid opcode: 0000 [#6] SMP DEBUG_PAGEALLOC PTI
 [1631742.131726] CPU: 1 PID: 13394 Comm: btrfs Tainted: G    B D W         5.0.0-rc8-btrfs-next-45 #1
 [1631742.132265] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
 [1631742.133399] RIP: 0010:read_node_slot+0x122/0x130 [btrfs]
 (...)
 [1631742.135061] RSP: 0018:ffffb530021ebaa0 EFLAGS: 00010246
 [1631742.135615] RAX: ffff93ac8912e000 RBX: 000000000000009d RCX: 0000000000000002
 [1631742.136173] RDX: 000000000000009d RSI: ffff93ac564b0d08 RDI: ffff93ad5b48c000
 [1631742.136759] RBP: ffffb530021ebb7d R08: 0000000000000001 R09: ffffb530021ebb7d
 [1631742.137324] R10: ffffb530021eba70 R11: 0000000000000000 R12: ffff93ac87d0a708
 [1631742.137900] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
 [1631742.138455] FS:  00007f4cdb1528c0(0000) GS:ffff93ad76a80000(0000) knlGS:0000000000000000
 [1631742.139010] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [1631742.139568] CR2: 00007f5acb3d0420 CR3: 000000012be3e006 CR4: 00000000003606e0
 [1631742.140131] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 [1631742.140719] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 [1631742.141272] Call Trace:
 [1631742.141826]  ? do_raw_spin_unlock+0x49/0xc0
 [1631742.142390]  tree_advance+0x173/0x1d0 [btrfs]
 [1631742.142948]  btrfs_compare_trees+0x268/0x690 [btrfs]
 [1631742.143533]  ? process_extent+0x1070/0x1070 [btrfs]
 [1631742.144088]  btrfs_ioctl_send+0x1037/0x1270 [btrfs]
 [1631742.144645]  _btrfs_ioctl_send+0x80/0x110 [btrfs]
 [1631742.145161]  ? trace_sched_stick_numa+0xe0/0xe0
 [1631742.145685]  btrfs_ioctl+0x13fe/0x3120 [btrfs]
 [1631742.146179]  ? account_entity_enqueue+0xd3/0x100
 [1631742.146662]  ? reweight_entity+0x154/0x1a0
 [1631742.147135]  ? update_curr+0x20/0x2a0
 [1631742.147593]  ? check_preempt_wakeup+0x103/0x250
 [1631742.148053]  ? do_vfs_ioctl+0xa2/0x6f0
 [1631742.148510]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
 [1631742.148942]  do_vfs_ioctl+0xa2/0x6f0
 [1631742.149361]  ? __fget+0x113/0x200
 [1631742.149767]  ksys_ioctl+0x70/0x80
 [1631742.150159]  __x64_sys_ioctl+0x16/0x20
 [1631742.150543]  do_syscall_64+0x60/0x1b0
 [1631742.150931]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
 [1631742.151326] RIP: 0033:0x7f4cd9f5add7
 (...)
 [1631742.152509] RSP: 002b:00007ffe91017708 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
 [1631742.152892] RAX: ffffffffffffffda RBX: 0000000000000105 RCX: 00007f4cd9f5add7
 [1631742.153268] RDX: 00007ffe91017790 RSI: 0000000040489426 RDI: 0000000000000007
 [1631742.153633] RBP: 0000000000000007 R08: 00007f4cd9e79700 R09: 00007f4cd9e79700
 [1631742.153999] R10: 00007f4cd9e799d0 R11: 0000000000000202 R12: 0000000000000003
 [1631742.154365] R13: 0000555dfae53020 R14: 0000000000000000 R15: 0000000000000001
 (...)
 [1631742.156696] ---[ end trace 5dac9f96dcc3fd6b ]---

That BUG_ON happens because while send is using a node, that node is COWed
by a concurrent deduplication, gets freed and gets reused as a leaf (because
a transaction commit happened in between), so when it attempts to read a
slot from the extent buffer, at ctree.c:read_node_slot(), the extent buffer
contents were wiped out and it now matches a leaf (which can even belong to
some other tree now), hitting the BUG_ON(level == 0).

Fix this concurrency issue by not allowing send and deduplication to run
in parallel if both operate on the same readonly trees, returning EAGAIN
to user space and logging an exlicit warning in dmesg/syslog.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=be6821f82c3cc36e026f5afd10249988852b35ea
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6f2f0b394b54e2b159ef969a0b5274e9bbf82ff2
[3] https://lore.kernel.org/linux-btrfs/CAL3q7H7iqSEEyFaEtpRZw3cp613y+4k2Q8b4W7mweR3tZA05bQ@mail.gmail.com/

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:52 +02:00
Qu Wenruo
82fa113fcc btrfs: extent-tree: Use btrfs_ref to refactor btrfs_inc_extent_ref()
Use the new btrfs_ref structure and replace parameter list to clean up
the usage of owner and level to distinguish the extent types.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:49 +02:00
Goldwyn Rodrigues
7984ae52bb btrfs: Perform locking/unlocking in btrfs_remap_file_range()
Move code to make it more readable, so as locking and unlocking is
done in the same function. The generic checks that are now performed in
the locked section are unaffected.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:29 +02:00
Anand Jain
262c96a3c3 btrfs: refactor btrfs_set_prop and add btrfs_set_prop_trans
btrfs_set_prop() takes transaction pointer as the first argument,
however in ioctl.c for the purpose of setting the compression property,
we call btrfs_set_prop() with NULL transaction pointer. Down in
the call chain  btrfs_setxattr() starts transaction to update the
attribute and also to update the inode.

So for clarity, create btrfs_set_prop_trans() with no transaction
pointer as argument, in preparation to start transaction here instead of
doing it down the call chain at btrfs_setxattr().

Also now the btrfs_set_prop() is a static function.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:19 +02:00
Anand Jain
7715da84f7 btrfs: merge _btrfs_set_prop helpers
btrfs_set_prop() is a redirect to __btrfs_set_prop() with the
transaction handle equal to NULL.  __btrfs_set_prop() in turn passes
this to do_setxattr() which then transaction is actually created.

Instead merge  __btrfs_set_prop() to btrfs_set_prop(), and update the
caller with NULL argument.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:19 +02:00
Filipe Manana
f35f06c355 Btrfs: do not allow trimming when a fs is mounted with the nologreplay option
Whan a filesystem is mounted with the nologreplay mount option, which
requires it to be mounted in RO mode as well, we can not allow discard on
free space inside block groups, because log trees refer to extents that
are not pinned in a block group's free space cache (pinning the extents is
precisely the first phase of replaying a log tree).

So do not allow the fitrim ioctl to do anything when the filesystem is
mounted with the nologreplay option, because later it can be mounted RW
without that option, which causes log replay to happen and result in
either a failure to replay the log trees (leading to a mount failure), a
crash or some silent corruption.

Reported-by: Darrick J. Wong <darrick.wong@oracle.com>
Fixes: 96da09192c ("btrfs: Introduce new mount option to disable tree log replay")
CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-03-28 18:10:27 +01:00
Filipe Manana
4ea748e1d2 Btrfs: fix deadlock between clone/dedupe and rename
Reflinking (clone/dedupe) and rename are operations that operate on two
inodes and therefore need to lock them in the same order to avoid ABBA
deadlocks. It happens that Btrfs' reflink implementation always locked
them in a different order from VFS's lock_two_nondirectories() helper,
which is used by the rename code in VFS, resulting in ABBA type deadlocks.

Btrfs' locking order:

  static void btrfs_double_inode_lock(struct inode *inode1, struct inode *inode2)
  {
         if (inode1 < inode2)
                swap(inode1, inode2);

         inode_lock_nested(inode1, I_MUTEX_PARENT);
         inode_lock_nested(inode2, I_MUTEX_CHILD);
  }

VFS's locking order:

  void lock_two_nondirectories(struct inode *inode1, struct inode *inode2)
  {
        if (inode1 > inode2)
                swap(inode1, inode2);

        if (inode1 && !S_ISDIR(inode1->i_mode))
                inode_lock(inode1);
        if (inode2 && !S_ISDIR(inode2->i_mode) && inode2 != inode1)
                inode_lock_nested(inode2, I_MUTEX_NONDIR2);
}

Fix this by killing the btrfs helper function that does the double inode
locking and replace it with VFS's helper lock_two_nondirectories().

Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Fixes: 416161db9b ("btrfs: offline dedupe")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-27 12:24:16 +01:00
Filipe Manana
57a50e2506 Btrfs: remove no longer needed range length checks for deduplication
Comparing the content of the pages in the range to deduplicate is now
done in generic_remap_checks called by the generic helper
generic_remap_file_range_prep(), which takes care of ensuring we do not
compare/deduplicate undefined data beyond a file's EOF (range from EOF
to the next block boundary). So remove these checks which are now
redundant.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:40 +01:00
Anand Jain
09ba3bc9dd btrfs: merge btrfs_find_device and find_device
Both btrfs_find_device() and find_device() does the same thing except
that the latter does not take the seed device onto account in the device
scanning context. We can merge them.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:24 +01:00
Anand Jain
e4319cd9ca btrfs: refactor btrfs_find_device() take fs_devices as argument
btrfs_find_device() accepts fs_info as an argument and retrieves
fs_devices from fs_info.

Instead use fs_devices, so that this function can be used in non-mount
(during device scanning) context as well.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:23 +01:00
Filipe Manana
500710d3b8 Btrfs: move duplicated nodatasum check into common reflink/dedupe helper
Move the check that verifies if both inodes have checksums disabled or
both have them enabled, from the clone and deduplication functions into
the new common helper btrfs_remap_file_range_prep().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:21 +01:00
Filipe Manana
d00c2d9c76 Btrfs: do not overwrite error return value in the balance ioctl
If the call to btrfs_balance() failed we would overwrite the error
returned to user space with -EFAULT if the call to copy_to_user() failed
as well. Fix that by calling copy_to_user() only if btrfs_balance()
returned success or was canceled.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:21 +01:00
Filipe Manana
d3a53286c1 Btrfs: do not overwrite error return value in the device replace ioctl
If the call to btrfs_dev_replace_by_ioctl() failed we would overwrite the
error returned to user space with -EFAULT if the call to copy_to_user()
failed as well. Fix that by calling copy_to_user() only if no error
happened before or a device replace operation was canceled.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:20 +01:00
Filipe Manana
0f39b60563 Btrfs: remove redundant check for swapfiles when reflinking
Checking if either of the inodes corresponds to a swapfile is already
performed by generic_remap_file_range_prep(), so we do not need to do
it in the btrfs clone and deduplication functions.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:20 +01:00
YueHaibing
aa704d4e75 btrfs: remove set but not used variable 'num_pages'
Fixes gcc '-Wunused-but-set-variable' warning:

fs/btrfs/ioctl.c: In function 'btrfs_extent_same':
fs/btrfs/ioctl.c:3260:6: warning:
 variable 'num_pages' set but not used [-Wunused-but-set-variable]

It not used any more since commit 9ee8234e6220 ("Btrfs: use
generic_remap_file_range_prep() for cloning and deduplication")

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:18 +01:00