korg/linux

mirror of https://mirrors.bfsu.edu.cn/git/linux.git synced 2024-12-03 09:04:21 +08:00

Author	SHA1	Message	Date
Qu Wenruo	eb2fad3005	btrfs: scrub: remove scrub_sector::page and use scrub_block::pages instead Although scrub currently works for subpage (PAGE_SIZE > sectorsize) cases, it will allocate one page for each scrub_sector, which can cause extra unnecessary memory usage. Utilize scrub_block::pages[] instead of allocating a page for each scrub_sector; this allows us to integrate larger extents while using less memory. For example, if our page size is 64K, sectorsize is 4K, and we get a 32K sized extent, we will only allocate one page for scrub_block, and all 8 scrub sectors will point to that page. To do that properly, here we introduce several small helpers: - scrub_page_get_logical() Get the logical bytenr of a page. We store the logical bytenr of the page range into page::private. But for 32bit systems, their (void *) is not large enough to contain a u64, so in that case we will need to allocate extra memory for it. For 64bit systems, we can use page::private directly. - scrub_block_get_logical() Just get the logical bytenr of the first page. - scrub_sector_get_page() Return the page which the scrub_sector points to. - scrub_sector_get_page_offset() Return the offset inside the page which the scrub_sector points to. - scrub_sector_get_kaddr() Return the address which the scrub_sector points to. Just a wrapper using scrub_sector_get_page() and scrub_sector_get_page_offset() - bio_add_scrub_sector() Please note that, even with this patch, we're still allocating one page for one sector for data extents. This is because in scrub_extent() we split the data extent using sectorsize. The memory usage reduction will need extra work to make scrub work like data read and only use the correct sector(s). Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:55 +02:00
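The arithmetic behind the 64K page / 4K sector example in the commit above can be sketched in userspace C. This is only an illustration under stated assumptions: the `demo_*` names and constants are hypothetical and are not the kernel's helpers, and the sketch assumes each sector is identified by its index within the block.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical values matching the example in the commit message. */
#define DEMO_PAGE_SIZE   (64u * 1024u)  /* 64K pages */
#define DEMO_SECTORSIZE  (4u * 1024u)   /* 4K sectors */

/* Index of the page (within the block's page array) backing a sector. */
static inline uint32_t demo_sector_page_index(uint32_t sector_nr)
{
	return (sector_nr * DEMO_SECTORSIZE) / DEMO_PAGE_SIZE;
}

/* Byte offset of the sector inside that page. */
static inline uint32_t demo_sector_page_offset(uint32_t sector_nr)
{
	return (sector_nr * DEMO_SECTORSIZE) % DEMO_PAGE_SIZE;
}

/* Number of pages needed to back an extent of the given length. */
static inline uint32_t demo_pages_needed(uint32_t extent_len)
{
	assert(extent_len > 0);
	return (extent_len + DEMO_PAGE_SIZE - 1) / DEMO_PAGE_SIZE;
}
```

With these assumed sizes, a 32K extent needs one page, and all eight sectors (indices 0 to 7) map to page index 0 at offsets that are multiples of 4K.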
Qu Wenruo	f3e01e0e3c	btrfs: scrub: introduce scrub_block::pages for more efficient memory usage for subpage [BACKGROUND] Currently for scrub, we allocate one page for one sector, this is fine for PAGE_SIZE == sectorsize support, but can waste extra memory for subpage support. [CODE CHANGE] Make scrub_block contain all the pages, so if we're scrubbing an extent sized 64K, and our page size is also 64K, we only need to allocate one page. [LIFESPAN CHANGE] Since now scrub_sector no longer holds a page, but is using scrub_block::pages[] instead, we have to ensure scrub_block has a longer lifespan for write bio. The lifespan for read bio is already large enough. Now scrub_block will only be released after the write bio finished. [COMING NEXT] Currently we only added scrub_block::pages[] for this purpose, but scrub_sector is still utilizing the old scrub_sector::page. The switch will happen in the next patch. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:55 +02:00
Qu Wenruo	5dd3d8e468	btrfs: scrub: factor out allocation and initialization of scrub_sector into helper The allocation and initialization is shared by 3 call sites, and we're going to change the initialization of some members in the upcoming patches. So factor out the allocation and initialization of scrub_sector into a helper, alloc_scrub_sector(), which will do the following work: - Allocate the memory for scrub_sector - Allocate a page for scrub_sector::page - Initialize scrub_sector::refs to 1 - Attach the allocated scrub_sector to scrub_block The attachment is bidirectional, which means scrub_block::sectorv[] will be updated and scrub_sector::sblock will also be updated. - Update scrub_block::sector_count and do extra sanity check on it Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:55 +02:00
Qu Wenruo	15b88f6d24	btrfs: scrub: factor out initialization of scrub_block into helper Although there are only two callers, we are going to add some members for scrub_block in the upcoming patches. Factoring out the initialization code will make later expansion easier. One thing to note is, even though scrub_handle_errored_block() doesn't utilize scrub_block::refs, we still use alloc_scrub_block() to initialize sblock->refs, allowing us to use scrub_block_put() to do cleanup. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:55 +02:00
Qu Wenruo	1dfa500511	btrfs: scrub: use pointer array to replace sblocks_for_recheck In function scrub_handle_errored_block(), we use the @sblocks_for_recheck pointer to hold one scrub_block for each mirror, and use kcalloc() to allocate an array. But this single pointer to an array is hard to read, because the members are reached by pointer addition and not by []. Change this pointer to struct scrub_block *[BTRFS_MAX_MIRRORS]; this will slightly increase the stack memory usage. Since function scrub_handle_errored_block() won't get iterative calls, this extra cost is completely acceptable. And since we're here, also set sblock->refs and use scrub_block_put() to clean them up, as later we will add extra members in scrub_block, which need scrub_block_put() to clean them up. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:55 +02:00
Boris Burkov	38622010a6	btrfs: send: add support for fs-verity Preserve the fs-verity status of a btrfs file across send/recv. There is no facility for installing the Merkle tree contents directly on the receiving filesystem, so we package up the parameters used to enable verity found in the verity descriptor. This gives the receive side enough information to properly enable verity again. Note that this means that receive will have to re-compute the whole Merkle tree, similar to how compression worked before encoded_write. Since the file becomes read-only after verity is enabled, it is important that verity is added to the send stream after any file writes. Therefore, when we process a verity item, merely note that it happened, then actually create the command in the send stream during 'finish_inode_if_needed'. This also creates V3 of the send stream format, without any format changes besides adding the new commands and attributes. Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:55 +02:00
Uros Bizjak	e5677f0560	btrfs: use atomic_try_cmpxchg in free_extent_buffer Use `atomic_try_cmpxchg(ptr, &old, new)` instead of `atomic_cmpxchg(ptr, old, new) == old` in free_extent_buffer. This has two benefits: - The x86 cmpxchg instruction returns success in the ZF flag, so this change saves a compare after cmpxchg, as well as a related move instruction in the front of cmpxchg. - atomic_try_cmpxchg implicitly assigns the *ptr value to &old when cmpxchg fails, enabling further code simplifications. This patch has no functional change. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:55 +02:00
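The difference between the two compare-and-exchange styles described above can be sketched in userspace with C11 atomics, where `atomic_compare_exchange_strong()` behaves like the try variant: it returns a success flag and, on failure, refreshes the expected value. This is a hedged illustration, not the kernel's `free_extent_buffer()`; the function name and the "floor" rule are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Decrement *refs unless that would take it below `floor`.
 * Returns true if the decrement happened. (Hypothetical helper.)
 */
static bool demo_dec_unless_at_floor(atomic_int *refs, int floor)
{
	int old;

	assert(refs != NULL);
	old = atomic_load(refs);

	while (old > floor) {
		/*
		 * On success the call returns true. On failure it writes
		 * the current value of *refs into `old`, so no separate
		 * re-read or "== old" comparison is needed.
		 */
		if (atomic_compare_exchange_strong(refs, &old, old - 1))
			return true;
	}
	return false;
}
```

The loop body needs neither an explicit reload nor a comparison of the returned value against `old`, which is the kind of simplification the commit message refers to.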
Qu Wenruo	fc65bb5318	btrfs: scrub: remove impossible sanity checks There are several sanity checks which are no longer possible to trigger inside btrfs_scrub_dev(). This is because we have a mount time check against super block nodesize/sectorsize, and our fixed macro is hardcoded to handle even the worst combination. Thus those sanity checks are no longer needed and can be easily removed. But this patch still uses some ASSERT()s as a safety net just in case we change some features in the future to trigger those impossible combinations. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:55 +02:00
Josef Bacik	527c490f44	btrfs: delete btrfs_wait_space_cache_v1_finished We used to use this in a few spots, but now we only use it directly inside of block-group.c, so remove the helper and just open code where we were using it. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:55 +02:00
Josef Bacik	588a486835	btrfs: remove lock protection for BLOCK_GROUP_FLAG_RELOCATING_REPAIR Before when this was modifying the bit field we had to protect it with the bg->lock, however now we're using bit helpers so we can stop using the bg->lock. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:54 +02:00
Josef Bacik	7b9c293b05	btrfs: remove BLOCK_GROUP_FLAG_HAS_CACHING_CTL This is used mostly to determine if we need to look at the caching ctl list and clean up any references to this block group. However we never clear this flag, specifically because we need to know if we have to remove a caching ctl we have for this block group still. This is in the remove block group path which isn't a fast path, so the optimization doesn't really matter, simplify this logic and remove the flag. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:54 +02:00
Josef Bacik	50c31eaa4c	btrfs: simplify block group traversal in btrfs_put_block_group_cache We're breaking out and re-searching for the next block group while evicting any of the block group cache inodes. This is not needed, the block groups aren't disappearing here, we can simply loop through the block groups like normal and iput any inode that we find. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:54 +02:00
Josef Bacik	9283b9e09a	btrfs: remove lock protection for BLOCK_GROUP_FLAG_TO_COPY We use this during device replace for zoned devices, we were simply taking the lock because it was in a bit field and we needed the lock to be safe with other modifications in the bitfield. With the bit helpers we no longer require that locking. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:54 +02:00
Josef Bacik	3349b57fd4	btrfs: convert block group bit field to use bit helpers We use a bit field in the btrfs_block_group for different flags, however this is awkward because we have to hold the block_group->lock for any modification of any of these fields, which makes the code clunky for a few of these flags. Convert these to a proper flags setup so we can utilize the bit helpers. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:54 +02:00
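Why a flags word with atomic bit helpers avoids the lock that a C bit field needs can be sketched in userspace. Adjacent bit-field members share a storage unit, so updating one is a read-modify-write of the whole unit; an atomic OR/AND on a single flags word touches only the requested bit. The names below are hypothetical stand-ins written with C11 atomics, not the kernel's `set_bit()`/`test_bit()` or the real block group structure.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag numbers, one bit each in a single flags word. */
enum demo_bg_flag {
	DEMO_BG_FLAG_REMOVED,
	DEMO_BG_FLAG_TO_COPY,
	DEMO_BG_FLAG_NR,
};

struct demo_block_group {
	atomic_ulong runtime_flags;
};

/* Atomically set one bit; no external lock is required. */
static void demo_set_bit(int nr, atomic_ulong *addr)
{
	assert(nr >= 0 && nr < DEMO_BG_FLAG_NR);
	atomic_fetch_or(addr, 1UL << nr);
}

/* Atomically clear one bit. */
static void demo_clear_bit(int nr, atomic_ulong *addr)
{
	assert(nr >= 0 && nr < DEMO_BG_FLAG_NR);
	atomic_fetch_and(addr, ~(1UL << nr));
}

/* Read one bit. */
static bool demo_test_bit(int nr, atomic_ulong *addr)
{
	assert(nr >= 0 && nr < DEMO_BG_FLAG_NR);
	return (atomic_load(addr) & (1UL << nr)) != 0;
}
```

Setting or clearing one flag leaves the other bits untouched even when several threads update different flags concurrently, which is what lets the follow-up commits drop the lock around individual flags.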
Josef Bacik	723de71d41	btrfs: handle space_info setting of bg in btrfs_add_bg_to_space_info We previously had the pattern of btrfs_update_space_info(all, the, bg, fields, &space_info); link_block_group(bg); bg->space_info = space_info; Now that we're passing the bg into btrfs_add_bg_to_space_info we can do the linking in that function, transforming this to simply btrfs_add_bg_to_space_info(fs_info, bg); and put the link_block_group() and bg->space_info assignment directly in btrfs_add_bg_to_space_info. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:54 +02:00
Josef Bacik	9d4b0a129a	btrfs: simplify arguments of btrfs_update_space_info and rename This function has grown a bunch of new arguments, and it just boils down to passing in all the block group fields as arguments. Simplify this by passing in the block group itself and updating the space_info fields based on the block group fields directly. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:54 +02:00
Josef Bacik	2f12741f81	btrfs: use btrfs_fs_closing for background bg work For both unused bg deletion and async balance work we'll happily run if the fs is closing. However I want to move these to their own worker thread, and they can be long running jobs, so add a check to see if we're closing and simply bail. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:54 +02:00
Omar Sandoval	d1f68ba069	btrfs: rename btrfs_insert_file_extent() to btrfs_insert_hole_extent() btrfs_insert_file_extent() is only ever used to insert holes, so rename it and remove the redundant parameters. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:54 +02:00
David Sterba	7f298f224e	btrfs: sysfs: use sysfs_streq for string matching We have own string matching helper that duplicates what sysfs_streq does, with a slight difference that it skips initial whitespace. So far this is used for the drive allocation policy. The initial whitespace of written sysfs values should be rather discouraged and we should use a standard helper. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:53 +02:00
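The matching rule that the commit above switches to can be sketched as a userspace function. This is a hedged reimplementation for illustration, assuming the commonly documented behaviour of `sysfs_streq()` (two strings compare equal if they are identical, or identical apart from one trailing newline); `demo_sysfs_streq` is a hypothetical name and the test strings are made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true if s1 and s2 are equal, treating a single trailing
 * newline on either string as equivalent to the terminating NUL.
 * Leading whitespace is NOT skipped.
 */
static bool demo_sysfs_streq(const char *s1, const char *s2)
{
	assert(s1 != NULL && s2 != NULL);

	/* Advance over the common prefix. */
	while (*s1 && *s1 == *s2) {
		s1++;
		s2++;
	}

	if (*s1 == *s2)
		return true;            /* both ended at the same place */
	if (!*s1 && *s2 == '\n' && !s2[1])
		return true;            /* s2 has one extra trailing newline */
	if (*s1 == '\n' && !s1[1] && !*s2)
		return true;            /* s1 has one extra trailing newline */
	return false;
}
```

Under this rule a value written with `echo` (which appends a newline) still matches, while a value with leading whitespace does not, which is the behavioural difference the commit message mentions.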
Qu Wenruo	f9eab5f0bb	btrfs: scrub: try to fix super block errors [BUG] The following script shows that, although scrub can detect super block errors, it never tries to fix them: mkfs.btrfs -f -d raid1 -m raid1 $dev1 $dev2 xfs_io -c "pwrite 67108864 4k" $dev2 mount $dev1 $mnt btrfs scrub start -B $dev2 btrfs scrub start -Br $dev2 umount $mnt The first scrub reports the super error correctly: scrub done for f3289218-abd3-41ac-a630-202f766c0859 Scrub started: Tue Aug 2 14:44:11 2022 Status: finished Duration: 0:00:00 Total to scrub: 1.26GiB Rate: 0.00B/s Error summary: super=1 Corrected: 0 Uncorrectable: 0 Unverified: 0 But the second read-only scrub still reports the same super error: Scrub started: Tue Aug 2 14:44:11 2022 Status: finished Duration: 0:00:00 Total to scrub: 1.26GiB Rate: 0.00B/s Error summary: super=1 Corrected: 0 Uncorrectable: 0 Unverified: 0 [CAUSE] The comment already shows that a super block can be easily fixed by committing a transaction: /* * If we find an error in a super block, we just report it. * They will get written with the next transaction commit * anyway */ But the truth is, such an assumption is not always true, and since scrub should try to repair every error it finds (except for read-only scrub), we should really actively commit a transaction to fix this. [FIX] Just commit a transaction if we found any super block errors, after everything else is done. We cannot do this just after scrub_supers(), as btrfs_commit_transaction() will try to pause and wait for the running scrub, thus we can not call it with scrub_lock held. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:53 +02:00
Qu Wenruo	e69bf81c9a	btrfs: scrub: properly report super block errors in system log [PROBLEM] Unlike data/metadata corruption, if scrub detected some error in the super block, the only error message is from the updated device status: BTRFS info (device dm-1): scrub: started on devid 2 BTRFS error (device dm-1): bdev /dev/mapper/test-scratch2 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0 BTRFS info (device dm-1): scrub: finished on devid 2 with status: 0 This is not helpful at all. [CAUSE] Unlike data/metadata error reporting, there is no visible report in kernel dmesg to report super block errors. In fact, the return value of scrub_checksum_super() is intentionally skipped, thus scrub_handle_errored_block() will never be called for super blocks. [FIX] Make super block errors output an error message; now the full dmesg would look like this: BTRFS info (device dm-1): scrub: started on devid 2 BTRFS warning (device dm-1): super block error on device /dev/mapper/test-scratch2, physical 67108864 BTRFS error (device dm-1): bdev /dev/mapper/test-scratch2 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0 BTRFS info (device dm-1): scrub: finished on devid 2 with status: 0 BTRFS info (device dm-1): scrub: started on devid 2 This fix involves: - Move the super_errors reporting to scrub_handle_errored_block() This allows the device status message to show after the super block error message. But now we no longer distinguish super block corruption and generation mismatch, now all counted as corruption. - Properly check the return value from scrub_checksum_super() - Add extra super block error reporting for scrub_print_warning(). Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:53 +02:00
Alexander Zhu	b0c582233a	btrfs: fix alignment of VMA for memory mapped files on THP With CONFIG_READ_ONLY_THP_FOR_FS, the Linux kernel supports using THPs for read-only mmapped files, such as shared libraries. However, the kernel makes no attempt to actually align those mappings on 2MB boundaries, which makes it impossible to use those THPs most of the time. This issue applies to general file mapping THP as well as existing setups using CONFIG_READ_ONLY_THP_FOR_FS. This is easily fixed by using thp_get_unmapped_area for the unmapped_area function in btrfs, which is what ext2, ext4, fuse, and xfs all use. Initially btrfs had been left out in commit 8c07fc452ac0 ("btrfs: fix alignment of VMA for memory mapped files on THP") as btrfs does not support DAX. However, commit `1854bc6e24` ("mm/readahead: Align file mappings for non-DAX") removed the DAX requirement. We should now be able to call thp_get_unmapped_area() for btrfs. The problem can be seen in /proc/PID/smaps where THPeligible is set to 0 on mappings to eligible shared object files as shown below. Before this patch: 7fc6a7e18000-7fc6a80cc000 r-xp 00000000 00:1e 199856 /usr/lib64/libcrypto.so.1.1.1k Size: 2768 kB THPeligible: 0 VmFlags: rd ex mr mw me With this patch the library is mapped at a 2MB aligned address: fbdfe200000-7fbdfe4b4000 r-xp 00000000 00:1e 199856 /usr/lib64/libcrypto.so.1.1.1k Size: 2768 kB THPeligible: 1 VmFlags: rd ex mr mw me This fixes the alignment of VMAs for any mmap of a file that has the rd and ex permissions and size >= 2MB. The VMA alignment and THPeligible field for anonymous memory are handled separately and are thus not affected by this change. CC: stable@vger.kernel.org # 5.18+ Signed-off-by: Alexander Zhu <alexlzhu@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:53 +02:00
Ioannis Angelakopoulos	5f4403e10f	btrfs: add lockdep annotations for the ordered extents wait event This wait event is very similar to the pending ordered wait event in the sense that it occurs in a different context than the condition signaling for the event. The signaling occurs in btrfs_remove_ordered_extent() while the wait event is implemented in btrfs_start_ordered_extent() in fs/btrfs/ordered-data.c However, in this case a thread must not acquire the lockdep map for the ordered extents wait event when the ordered extent is related to a free space inode. That is because lockdep creates dependencies between locks acquired both in execution paths related to normal inodes and paths related to free space inodes, thus leading to false positives. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:53 +02:00
Ioannis Angelakopoulos	9d7464c87b	btrfs: change the lockdep class of free space inode's invalidate_lock Reinitialize the class of the lockdep map for struct inode's mapping->invalidate_lock in load_free_space_cache() function in fs/btrfs/free-space-cache.c. This will prevent lockdep from producing false positives related to execution paths that make use of free space inodes and paths that make use of normal inodes. Specifically, with this change lockdep will create separate lock dependencies that include the invalidate_lock, in the case that free space inodes are used and in the case that normal inodes are used. The lockdep class for this lock was first initialized in inode_init_always() in fs/inode.c. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:53 +02:00
Ioannis Angelakopoulos	8b53779eaa	btrfs: add lockdep annotations for pending_ordered wait event In contrast to the num_writers and num_extwriters wait events, the condition for the pending ordered wait event is signaled in a different context from the wait event itself. The condition signaling occurs in btrfs_remove_ordered_extent() in fs/btrfs/ordered-data.c while the wait event is implemented in btrfs_commit_transaction() in fs/btrfs/transaction.c Thus the thread signaling the condition has to acquire the lockdep map as a reader at the start of btrfs_remove_ordered_extent() and release it after it has signaled the condition. In this case some dependencies might be left out due to the placement of the annotation, but it is better than no annotation at all. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:53 +02:00
Ioannis Angelakopoulos	3e738c531a	btrfs: add lockdep annotations for transaction states wait events Add lockdep annotations for the transaction states that have wait events; 1) TRANS_STATE_COMMIT_START 2) TRANS_STATE_UNBLOCKED 3) TRANS_STATE_SUPER_COMMITTED 4) TRANS_STATE_COMPLETED The new macros introduced here to annotate the transaction states wait events have the same effect as the generic lockdep annotation macros. With the exception of the lockdep annotation for TRANS_STATE_COMMIT_START the transaction thread has to acquire the lockdep maps for the transaction states as reader after the lockdep map for num_writers is released so that lockdep does not complain. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:53 +02:00
Ioannis Angelakopoulos	5a9ba6709f	btrfs: add lockdep annotations for num_extwriters wait event Similarly to the num_writers wait event in fs/btrfs/transaction.c add a lockdep annotation for the num_extwriters wait event. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:53 +02:00
Ioannis Angelakopoulos	e1489b4fe6	btrfs: add lockdep annotations for num_writers wait event Annotate the num_writers wait event in fs/btrfs/transaction.c with lockdep in order to catch deadlocks involving this wait event. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:53 +02:00
Ioannis Angelakopoulos	ab9a323f9a	btrfs: add macros for annotating wait events with lockdep Introduce four macros that are used to annotate wait events in btrfs code with lockdep: 1) the btrfs_lockdep_init_map, 2) the btrfs_lockdep_acquire, 3) the btrfs_lockdep_release, and 4) the btrfs_might_wait_for_event macros. The btrfs_lockdep_init_map macro is used to initialize a lockdep map. The btrfs_lockdep_<acquire,release> macros are used by threads to take the lockdep map as readers (shared lock) and release it, respectively. The btrfs_might_wait_for_event macro is used by threads to take the lockdep map as writers (exclusive lock) and release it. In general, the lockdep annotations for wait events work as follows: The condition for a wait event can be modified and signaled at the same time by multiple threads. These threads hold the lockdep map as readers when they enter a context in which blocking would prevent signaling the condition. Frequently, this occurs when a thread violates a condition (lockdep map acquire), before restoring it and signaling it at a later point (lockdep map release). The threads that block on the wait event take the lockdep map as writers (exclusive lock). These threads have to block until all the threads that hold the lockdep map as readers signal the condition for the wait event and release the lockdep map. The lockdep annotation is used to warn about potential deadlock scenarios that involve the threads that modify and signal the wait event condition and threads that block on the wait event.
A simple example is illustrated below: Without lockdep: TA TB cond = false lock(A) wait_event(w, cond) unlock(A) lock(A) cond = true signal(w) unlock(A) With lockdep: TA TB rwsem_acquire_read(lockdep_map) cond = false lock(A) rwsem_acquire(lockdep_map) rwsem_release(lockdep_map) wait_event(w, cond) unlock(A) lock(A) cond = true signal(w) unlock(A) rwsem_release(lockdep_map) In the second case, with the lockdep annotation, lockdep would warn about an ABBA deadlock, while the first case would just deadlock at some point. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:52 +02:00
Qu Wenruo	62cd9d4474	btrfs: dump extra info if one free space cache has more bitmaps than it should There is an internal report on hitting the following ASSERT() in recalculate_thresholds(): ASSERT(ctl->total_bitmaps <= max_bitmaps); Above @max_bitmaps is calculated using the following variables: - bytes_per_bg 8 * 4096 * 4096 (128M) for x86_64/x86. - block_group->length The length of the block group. @max_bitmaps is the rounded up value of block_group->length / 128M. Normally one free space cache should not have more bitmaps than above value, but when it happens the ASSERT() can be triggered if CONFIG_BTRFS_ASSERT is also enabled. But the ASSERT() itself won't provide enough info to know which is going wrong. Is the bg too small thus it only allows one bitmap? Or is there something else wrong? So although I haven't found extra reports or crash dump to do further investigation, add the extra info to make it more helpful to debug. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-26 12:27:52 +02:00
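The @max_bitmaps arithmetic described in the commit above can be worked through in a few lines of C. This is a minimal sketch that follows only what the message states (128M covered per bitmap on x86_64/x86, and a rounded-up division of the block group length); the `demo_*` names are hypothetical and any extra clamping the kernel may apply is not modelled.

```c
#include <assert.h>
#include <stdint.h>

/* 8 bits per byte * 4096-byte bitmap page * 4096-byte unit = 128M. */
#define DEMO_BYTES_PER_BITMAP (8ULL * 4096ULL * 4096ULL)

/* Rounded-up value of bg_length / 128M, as described in the message. */
static uint64_t demo_max_bitmaps(uint64_t bg_length)
{
	/* Guard against overflow in the round-up addition. */
	assert(bg_length <= UINT64_MAX - DEMO_BYTES_PER_BITMAP);
	return (bg_length + DEMO_BYTES_PER_BITMAP - 1) / DEMO_BYTES_PER_BITMAP;
}
```

For instance, under these assumptions a 1GiB block group allows 8 bitmaps, while a block group of 128M or less allows only one, which is the "bg too small" case the message asks about.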
Miklos Szeredi	863f144f12	vfs: open inside ->tmpfile() This is in preparation for adding tmpfile support to fuse, which requires that the tmpfile creation and opening are done as a single operation. Replace the 'struct dentry ' argument of i_op->tmpfile with 'struct file '. Call finish_open_simple() as the last thing in ->tmpfile() instances (may be omitted in the error case). Change d_tmpfile() argument to 'struct file *' as well to make callers more readable. Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2022-09-24 07:00:00 +02:00
Linus Torvalds	60891ec99e	Merge tag 'for-6.0-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - two fixes for hangs in the umount sequence where threads depend on each other and the work must be finished in the right order - in zoned mode, wait for flushing all block group metadata IO before finishing the zone * tag 'for-6.0-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: zoned: wait for extent buffer IOs before finishing a zone btrfs: fix hang during unmount when stopping a space reclaim worker btrfs: fix hang during unmount when stopping block group reclaim worker	2022-09-20 10:23:24 -07:00
Christoph Hellwig	4088a47e78	btrfs: add manual PSI accounting for compressed reads btrfs compressed reads try to always read the entire compressed chunk, even if only a subset is requested. Currently this is covered by the magic PSI accounting underneath submit_bio, but that is about to go away. Instead add manual psi_memstall_{enter,leave} annotations. Note that for readahead this really should be using readahead_expand, but the additional reads are also done for plain ->read_folio where readahead_expand can't work, so this overall logic is left as-is for now. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: David Sterba <dsterba@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/r/20220915094200.139713-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-20 08:24:38 -06:00
Naohiro Aota	2dd7e7bc02	btrfs: zoned: wait for extent buffer IOs before finishing a zone Before sending REQ_OP_ZONE_FINISH to a zone, we need to ensure that ongoing IOs already finished. Or, we will see a "Zone Is Full" error for the IOs, as the ZONE_FINISH command makes the zone full. We ensure that with btrfs_wait_block_group_reservations() and btrfs_wait_ordered_roots() for a data block group. And, for a metadata block group, the comparison of alloc_offset vs meta_write_pointer mostly ensures IOs for the allocated region already sent. However, there still can be a little time frame where the IOs are sent but not yet completed. Introduce wait_eb_writebacks() to ensure such IOs are completed for a metadata block group. It walks the buffer_radix to find extent buffers in the block group and calls wait_on_extent_buffer_writeback() on them. Fixes: `afba2bc036` ("btrfs: zoned: implement active zone tracking") CC: stable@vger.kernel.org # 5.19+ Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-13 14:05:18 +02:00
Filipe Manana	a362bb864b	btrfs: fix hang during unmount when stopping a space reclaim worker Often when running generic/562 from fstests we can hang during unmount, resulting in a trace like this: Sep 07 11:52:00 debian9 unknown: run fstests generic/562 at 2022-09-07 11:52:00 Sep 07 11:55:32 debian9 kernel: INFO: task umount:49438 blocked for more than 120 seconds. Sep 07 11:55:32 debian9 kernel: Not tainted 6.0.0-rc2-btrfs-next-122 #1 Sep 07 11:55:32 debian9 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Sep 07 11:55:32 debian9 kernel: task:umount state:D stack: 0 pid:49438 ppid: 25683 flags:0x00004000 Sep 07 11:55:32 debian9 kernel: Call Trace: Sep 07 11:55:32 debian9 kernel: <TASK> Sep 07 11:55:32 debian9 kernel: __schedule+0x3c8/0xec0 Sep 07 11:55:32 debian9 kernel: ? rcu_read_lock_sched_held+0x12/0x70 Sep 07 11:55:32 debian9 kernel: schedule+0x5d/0xf0 Sep 07 11:55:32 debian9 kernel: schedule_timeout+0xf1/0x130 Sep 07 11:55:32 debian9 kernel: ? lock_release+0x224/0x4a0 Sep 07 11:55:32 debian9 kernel: ? lock_acquired+0x1a0/0x420 Sep 07 11:55:32 debian9 kernel: ? trace_hardirqs_on+0x2c/0xd0 Sep 07 11:55:32 debian9 kernel: __wait_for_common+0xac/0x200 Sep 07 11:55:32 debian9 kernel: ? usleep_range_state+0xb0/0xb0 Sep 07 11:55:32 debian9 kernel: __flush_work+0x26d/0x530 Sep 07 11:55:32 debian9 kernel: ? flush_workqueue_prep_pwqs+0x140/0x140 Sep 07 11:55:32 debian9 kernel: ? trace_clock_local+0xc/0x30 Sep 07 11:55:32 debian9 kernel: __cancel_work_timer+0x11f/0x1b0 Sep 07 11:55:32 debian9 kernel: ? close_ctree+0x12b/0x5b3 [btrfs] Sep 07 11:55:32 debian9 kernel: ? __trace_bputs+0x10b/0x170 Sep 07 11:55:32 debian9 kernel: close_ctree+0x152/0x5b3 [btrfs] Sep 07 11:55:32 debian9 kernel: ? 
evict_inodes+0x166/0x1c0 Sep 07 11:55:32 debian9 kernel: generic_shutdown_super+0x71/0x120 Sep 07 11:55:32 debian9 kernel: kill_anon_super+0x14/0x30 Sep 07 11:55:32 debian9 kernel: btrfs_kill_super+0x12/0x20 [btrfs] Sep 07 11:55:32 debian9 kernel: deactivate_locked_super+0x2e/0xa0 Sep 07 11:55:32 debian9 kernel: cleanup_mnt+0x100/0x160 Sep 07 11:55:32 debian9 kernel: task_work_run+0x59/0xa0 Sep 07 11:55:32 debian9 kernel: exit_to_user_mode_prepare+0x1a6/0x1b0 Sep 07 11:55:32 debian9 kernel: syscall_exit_to_user_mode+0x16/0x40 Sep 07 11:55:32 debian9 kernel: do_syscall_64+0x48/0x90 Sep 07 11:55:32 debian9 kernel: entry_SYSCALL_64_after_hwframe+0x63/0xcd Sep 07 11:55:32 debian9 kernel: RIP: 0033:0x7fcde59a57a7 Sep 07 11:55:32 debian9 kernel: RSP: 002b:00007ffe914217c8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 Sep 07 11:55:32 debian9 kernel: RAX: 0000000000000000 RBX: 00007fcde5ae8264 RCX: 00007fcde59a57a7 Sep 07 11:55:32 debian9 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000055b57556cdd0 Sep 07 11:55:32 debian9 kernel: RBP: 000055b57556cba0 R08: 0000000000000000 R09: 00007ffe91420570 Sep 07 11:55:32 debian9 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 Sep 07 11:55:32 debian9 kernel: R13: 000055b57556cdd0 R14: 000055b57556ccb8 R15: 0000000000000000 Sep 07 11:55:32 debian9 kernel: </TASK> What happens is the following: 1) The cleaner kthread tries to start a transaction to delete an unused block group, but the metadata reservation can not be satisfied right away, so a reservation ticket is created and it starts the async metadata reclaim task (fs_info->async_reclaim_work); 2) Writeback for all the filler inodes with an i_size of 2K starts (generic/562 creates a lot of 2K files with the goal of filling metadata space). 
We try to create an inline extent for them, but we fail when trying to insert the inline extent with -ENOSPC (at cow_file_range_inline()) - since this is not critical, we fallback to non-inline mode (back to cow_file_range()), reserve extents, create extent maps and create the ordered extents; 3) An unmount starts, enters close_ctree(); 4) The async reclaim task is flushing stuff, entering the flush states one by one, until it reaches RUN_DELAYED_IPUTS. There it runs all current delayed iputs. After running the delayed iputs and before calling btrfs_wait_on_delayed_iputs(), one or more ordered extents complete, and btrfs_add_delayed_iput() is called for each one through btrfs_finish_ordered_io() -> btrfs_put_ordered_extent(). This results in bumping fs_info->nr_delayed_iputs from 0 to some positive value. So the async reclaim task blocks at btrfs_wait_on_delayed_iputs() waiting for fs_info->nr_delayed_iputs to become 0; 5) The current transaction is committed by the transaction kthread, we then start unpinning extents and end up calling btrfs_try_granting_tickets() through unpin_extent_range(), since we released some space. This results in satisfying the ticket created by the cleaner kthread at step 1, waking up the cleaner kthread; 6) At close_ctree() we ask the cleaner kthread to park; 7) The cleaner kthread starts the transaction, deletes the unused block group, and then calls kthread_should_park(), which returns true, so it parks. And at this point we have the delayed iputs added by the completion of the ordered extents still pending; 8) Then later at close_ctree(), when we call: cancel_work_sync(&fs_info->async_reclaim_work); We hang forever, since the cleaner was parked and no one else can run delayed iputs after that, while the reclaim task is waiting for the remaining delayed iputs to be completed. Fix this by waiting for all ordered extents to complete and running the delayed iputs before attempting to stop the async reclaim tasks. 
Note that we can not wait for ordered extents with btrfs_wait_ordered_roots() (or other similar functions) because that waits for the BTRFS_ORDERED_COMPLETE flag to be set on an ordered extent, but the delayed iput is added after that, when doing the final btrfs_put_ordered_extent(). So instead wait for the work queues used for executing ordered extent completion to be empty, which works because we do the final put on an ordered extent at btrfs_finish_ordered_io() (while we are in the unmount context). Fixes: `d6fd0ae25c` ("Btrfs: fix missing delayed iputs on unmount") CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-13 14:05:13 +02:00
Filipe Manana	8a1f1e3d1e	btrfs: fix hang during unmount when stopping block group reclaim worker During early unmount, at close_ctree(), we try to stop the block group reclaim task with cancel_work_sync(), but that may hang if the block group reclaim task is currently at btrfs_relocate_block_group() waiting for the flag BTRFS_FS_UNFINISHED_DROPS to be cleared from fs_info->flags. During unmount we only clear that flag later, after trying to stop the block group reclaim task. Fix that by clearing BTRFS_FS_UNFINISHED_DROPS before trying to stop the block group reclaim task and after setting BTRFS_FS_CLOSING_START, so that if the reclaim task is waiting on that bit, it will stop immediately after being woken, because it sees the filesystem is closing (with a call to btrfs_fs_closing()), and then returns immediately with -EINTR. Fixes: `31e70e5278` ("btrfs: fix hang during unmount when block group reclaim task is running") CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-13 14:05:08 +02:00
Vishal Moola (Oracle)	47d5541995	btrfs: convert process_page_range() to use filemap_get_folios_contig() Converted function to use folios throughout. This is in preparation for the removal of find_get_pages_contig(). Now also supports large folios. Since we may receive more than nr_pages pages, nr_pages may underflow. Since nr_pages > 0 is equivalent to index <= end_index, we replaced it with this check instead. Also minor comment renaming for consistency in subpage. Link: https://lkml.kernel.org/r/20220824004023.77310-5-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Acked-by: David Sterba <dsterb@suse.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Chris Mason <clm@fb.com> Cc: David Sterba <dsterba@suse.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-09-11 20:26:03 -07:00
Vishal Moola (Oracle)	a75b81c3f6	btrfs: convert end_compressed_writeback() to use filemap_get_folios() Converted function to use folios throughout. This is in preparation for the removal of find_get_pages_contig(). Now also supports large folios. Since we may receive more than nr_pages pages, nr_pages may underflow. Since nr_pages > 0 is equivalent to index <= end_index, we replaced it with this check instead. Also this function does not care about the pages being contiguous so we can just use filemap_get_folios() to be more efficient. Link: https://lkml.kernel.org/r/20220824004023.77310-4-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Acked-by: David Sterba <dsterba@suse.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Chris Mason <clm@fb.com> Cc: David Sterba <dsterb@suse.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-09-11 20:26:03 -07:00
Vishal Moola (Oracle)	04c6b79ae4	btrfs: convert __process_pages_contig() to use filemap_get_folios_contig() Convert to use folios throughout. This is in preparation for the removal of find_get_pages_contig(). Now also supports large folios. Since we may receive more than nr_pages pages, nr_pages may underflow. Since nr_pages > 0 is equivalent to index <= end_index, we replaced it with this check instead. Link: https://lkml.kernel.org/r/20220824004023.77310-3-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Acked-by: David Sterba <dsterba@suse.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Chris Mason <clm@fb.com> Cc: David Sterba <dsterb@suse.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-09-11 20:26:02 -07:00
Linus Torvalds	9b45094954	for-6.0-rc4-tag Merge tag 'for-6.0-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few more fixes to zoned mode and one regression fix for chunk limit: - Zoned mode fixes: - fix how wait/wake up is done when finishing zone - fix zone append limit in emulated mode - fix mount on devices with conventional zones - fix regression, user settable data chunk limit got accidentally lowered and causes allocation problems on some profiles (raid0, raid1)" * tag 'for-6.0-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix the max chunk size and stripe length calculation btrfs: zoned: fix mounting with conventional zones btrfs: zoned: set pseudo max append zone limit in zone emulation mode btrfs: zoned: fix API misuse of zone finish waiting	2022-09-09 07:54:19 -04:00
Qu Wenruo	5da431b71d	btrfs: fix the max chunk size and stripe length calculation [BEHAVIOR CHANGE] Since commit `f6fca3917b` ("btrfs: store chunk size in space-info struct"), btrfs no longer can create larger data chunks than 1G: mkfs.btrfs -f -m raid1 -d raid0 $dev1 $dev2 $dev3 $dev4 mount $dev1 $mnt btrfs balance start --full $mnt btrfs balance start --full $mnt umount $mnt btrfs ins dump-tree -t chunk $dev1 \| grep "DATA\|RAID0" -C 2 Before that offending commit, what we got is a 4G data chunk: item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 9492758528) itemoff 15491 itemsize 176 length 4294967296 owner 2 stripe_len 65536 type DATA\|RAID0 io_align 65536 io_width 65536 sector_size 4096 num_stripes 4 sub_stripes 1 Now what we got is only 1G data chunk: item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 6271533056) itemoff 15491 itemsize 176 length 1073741824 owner 2 stripe_len 65536 type DATA\|RAID0 io_align 65536 io_width 65536 sector_size 4096 num_stripes 4 sub_stripes 1 This will increase the number of data chunks by the number of devices, not only increase system chunk usage, but also greatly increase mount time. Without a proper reason, we should not change the max chunk size. [CAUSE] Previously, we set max data chunk size to 10G, while max data stripe length to 1G. Commit `f6fca3917b` ("btrfs: store chunk size in space-info struct") completely ignored the 10G limit, but use 1G max stripe limit instead, causing above shrink in max data chunk size. [FIX] Fix the max data chunk size to 10G, and in decide_stripe_size_regular() we limit stripe_size to 1G manually. This should only affect data chunks, as for metadata chunks we always set the max stripe size the same as max chunk size (256M or 1G depending on fs size). 
Now the same script produces the same old result: item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 9492758528) itemoff 15491 itemsize 176 length 4294967296 owner 2 stripe_len 65536 type DATA\|RAID0 io_align 65536 io_width 65536 sector_size 4096 num_stripes 4 sub_stripes 1 Reported-by: Wang Yugui <wangyugui@e16-tech.com> Fixes: `f6fca3917b` ("btrfs: store chunk size in space-info struct") Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-06 17:49:58 +02:00
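The 4G versus 1G chunk lengths reported in this commit can be reproduced with a small arithmetic sketch. This is a simplified model, not the kernel's decide_stripe_size_regular(): the name `model_chunk_size` and its parameters are hypothetical, rounding and alignment are ignored, and every device is assumed to have more than 1G free.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_1G (1ULL << 30)

/*
 * Hypothetical, simplified model of sizing a RAID0 data chunk.
 * The per-device stripe is capped at max_stripe_size, then shrunk further
 * if the resulting chunk would exceed max_chunk_size.
 */
static uint64_t model_chunk_size(uint64_t dev_free, uint64_t max_stripe_size,
                                 uint64_t max_chunk_size,
                                 unsigned int data_stripes)
{
    uint64_t stripe_size = dev_free;

    assert(data_stripes > 0);

    if (stripe_size > max_stripe_size)
        stripe_size = max_stripe_size;
    if (stripe_size * data_stripes > max_chunk_size)
        stripe_size = max_chunk_size / data_stripes;

    return stripe_size * data_stripes;
}
```

With four data stripes, a 1G stripe cap and a 10G chunk cap yield a 4294967296-byte chunk. If the chunk cap is (wrongly) 1G as well, the result is 1073741824 bytes. These are the two `length` values in the dump-tree output quoted in the commit message.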
Johannes Thumshirn	6ca64ac276	btrfs: zoned: fix mounting with conventional zones Since commit `6a921de589` ("btrfs: zoned: introduce space_info->active_total_bytes"), we're only counting the bytes of a block group on an active zone as usable for metadata writes. But on a SMR drive, we don't have active zones and short circuit some of the logic. This leads to an error on mount, because we cannot reserve space for metadata writes. Fix this by also setting the BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE bit in the block-group's runtime flag if the zone is a conventional zone. Fixes: `6a921de589` ("btrfs: zoned: introduce space_info->active_total_bytes") Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-05 15:39:10 +02:00
Shin'ichiro Kawasaki	cac5c44c48	btrfs: zoned: set pseudo max append zone limit in zone emulation mode The commit `7d7672bc5d` ("btrfs: convert count_max_extents() to use fs_info->max_extent_size") introduced a division by fs_info->max_extent_size. This max_extent_size is initialized with the max zone append limit of the device btrfs runs on. However, in zone emulation mode the device is not zoned, so its zone append limit is zero. This resulted in a zero value of fs_info->max_extent_size and caused a division-by-zero error. Fix the error by setting a non-zero pseudo value for the max zone append limit in zone emulation mode. Set the pseudo value based on max_segments as suggested in the commit `c2ae7b772e` ("btrfs: zoned: revive max_zone_append_bytes"). Fixes: `7d7672bc5d` ("btrfs: convert count_max_extents() to use fs_info->max_extent_size") CC: stable@vger.kernel.org # 5.12+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-05 15:32:26 +02:00
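A minimal sketch of why a zero limit is fatal and how a pseudo value avoids it. All names here are hypothetical, and the fallback formula (max_segments times the page size) is an assumption for illustration only, since the commit message says only that the value is "based on max_segments".

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model: pick the zone append limit, falling back to a
 * pseudo value when the device reports zero (zone emulation mode).
 * The fallback formula is an assumption, not the kernel's exact code.
 */
static uint64_t model_zone_append_limit(uint64_t hw_limit,
                                        uint32_t max_segments,
                                        uint32_t page_size)
{
    if (hw_limit != 0)
        return hw_limit;
    return (uint64_t)max_segments * page_size;
}

/* Ceiling division: a zero max_extent_size would divide by zero. */
static uint64_t model_count_max_extents(uint64_t size,
                                        uint64_t max_extent_size)
{
    assert(max_extent_size != 0);
    return (size + max_extent_size - 1) / max_extent_size;
}
```

For example, a reported limit of 0 with 128 segments of 4096 bytes gives a pseudo limit of 524288 bytes, so a 1 MiB range counts as 2 extents instead of trapping.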
Naohiro Aota	d5b81ced74	btrfs: zoned: fix API misuse of zone finish waiting The commit `2ce543f478` ("btrfs: zoned: wait until zone is finished when allocation didn't progress") implemented a zone finish waiting mechanism to the write path of zoned mode. However, using wait_var_event()/wake_up_all() on fs_info->zone_finish_wait is wrong and wait_var_event() just hangs because no one ever wakes it up once it goes into sleep. Instead, we can simply use wait_on_bit_io() and clear_and_wake_up_bit() on fs_info->flags with a proper barrier installed. Fixes: `2ce543f478` ("btrfs: zoned: wait until zone is finished when allocation didn't progress") CC: stable@vger.kernel.org # 5.16+ Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-09-05 15:32:21 +02:00
Linus Torvalds	8379c0b31f	for-6.0-rc3-tag Merge tag 'for-6.0-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Fixes: - check that subvolume is writable when changing xattrs from security namespace - fix memory leak in device lookup helper - update generation of hole file extent item when merging holes - fix space cache corruption and potential double allocations; this is a rare bug but can be serious once it happens, stable backports and analysis tool will be provided - fix error handling when deleting root references - fix crash due to assert when attempting to cancel suspended device replace, add message what to do if mount fails due to missing replace item Regressions: - don't merge pages into bio if their page offset is not contiguous - don't allow large NOWAIT direct reads, this could lead to short reads eg. 
in io_uring" * tag 'for-6.0-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: add info when mount fails due to stale replace target btrfs: replace: drop assert for suspended replace btrfs: fix silent failure when deleting root reference btrfs: fix space cache corruption and potential double allocations btrfs: don't allow large NOWAIT direct reads btrfs: don't merge pages into bio if their page offset is not contiguous btrfs: update generation of hole file extent item when merging holes btrfs: fix possible memory leak in btrfs_get_dev_args_from_path() btrfs: check if root is readonly while setting security xattr	2022-08-28 10:44:04 -07:00
Anand Jain	f2c3bec215	btrfs: add info when mount fails due to stale replace target If the replace target device reappears after the suspended replace is cancelled, it blocks the mount operation as it can't find the matching replace-item in the metadata. As shown below, BTRFS error (device sda5): replace devid present without an active replace item To overcome this situation, the user can run the command btrfs device scan --forget <replace target device> and try the mount command again. And also, to avoid repeating the issue, superblock on the devid=0 must be wiped. wipefs -a device-path-to-devid=0. This patch adds some info when this situation occurs. Reported-by: Samuel Greiner <samuel@balkonien.org> Link: https://lore.kernel.org/linux-btrfs/b4f62b10-b295-26ea-71f9-9a5c9299d42c@balkonien.org/T/ CC: stable@vger.kernel.org # 5.0+ Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-08-23 22:15:21 +02:00
Anand Jain	59a3991984	btrfs: replace: drop assert for suspended replace If the filesystem mounts with the replace operation in a suspended state and we try to cancel the suspended replace operation, we hit the assert. The assert came from the commit `fe97e2e173` ("btrfs: dev-replace: replace's scrub must not be running in suspended state") and was actually not required, so just remove it. $ mount /dev/sda5 /btrfs BTRFS info (device sda5): cannot continue dev_replace, tgtdev is missing BTRFS info (device sda5): you may cancel the operation after 'mount -o degraded' $ mount -o degraded /dev/sda5 /btrfs <-- success. $ btrfs replace cancel /btrfs kernel: assertion failed: ret != -ENOTCONN, in fs/btrfs/dev-replace.c:1131 kernel: ------------[ cut here ]------------ kernel: kernel BUG at fs/btrfs/ctree.h:3750! After the patch: $ btrfs replace cancel /btrfs BTRFS info (device sda5): suspended dev_replace from /dev/sda5 (devid 1) to <missing disk> canceled Fixes: `fe97e2e173` ("btrfs: dev-replace: replace's scrub must not be running in suspended state") CC: stable@vger.kernel.org # 5.0+ Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-08-23 22:15:21 +02:00
Filipe Manana	47bf225a8d	btrfs: fix silent failure when deleting root reference At btrfs_del_root_ref(), if btrfs_search_slot() returns an error, we end up returning from the function with a value of 0 (success). This happens because the function returns the value stored in the variable 'err', which is 0, while the error value we got from btrfs_search_slot() is stored in the 'ret' variable. So fix it by setting 'err' with the error value. Fixes: `8289ed9f93` ("btrfs: replace the BUG_ON in btrfs_del_root_ref with proper error handling") CC: stable@vger.kernel.org # 5.16+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-08-23 22:15:21 +02:00
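The bug pattern described in this commit (an error stored in `ret` while the function returns `err`) can be shown with a small stand-alone sketch. The functions below are hypothetical stand-ins, not the real btrfs_del_root_ref() or btrfs_search_slot().

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for a search that returns 0 or a negative errno. */
static int fake_search_slot(int fail)
{
    return fail ? -EIO : 0;
}

/* Buggy shape: the failure lands in 'ret', but 'err' is what gets returned. */
static int del_ref_buggy(int fail)
{
    int err = 0;
    int ret = fake_search_slot(fail);

    if (ret < 0)
        goto out;       /* 'err' is still 0, so the caller sees success */
    /* ... the item would be deleted here ... */
out:
    return err;
}

/* Fixed shape: propagate the error value into 'err' before bailing out. */
static int del_ref_fixed(int fail)
{
    int err = 0;
    int ret = fake_search_slot(fail);

    assert(ret <= 0);   /* the stand-in never returns a positive value */
    if (ret < 0) {
        err = ret;
        goto out;
    }
    /* ... the item would be deleted here ... */
out:
    return err;
}
```

Calling `del_ref_buggy(1)` returns 0 even though the search failed, while `del_ref_fixed(1)` returns `-EIO`.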
Omar Sandoval	ced8ecf026	btrfs: fix space cache corruption and potential double allocations When testing space_cache v2 on a large set of machines, we encountered a few symptoms: 1. "unable to add free space :-17" (EEXIST) errors. 2. Missing free space info items, sometimes caught with a "missing free space info for X" error. 3. Double-accounted space: ranges that were allocated in the extent tree and also marked as free in the free space tree, ranges that were marked as allocated twice in the extent tree, or ranges that were marked as free twice in the free space tree. If the latter made it onto disk, the next reboot would hit the BUG_ON() in add_new_free_space(). 4. On some hosts with no on-disk corruption or error messages, the in-memory space cache (dumped with drgn) disagreed with the free space tree. All of these symptoms have the same underlying cause: a race between caching the free space for a block group and returning free space to the in-memory space cache for pinned extents causes us to double-add a free range to the space cache. This race exists when free space is cached from the free space tree (space_cache=v2) or the extent tree (nospace_cache, or space_cache=v1 if the cache needs to be regenerated). struct btrfs_block_group::last_byte_to_unpin and struct btrfs_block_group::progress are supposed to protect against this race, but commit `d0c2f4fa55` ("btrfs: make concurrent fsyncs wait less when waiting for a transaction commit") subtly broke this by allowing multiple transactions to be unpinning extents at the same time. Specifically, the race is as follows: 1. An extent is deleted from an uncached block group in transaction A. 2. btrfs_commit_transaction() is called for transaction A. 3. btrfs_run_delayed_refs() -> __btrfs_free_extent() runs the delayed ref for the deleted extent. 4. __btrfs_free_extent() -> do_free_extent_accounting() -> add_to_free_space_tree() adds the deleted extent back to the free space tree. 5. 
do_free_extent_accounting() -> btrfs_update_block_group() -> btrfs_cache_block_group() queues up the block group to get cached. block_group->progress is set to block_group->start. 6. btrfs_commit_transaction() for transaction A calls switch_commit_roots(). It sets block_group->last_byte_to_unpin to block_group->progress, which is block_group->start because the block group hasn't been cached yet. 7. The caching thread gets to our block group. Since the commit roots were already switched, load_free_space_tree() sees the deleted extent as free and adds it to the space cache. It finishes caching and sets block_group->progress to U64_MAX. 8. btrfs_commit_transaction() advances transaction A to TRANS_STATE_SUPER_COMMITTED. 9. fsync calls btrfs_commit_transaction() for transaction B. Since transaction A is already in TRANS_STATE_SUPER_COMMITTED and the commit is for fsync, it advances. 10. btrfs_commit_transaction() for transaction B calls switch_commit_roots(). This time, the block group has already been cached, so it sets block_group->last_byte_to_unpin to U64_MAX. 11. btrfs_commit_transaction() for transaction A calls btrfs_finish_extent_commit(), which calls unpin_extent_range() for the deleted extent. It sees last_byte_to_unpin set to U64_MAX (by transaction B!), so it adds the deleted extent to the space cache again! This explains all of our symptoms above: * If the sequence of events is exactly as described above, when the free space is re-added in step 11, it will fail with EEXIST. * If another thread reallocates the deleted extent in between steps 7 and 11, then step 11 will silently re-add that space to the space cache as free even though it is actually allocated. Then, if that space is allocated again, the free space tree will be corrupted (namely, the wrong item will be deleted). * If we don't catch this free space tree corruption, it will continue to get worse as extents are deleted and reallocated. 
The v1 space_cache is synchronously loaded when an extent is deleted (btrfs_update_block_group() with alloc=0 calls btrfs_cache_block_group() with load_cache_only=1), so it is not normally affected by this bug. However, as noted above, if we fail to load the space cache, we will fall back to caching from the extent tree and may hit this bug. The easiest fix for this race is to also make caching from the free space tree or extent tree synchronous. Josef tested this and found no performance regressions. A few extra changes fall out of this change. Namely, this fix does the following, with step 2 being the crucial fix: 1. Factor btrfs_caching_ctl_wait_done() out of btrfs_wait_block_group_cache_done() to allow waiting on a caching_ctl that we already hold a reference to. 2. Change the call in btrfs_cache_block_group() of btrfs_wait_space_cache_v1_finished() to btrfs_caching_ctl_wait_done(), which makes us wait regardless of the space_cache option. 3. Delete the now unused btrfs_wait_space_cache_v1_finished() and space_cache_v1_done(). 4. Change btrfs_cache_block_group()'s `int load_cache_only` parameter to `bool wait` to more accurately describe its new meaning. 5. Change a few callers which had a separate call to btrfs_wait_block_group_cache_done() to use wait = true instead. 6. Make btrfs_wait_block_group_cache_done() static now that it's not used outside of block-group.c anymore. Fixes: `d0c2f4fa55` ("btrfs: make concurrent fsyncs wait less when waiting for a transaction commit") CC: stable@vger.kernel.org # 5.12+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-08-23 22:13:54 +02:00
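The core of the race in this commit (steps 5 to 11) can be replayed sequentially in a toy model. This is only a sketch under strong assumptions: every name is hypothetical, the block group tracks a single deleted extent, and the threads are reduced to function calls in the order given in the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, heavily reduced stand-in for a block group. */
struct model_block_group {
    uint64_t start;
    uint64_t progress;
    uint64_t last_byte_to_unpin;
    int extent_in_cache;    /* times the deleted extent was added as free */
};

static void model_switch_commit_roots(struct model_block_group *bg)
{
    bg->last_byte_to_unpin = bg->progress;
}

static void model_caching_thread(struct model_block_group *bg)
{
    bg->extent_in_cache++;          /* sees the deleted extent as free */
    bg->progress = UINT64_MAX;      /* caching finished */
}

static void model_unpin_extent(struct model_block_group *bg, uint64_t start)
{
    /* Only ranges below last_byte_to_unpin are returned to the cache. */
    if (start < bg->last_byte_to_unpin)
        bg->extent_in_cache++;
}

/* Replay the sequence; returns how often the extent ended up in the cache. */
static int model_replay(bool commit_b_overlaps)
{
    struct model_block_group bg = { .start = 1048576 };
    uint64_t extent_start = bg.start + 4096;

    assert(extent_start >= bg.start);

    bg.progress = bg.start;                 /* step 5 */
    model_switch_commit_roots(&bg);         /* step 6, transaction A */
    model_caching_thread(&bg);              /* step 7 */
    if (commit_b_overlaps)
        model_switch_commit_roots(&bg);     /* step 10, transaction B */
    model_unpin_extent(&bg, extent_start);  /* step 11, transaction A */

    return bg.extent_in_cache;
}
```

In this model `model_replay(false)` adds the extent once, while `model_replay(true)` adds it twice, which corresponds to the double-add described in the commit message.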
Josef Bacik	79d3d1d12e	btrfs: don't allow large NOWAIT direct reads Dylan and Jens reported a problem where they had an io_uring test that was returning short reads, and bisected it to `ee5b46a353` ("btrfs: increase direct io read size limit to 256 sectors"). The root cause is their test was doing larger reads via io_uring with NOWAIT and async. This was triggering a page fault during the direct read, however the first page was able to work just fine and thus we submitted a 4k read for a larger iocb. Btrfs allows for partial IO's in this case specifically because we don't allow page faults, and thus we'll attempt to do any io that we can, submit what we could, come back and fault in the rest of the range and try to do the remaining IO. However for !is_sync_kiocb() we'll call ->ki_complete() as soon as the partial dio is done, which is incorrect. In the sync case we can exit the iomap code, submit more io's, and return with the amount of IO we were able to complete successfully. We were always doing short reads in this case, but for NOWAIT we were getting saved by the fact that we were limiting direct reads to sectorsize, and if we were larger than that we would return EAGAIN. Fix the regression by simply returning EAGAIN in the NOWAIT case with larger reads, that way io_uring can retry and get the larger IO and have the fault logic handle everything properly. This still leaves the AIO short read case, but that existed before this change. The way to properly fix this would be to handle partial iocb completions, but that's a lot of work, for now deal with the regression in the most straightforward way possible. Reported-by: Dylan Yudaken <dylany@fb.com> Fixes: `ee5b46a353` ("btrfs: increase direct io read size limit to 256 sectors") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-08-22 18:08:07 +02:00
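The shape of the fix in this commit can be sketched as a single guard. The function name and the use of `sectorsize` as the threshold are assumptions for illustration. The commit message states only that larger NOWAIT reads now return EAGAIN.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical guard: refuse a NOWAIT direct read that is larger than the
 * threshold, so the caller can retry in a context that is allowed to fault
 * pages in. The threshold parameter is an assumption for this sketch.
 */
static int model_dio_read_check(bool nowait, size_t len, size_t sectorsize)
{
    assert(sectorsize > 0);

    if (nowait && len > sectorsize)
        return -EAGAIN;
    return 0;
}
```

A 64K NOWAIT read against a 4K threshold is rejected with `-EAGAIN`, while the same read without NOWAIT, or a 4K NOWAIT read, proceeds.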
Qu Wenruo	4a445b7b61	btrfs: don't merge pages into bio if their page offset is not contiguous [BUG] Zygo reported on latest development branch, he could hit ASSERT()/BUG_ON() caused crash when doing RAID5 recovery (intentionally corrupt one disk, and let btrfs to recover the data during read/scrub). And The following minimal reproducer can cause extent state leakage at rmmod time: mkfs.btrfs -f -d raid5 -m raid5 $dev1 $dev2 $dev3 -b 1G > /dev/null mount $dev1 $mnt fsstress -w -d $mnt -n 25 -s 1660807876 sync fssum -A -f -w /tmp/fssum.saved $mnt umount $mnt # Wipe the dev1 but keeps its super block xfs_io -c "pwrite -S 0x0 1m 1023m" $dev1 mount $dev1 $mnt fssum -r /tmp/fssum.saved $mnt > /dev/null umount $mnt rmmod btrfs This will lead to the following extent states leakage: BTRFS: state leak: start 499712 end 503807 state 5 in tree 1 refs 1 BTRFS: state leak: start 495616 end 499711 state 5 in tree 1 refs 1 BTRFS: state leak: start 491520 end 495615 state 5 in tree 1 refs 1 BTRFS: state leak: start 487424 end 491519 state 5 in tree 1 refs 1 BTRFS: state leak: start 483328 end 487423 state 5 in tree 1 refs 1 BTRFS: state leak: start 479232 end 483327 state 5 in tree 1 refs 1 BTRFS: state leak: start 475136 end 479231 state 5 in tree 1 refs 1 BTRFS: state leak: start 471040 end 475135 state 5 in tree 1 refs 1 [CAUSE] Since commit `7aa51232e2` ("btrfs: pass a btrfs_bio to btrfs_repair_one_sector"), we always use btrfs_bio->file_offset to determine the file offset of a page. But that usage assume that, one bio has all its page having a continuous page offsets. Unfortunately that's not true, btrfs only requires the logical bytenr contiguous when assembling its bios. 
From above script, we have one bio looks like this: fssum-27671 submit_one_bio: bio logical=217739264 len=36864 fssum-27671 submit_one_bio: r/i=5/261 page_offset=466944 <<< fssum-27671 submit_one_bio: r/i=5/261 page_offset=724992 <<< fssum-27671 submit_one_bio: r/i=5/261 page_offset=729088 fssum-27671 submit_one_bio: r/i=5/261 page_offset=733184 fssum-27671 submit_one_bio: r/i=5/261 page_offset=737280 fssum-27671 submit_one_bio: r/i=5/261 page_offset=741376 fssum-27671 submit_one_bio: r/i=5/261 page_offset=745472 fssum-27671 submit_one_bio: r/i=5/261 page_offset=749568 fssum-27671 submit_one_bio: r/i=5/261 page_offset=753664 Note that the 1st and the 2nd page has non-contiguous page offsets. This means, at repair time, we will have completely wrong file offset passed in: kworker/u32:2-19927 btrfs_repair_one_sector: r/i=5/261 page_off=729088 file_off=475136 bio_offset=8192 Since the file offset is incorrect, we latter incorrectly set the extent states, and no way to really release them. Thus later it causes the leakage. In fact, this can be even worse, since the file offset is incorrect, we can hit cases like the incorrect file offset belongs to a HOLE, and later cause btrfs_num_copies() to trigger error, finally hit BUG_ON()/ASSERT() later. [FIX] Add an extra condition in btrfs_bio_add_page() for uncompressed IO. Now we will have more strict requirement for bio pages: - They should all have the same mapping (the mapping check is already implied by the call chain) - Their logical bytenr should be adjacent This is the same as the old condition. - Their page_offset() (file offset) should be adjacent This is the new check. This would result a slightly increased amount of bios from btrfs (needs holes and inside the same stripe boundary to trigger). But this would greatly reduce the confusion, as it's pretty common to assume a btrfs bio would only contain continuous page cache. 
Later we may need extra cleanups, as we no longer need to handle gaps between page offsets in endio functions. Currently this should be the minimal patch to fix commit `7aa51232e2` ("btrfs: pass a btrfs_bio to btrfs_repair_one_sector"). Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Fixes: `7aa51232e2` ("btrfs: pass a btrfs_bio to btrfs_repair_one_sector") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-08-22 18:06:58 +02:00
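The stricter merge condition from this commit can be illustrated with a toy bio model. The struct and function names are hypothetical and the real btrfs_bio_add_page() does considerably more. The sketch keeps only the two adjacency checks, logical bytenr (old) and file offset (new).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, minimal stand-in for a bio under assembly. */
struct model_bio {
    uint64_t logical;       /* logical bytenr of the first page */
    uint64_t file_offset;   /* page_offset() of the first page */
    uint32_t len;           /* bytes added so far */
};

/* Returns true if the page was merged, false if a new bio is needed. */
static bool model_bio_add_page(struct model_bio *bio, uint64_t logical,
                               uint64_t file_offset, uint32_t len)
{
    assert(len > 0);

    if (bio->len > 0) {
        /* Old condition: logical bytenr must be adjacent. */
        if (bio->logical + bio->len != logical)
            return false;
        /* New condition: file offset must be adjacent as well. */
        if (bio->file_offset + bio->len != file_offset)
            return false;
    } else {
        bio->logical = logical;
        bio->file_offset = file_offset;
    }
    bio->len += len;
    return true;
}
```

Using the values from the trace in the commit message, a first page at logical 217739264 and page offset 466944 is accepted. The next page is logically adjacent but sits at page offset 724992, so it is rejected and would start a new bio.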
Filipe Manana	e6e3dec6c3	btrfs: update generation of hole file extent item when merging holes When punching a hole into a file range that is adjacent to a hole and we are not using the no-holes feature, we expand the range of the adjacent file extent item that represents a hole, to save metadata space. However we don't update the generation of the hole file extent item, which means a full fsync will not log that file extent item if the fsync happens in a later transaction (since commit `7f30c07288` ("btrfs: stop copying old file extents when doing a full fsync")). For example, if we do this: $ mkfs.btrfs -f -O ^no-holes /dev/sdb $ mount /dev/sdb /mnt $ xfs_io -f -c "pwrite -S 0xab 2M 2M" /mnt/foobar $ sync We end up with 2 file extent items in our file: 1) One that represents the hole for the file range [0, 2M), with a generation of 7; 2) Another one that represents an extent covering the range [2M, 4M). After that if we do the following: $ xfs_io -c "fpunch 2M 2M" /mnt/foobar We end up with a single file extent item in the file, which represents a hole for the range [0, 4M) and with a generation of 7 - because we end up dropping the data extent for range [2M, 4M) and then update the file extent item that represented the hole at [0, 2M), by increasing length from 2M to 4M. Then doing a full fsync and power failing: $ xfs_io -c "fsync" /mnt/foobar <power failure> will result in the full fsync not logging the file extent item that represents the hole for the range [0, 4M), because its generation is 7, which is lower than the generation of the current transaction (8). As a consequence, after mounting the filesystem again (after log replay), the region [2M, 4M) does not have a hole, it still points to the previous data extent. So fix this by always updating the generation of existing file extent items representing holes when we merge/expand them.
This solves the problem and it's the same approach as when we merge prealloc extents that got written (at btrfs_mark_extent_written()). Setting the generation to the current transaction's generation is also what we do when merging the new hole extent map with the previous one or the next one. A test case for fstests, covering both cases of hole file extent item merging (to the left and to the right), will be sent soon. Fixes: `7f30c07288` ("btrfs: stop copying old file extents when doing a full fsync") CC: stable@vger.kernel.org # 5.18+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-08-22 18:06:42 +02:00
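The effect of the fix can be modeled with a short C sketch. The types and helpers here (`struct hole_item`, `full_fsync_logs_item()`, `expand_hole()`) are hypothetical simplifications, and the logging rule is reduced to the single generation comparison described in the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a hole file extent item. */
struct hole_item {
	uint64_t offset;      /* file offset where the hole starts */
	uint64_t num_bytes;   /* length of the hole */
	uint64_t generation;  /* transaction that last touched the item */
};

/* Simplified rule: a full fsync skips items from older transactions. */
static bool full_fsync_logs_item(const struct hole_item *item,
				 uint64_t cur_gen)
{
	assert(item != NULL);
	return item->generation >= cur_gen;
}

/* Expand an existing hole and stamp the current generation on it. */
static void expand_hole(struct hole_item *hole, uint64_t extra_bytes,
			uint64_t cur_gen)
{
	assert(hole != NULL);
	hole->num_bytes += extra_bytes;
	hole->generation = cur_gen;  /* the step the fix adds */
}
```

In the example from the message, the hole item for [0, 2M) has generation 7. After being expanded to [0, 4M) in transaction 8, it carries generation 8 and is no longer skipped by the full fsync in this model.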
Zixuan Fu	9ea0106a7a	btrfs: fix possible memory leak in btrfs_get_dev_args_from_path() In btrfs_get_dev_args_from_path(), btrfs_get_bdev_and_sb() can fail if the path is invalid. In this case, btrfs_get_dev_args_from_path() returns directly without freeing args->uuid and args->fsid allocated before, which causes a memory leak. To fix these possible leaks, when btrfs_get_bdev_and_sb() fails, btrfs_put_dev_args_from_path() is called to clean up the memory. Reported-by: TOTE Robot <oslab@tsinghua.edu.cn> Fixes: `faa775c41d` ("btrfs: add a btrfs_get_dev_args_from_path helper") CC: stable@vger.kernel.org # 5.16 Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Zixuan Fu <r33s3n6@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-08-22 18:06:33 +02:00
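The error-path pattern described above can be sketched in userspace C. Everything here is an assumption made for illustration: `struct dev_args`, `get_dev_args()` and `put_dev_args()` are hypothetical stand-ins, and the `open_ret` parameter simulates the return value of btrfs_get_bdev_and_sb().

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the lookup arguments named in the message. */
struct dev_args {
	unsigned char *uuid;
	unsigned char *fsid;
};

/* Release both buffers; safe to call on partially filled args. */
static void put_dev_args(struct dev_args *args)
{
	assert(args != NULL);
	free(args->uuid);
	free(args->fsid);
	args->uuid = NULL;
	args->fsid = NULL;
}

/* open_ret simulates the device open result: 0 on success. */
static int get_dev_args(struct dev_args *args, int open_ret)
{
	assert(args != NULL);
	args->uuid = calloc(1, 16);
	args->fsid = calloc(1, 16);
	if (!args->uuid || !args->fsid) {
		put_dev_args(args);
		return -1;
	}
	if (open_ret) {
		/* Without this call both buffers would leak. */
		put_dev_args(args);
		return open_ret;
	}
	return 0;
}
```

The design point is that every exit after the allocations goes through the same cleanup helper, so a new failure branch cannot silently skip the frees.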
Goldwyn Rodrigues	b51111271b	btrfs: check if root is readonly while setting security xattr For a filesystem which has btrfs read-only property set to true, all write operations including xattr should be denied. However, security xattr can still be changed even if btrfs ro property is true. This happens because xattr_permission() does not have any restrictions on security.*, system.* and in some cases trusted.* from VFS and the decision is left to the underlying filesystem. See comments in xattr_permission() for more details. This patch checks if the root is read-only before performing the set xattr operation. Testcase: DEV=/dev/vdb MNT=/mnt mkfs.btrfs -f $DEV mount $DEV $MNT echo "file one" > $MNT/f1 setfattr -n "security.one" -v 2 $MNT/f1 btrfs property set /mnt ro true setfattr -n "security.one" -v 1 $MNT/f1 umount $MNT CC: stable@vger.kernel.org # 4.9+ Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-08-22 18:06:30 +02:00
Linus Torvalds	42c54d5491	for-6.0-rc1-tag Merge tag 'for-6.0-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few short fixes and a lockdep warning fix (needs moving some code): - tree-log replay fixes: - fix error handling when looking up extent refs - fix warning when setting inode number of links - relocation fixes: - reset block group read-only status when relocation fails - unset control structure if transaction fails when starting to process a block group - add lockdep annotations to fix a warning during relocation where blocks temporarily belong to another tree and can lead to reversed dependencies - tree-checker verifies that extent items don't overlap" * tag 'for-6.0-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: tree-checker: check for overlapping extent items btrfs: fix warning during log replay when bumping inode link count btrfs: fix lost error handling when looking up extended ref on log replay btrfs: fix lockdep splat with reloc root extent buffers btrfs: move lockdep class helpers to locking.c btrfs: unset reloc control if transaction commit
fails in prepare_to_relocate() btrfs: reset RO counter on block group if we fail to relocate	2022-08-19 13:33:48 -07:00
Josef Bacik	899b7f69f2	btrfs: tree-checker: check for overlapping extent items We're seeing a weird problem in production where we have overlapping extent items in the extent tree. It's unclear where these are coming from, and in debugging we realized there's no check in the tree checker for this sort of problem. Add a check to the tree-checker to make sure that the extents do not overlap each other. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-08-17 16:20:25 +02:00
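A check of this kind can be sketched as a single pass over sorted items. This is only an illustration of the idea, assuming items are sorted by start offset; `struct extent_key` and `find_overlap()` are hypothetical names and do not reflect the tree-checker's actual data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simplified extent item: a byte range [start, start + len). */
struct extent_key {
	uint64_t start;
	uint64_t len;
};

/*
 * Items are assumed to be sorted by start. Return the index of the
 * first item that begins before its predecessor ends, or -1 if no
 * two neighbouring items overlap.
 */
static int find_overlap(const struct extent_key *items, int nr)
{
	assert(items != NULL || nr == 0);

	for (int i = 1; i < nr; i++) {
		uint64_t prev_end = items[i - 1].start + items[i - 1].len;

		if (prev_end > items[i].start)
			return i;
	}
	return -1;
}
```

Because the items are sorted, comparing each item only with its direct predecessor is enough to detect any overlap between neighbours.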
Filipe Manana	769030e118	btrfs: fix warning during log replay when bumping inode link count During log replay, at add_link(), we may increment the link count of another inode that has a reference that conflicts with a new reference for the inode currently being processed. During log replay, at add_link(), we may drop (unlink) a reference from some inode in the subvolume tree if that reference conflicts with a new reference found in the log for the inode we are currently processing. After the unlink, If the link count has decreased from 1 to 0, then we increment the link count to prevent the inode from being deleted if it's evicted by an iput() call, because we may have references to add to that inode later on (and we will fixup its link count later during log replay). However incrementing the link count from 0 to 1 triggers a warning: $ cat fs/inode.c (...) void inc_nlink(struct inode *inode) { if (unlikely(inode->i_nlink == 0)) { WARN_ON(!(inode->i_state & I_LINKABLE)); atomic_long_dec(&inode->i_sb->s_remove_count); } (...) The I_LINKABLE flag is only set when creating an O_TMPFILE file, so it's never set during log replay. Most of the time, the warning isn't triggered even if we dropped the last reference of the conflicting inode, and this is because: 1) The conflicting inode was previously marked for fixup, through a call to link_to_fixup_dir(), which increments the inode's link count; 2) And the last iput() on the inode has not triggered eviction of the inode, nor was eviction triggered after the iput(). So at add_link(), even if we unlink the last reference of the inode, its link count ends up being 1 and not 0. So this means that if eviction is triggered after link_to_fixup_dir() is called, at add_link() we will read the inode back from the subvolume tree and have it with a correct link count, matching the number of references it has on the subvolume tree. 
So if, when we are at add_link(), the inode has exactly one reference, its link count is 1, and after the unlink its link count becomes 0. So fix this by using set_nlink() instead of inc_nlink(), as the former accepts a transition from 0 to 1 and it's what we use in other similar contexts (like at link_to_fixup_dir()). Also make add_inode_ref() use set_nlink() instead of inc_nlink() to bump the link count from 0 to 1. The warning is actually harmless, but it may scare users. Josef also ran into it recently. CC: stable@vger.kernel.org # 5.1+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-08-17 16:19:50 +02:00
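The difference between the two helpers for the 0 to 1 transition can be modeled in userspace C. This is a rough sketch: `struct inode_model` and both functions are hypothetical, the warning is reduced to a counter, and the real set_nlink() does more bookkeeping than shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the inode fields relevant to the warning. */
struct inode_model {
	unsigned int nlink;
	bool linkable;   /* models the I_LINKABLE state flag */
	int warnings;    /* counts how often WARN_ON() would fire */
};

/* Models the quoted inc_nlink(): going up from 0 warns unless linkable. */
static void model_inc_nlink(struct inode_model *inode)
{
	assert(inode != NULL);
	if (inode->nlink == 0 && !inode->linkable)
		inode->warnings++;
	inode->nlink++;
}

/* Models set_nlink(): assigns the count with no warning on 0 -> 1. */
static void model_set_nlink(struct inode_model *inode, unsigned int nlink)
{
	assert(inode != NULL);
	inode->nlink = nlink;
}
```

Both calls leave the link count at 1, but only the increment path records a warning when the inode is not linkable, which matches the situation during log replay described above.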
Filipe Manana	7a6b75b799	btrfs: fix lost error handling when looking up extended ref on log replay During log replay, when processing inode references, if we get an error when looking up an extended reference at __add_inode_ref(), we ignore it and proceed, returning success (0) if no other error happens after the lookup. This is obviously wrong because in case an extended reference exists and it encodes some name not in the log, we need to unlink it, otherwise the filesystem state will not match the state it had after the last fsync. So just make __add_inode_ref() return any error it gets from the extended reference lookup. Fixes: `f186373fef` ("btrfs: extended inode refs") CC: stable@vger.kernel.org # 4.9+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-08-17 16:19:45 +02:00
Josef Bacik	b40130b23c	btrfs: fix lockdep splat with reloc root extent buffers We have been hitting the following lockdep splat with btrfs/187 recently WARNING: possible circular locking dependency detected 5.19.0-rc8+ #775 Not tainted ------------------------------------------------------ btrfs/752500 is trying to acquire lock: ffff97e1875a97b8 (btrfs-treloc-02#2){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110 but task is already holding lock: ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (btrfs-tree-01/1){+.+.}-{3:3}: down_write_nested+0x41/0x80 __btrfs_tree_lock+0x24/0x110 btrfs_init_new_buffer+0x7d/0x2c0 btrfs_alloc_tree_block+0x120/0x3b0 __btrfs_cow_block+0x136/0x600 btrfs_cow_block+0x10b/0x230 btrfs_search_slot+0x53b/0xb70 btrfs_lookup_inode+0x2a/0xa0 __btrfs_update_delayed_inode+0x5f/0x280 btrfs_async_run_delayed_root+0x24c/0x290 btrfs_work_helper+0xf2/0x3e0 process_one_work+0x271/0x590 worker_thread+0x52/0x3b0 kthread+0xf0/0x120 ret_from_fork+0x1f/0x30 -> #1 (btrfs-tree-01){++++}-{3:3}: down_write_nested+0x41/0x80 __btrfs_tree_lock+0x24/0x110 btrfs_search_slot+0x3c3/0xb70 do_relocation+0x10c/0x6b0 relocate_tree_blocks+0x317/0x6d0 relocate_block_group+0x1f1/0x560 btrfs_relocate_block_group+0x23e/0x400 btrfs_relocate_chunk+0x4c/0x140 btrfs_balance+0x755/0xe40 btrfs_ioctl+0x1ea2/0x2c90 __x64_sys_ioctl+0x88/0xc0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd -> #0 (btrfs-treloc-02#2){+.+.}-{3:3}: __lock_acquire+0x1122/0x1e10 lock_acquire+0xc2/0x2d0 down_write_nested+0x41/0x80 __btrfs_tree_lock+0x24/0x110 btrfs_lock_root_node+0x31/0x50 btrfs_search_slot+0x1cb/0xb70 replace_path+0x541/0x9f0 merge_reloc_root+0x1d6/0x610 merge_reloc_roots+0xe2/0x260 relocate_block_group+0x2c8/0x560 btrfs_relocate_block_group+0x23e/0x400 btrfs_relocate_chunk+0x4c/0x140 btrfs_balance+0x755/0xe40 btrfs_ioctl+0x1ea2/0x2c90 
__x64_sys_ioctl+0x88/0xc0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd other info that might help us debug this: Chain exists of: btrfs-treloc-02#2 --> btrfs-tree-01 --> btrfs-tree-01/1 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(btrfs-tree-01/1); lock(btrfs-tree-01); lock(btrfs-tree-01/1); lock(btrfs-treloc-02#2); * DEADLOCK * 7 locks held by btrfs/752500: #0: ffff97e292fdf460 (sb_writers#12){.+.+}-{0:0}, at: btrfs_ioctl+0x208/0x2c90 #1: ffff97e284c02050 (&fs_info->reclaim_bgs_lock){+.+.}-{3:3}, at: btrfs_balance+0x55f/0xe40 #2: ffff97e284c00878 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_relocate_block_group+0x236/0x400 #3: ffff97e292fdf650 (sb_internal#2){.+.+}-{0:0}, at: merge_reloc_root+0xef/0x610 #4: ffff97e284c02378 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0 #5: ffff97e284c023a0 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0 #6: ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110 stack backtrace: CPU: 1 PID: 752500 Comm: btrfs Not tainted 5.19.0-rc8+ #775 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 Call Trace: dump_stack_lvl+0x56/0x73 check_noncircular+0xd6/0x100 ? lock_is_held_type+0xe2/0x140 __lock_acquire+0x1122/0x1e10 lock_acquire+0xc2/0x2d0 ? __btrfs_tree_lock+0x24/0x110 down_write_nested+0x41/0x80 ? __btrfs_tree_lock+0x24/0x110 __btrfs_tree_lock+0x24/0x110 btrfs_lock_root_node+0x31/0x50 btrfs_search_slot+0x1cb/0xb70 ? lock_release+0x137/0x2d0 ? _raw_spin_unlock+0x29/0x50 ? release_extent_buffer+0x128/0x180 replace_path+0x541/0x9f0 merge_reloc_root+0x1d6/0x610 merge_reloc_roots+0xe2/0x260 relocate_block_group+0x2c8/0x560 btrfs_relocate_block_group+0x23e/0x400 btrfs_relocate_chunk+0x4c/0x140 btrfs_balance+0x755/0xe40 btrfs_ioctl+0x1ea2/0x2c90 ? lock_is_held_type+0xe2/0x140 ? lock_is_held_type+0xe2/0x140 ? 
__x64_sys_ioctl+0x88/0xc0 __x64_sys_ioctl+0x88/0xc0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd This isn't necessarily new, it's just tricky to hit in practice. There are two competing things going on here. With relocation we create a snapshot of every fs tree with a reloc tree. Any extent buffers that get initialized here are initialized with the reloc root lockdep key. However since it is a snapshot, any blocks that are currently in cache that originally belonged to the fs tree will have the normal tree lockdep key set. This creates the lock dependency of reloc tree -> normal tree for the extent buffer locking during the first phase of the relocation as we walk down the reloc root to relocate blocks. However this is problematic because the final phase of the relocation is merging the reloc root into the original fs root. This involves searching down to any keys that exist in the original fs root and then swapping the relocated block and the original fs root block. We have to search down to the fs root first, and then go search the reloc root for the block we need to replace. This creates the dependency of normal tree -> reloc tree which is why lockdep complains. Additionally even if we were to fix this particular mismatch with a different nesting for the merge case, we're still slotting in a block that has an owner of the reloc root objectid into a normal tree, so that block will have its lockdep key set to the tree reloc root, and create a lockdep splat later on when we wander into that block from the fs root. Unfortunately the only solution here is to make sure we do not set the lockdep key to the reloc tree lockdep key normally, and then reset any blocks we wander into from the reloc root when we're doing the merge.
This solves the problem of having mixed tree reloc keys intermixed with normal tree keys, and then allows us to make sure in the merge case we maintain the lock order of normal tree -> reloc tree. We handle this by setting a bit on the reloc root when we do the search for the block we want to relocate, and any block we search into or COW at that point gets set to the reloc tree key. This works correctly because we only ever COW down to the parent node, so we aren't resetting the key for the block we're linking into the fs root. With this patch we no longer have the lockdep splat in btrfs/187. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-08-17 16:19:12 +02:00
Josef Bacik	0a27a0474d	btrfs: move lockdep class helpers to locking.c These definitions exist in disk-io.c, which is not related to the locking. Move this over to locking.h/c where it makes more sense. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-08-17 16:19:10 +02:00
Zixuan Fu	85f02d6c85	btrfs: unset reloc control if transaction commit fails in prepare_to_relocate() In btrfs_relocate_block_group(), the rc is allocated. Then btrfs_relocate_block_group() calls relocate_block_group() prepare_to_relocate() set_reloc_control() that assigns rc to the variable fs_info->reloc_ctl. When prepare_to_relocate() returns, it calls btrfs_commit_transaction() btrfs_start_dirty_block_groups() btrfs_alloc_path() kmem_cache_zalloc() which may fail for example (or other errors could happen). When the failure occurs, btrfs_relocate_block_group() detects the error and frees rc and doesn't set fs_info->reloc_ctl to NULL. After that, in btrfs_init_reloc_root(), rc is retrieved from fs_info->reloc_ctl and then used, which may cause a use-after-free bug. This possible bug can be triggered by calling btrfs_ioctl_balance() before calling btrfs_ioctl_defrag(). To fix this possible bug, in prepare_to_relocate(), check if btrfs_commit_transaction() fails. If the failure occurs, unset_reloc_control() is called to set fs_info->reloc_ctl to NULL. The error log in our fault-injection testing is shown as follows: [ 58.751070] BUG: KASAN: use-after-free in btrfs_init_reloc_root+0x7ca/0x920 [btrfs] ... [ 58.753577] Call Trace: ... [ 58.755800] kasan_report+0x45/0x60 [ 58.756066] btrfs_init_reloc_root+0x7ca/0x920 [btrfs] [ 58.757304] record_root_in_trans+0x792/0xa10 [btrfs] [ 58.757748] btrfs_record_root_in_trans+0x463/0x4f0 [btrfs] [ 58.758231] start_transaction+0x896/0x2950 [btrfs] [ 58.758661] btrfs_defrag_root+0x250/0xc00 [btrfs] [ 58.759083] btrfs_ioctl_defrag+0x467/0xa00 [btrfs] [ 58.759513] btrfs_ioctl+0x3c95/0x114e0 [btrfs] ... 
[ 58.768510] Allocated by task 23683: [ 58.768777] ____kasan_kmalloc+0xb5/0xf0 [ 58.769069] __kmalloc+0x227/0x3d0 [ 58.769325] alloc_reloc_control+0x10a/0x3d0 [btrfs] [ 58.769755] btrfs_relocate_block_group+0x7aa/0x1e20 [btrfs] [ 58.770228] btrfs_relocate_chunk+0xf1/0x760 [btrfs] [ 58.770655] __btrfs_balance+0x1326/0x1f10 [btrfs] [ 58.771071] btrfs_balance+0x3150/0x3d30 [btrfs] [ 58.771472] btrfs_ioctl_balance+0xd84/0x1410 [btrfs] [ 58.771902] btrfs_ioctl+0x4caa/0x114e0 [btrfs] ... [ 58.773337] Freed by task 23683: ... [ 58.774815] kfree+0xda/0x2b0 [ 58.775038] free_reloc_control+0x1d6/0x220 [btrfs] [ 58.775465] btrfs_relocate_block_group+0x115c/0x1e20 [btrfs] [ 58.775944] btrfs_relocate_chunk+0xf1/0x760 [btrfs] [ 58.776369] __btrfs_balance+0x1326/0x1f10 [btrfs] [ 58.776784] btrfs_balance+0x3150/0x3d30 [btrfs] [ 58.777185] btrfs_ioctl_balance+0xd84/0x1410 [btrfs] [ 58.777621] btrfs_ioctl+0x4caa/0x114e0 [btrfs] ... Reported-by: TOTE Robot <oslab@tsinghua.edu.cn> CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Zixuan Fu <r33s3n6@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-08-17 16:18:58 +02:00
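The dangling-pointer scenario and its fix can be sketched in userspace C. All names here are hypothetical models (`struct fs_info_model`, `prepare_to_relocate_model()`, `relocate_model()`), and `commit_ret` simulates the return value of btrfs_commit_transaction().

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the structures named in the message. */
struct reloc_control {
	int stage;
};

struct fs_info_model {
	struct reloc_control *reloc_ctl;
};

/* commit_ret simulates the transaction commit result: 0 on success. */
static int prepare_to_relocate_model(struct fs_info_model *fs,
				     struct reloc_control *rc, int commit_ret)
{
	assert(fs != NULL && rc != NULL);
	fs->reloc_ctl = rc;            /* models set_reloc_control() */
	if (commit_ret) {
		fs->reloc_ctl = NULL;  /* models unset_reloc_control(): the fix */
		return commit_ret;
	}
	return 0;
}

static int relocate_model(struct fs_info_model *fs, int commit_ret)
{
	struct reloc_control *rc = calloc(1, sizeof(*rc));
	int ret;

	if (!rc)
		return -1;
	ret = prepare_to_relocate_model(fs, rc, commit_ret);
	if (ret) {
		/* Safe: fs->reloc_ctl no longer points at rc. */
		free(rc);
		return ret;
	}
	fs->reloc_ctl = NULL;          /* normal teardown */
	free(rc);
	return 0;
}
```

Without the assignment marked as the fix, the failure branch would free `rc` while `fs->reloc_ctl` still pointed at it, which is the use-after-free pattern shown in the KASAN report.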
Linus Torvalds	6614a3c316	- The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe Lin, Yang Shi, Anshuman Khandual and Mike Rapoport - Some kmemleak fixes from Patrick Wang and Waiman Long - DAMON updates from SeongJae Park - memcg debug/visibility work from Roman Gushchin - vmalloc speedup from Uladzislau Rezki - more folio conversion work from Matthew Wilcox - enhancements for coherent device memory mapping from Alex Sierra - addition of shared pages tracking and CoW support for fsdax, from Shiyang Ruan - hugetlb optimizations from Mike Kravetz - Mel Gorman has contributed some pagealloc changes to improve latency and realtime behaviour. - mprotect soft-dirty checking has been improved by Peter Xu - Many other singleton patches all over the place Merge tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Most of the MM queue. A few things are still pending. Liam's maple tree rework didn't make it. This has resulted in a few other minor patch series being held over for next time. Multi-gen LRU still isn't merged as we were waiting for mapletree to stabilize. The current plan is to merge MGLRU into -mm soon and to later reintroduce mapletree, with a view to hopefully getting both into 6.1-rc1.
Summary: - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe Lin, Yang Shi, Anshuman Khandual and Mike Rapoport - Some kmemleak fixes from Patrick Wang and Waiman Long - DAMON updates from SeongJae Park - memcg debug/visibility work from Roman Gushchin - vmalloc speedup from Uladzislau Rezki - more folio conversion work from Matthew Wilcox - enhancements for coherent device memory mapping from Alex Sierra - addition of shared pages tracking and CoW support for fsdax, from Shiyang Ruan - hugetlb optimizations from Mike Kravetz - Mel Gorman has contributed some pagealloc changes to improve latency and realtime behaviour. - mprotect soft-dirty checking has been improved by Peter Xu - Many other singleton patches all over the place" [ XFS merge from hell as per Darrick Wong in https://lore.kernel.org/all/YshKnxb4VwXycPO8@magnolia/ ] * tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (282 commits) tools/testing/selftests/vm/hmm-tests.c: fix build mm: Kconfig: fix typo mm: memory-failure: convert to pr_fmt() mm: use is_zone_movable_page() helper hugetlbfs: fix inaccurate comment in hugetlbfs_statfs() hugetlbfs: cleanup some comments in inode.c hugetlbfs: remove unneeded header file hugetlbfs: remove unneeded hugetlbfs_ops forward declaration hugetlbfs: use helper macro SZ_1{K,M} mm: cleanup is_highmem() mm/hmm: add a test for cross device private faults selftests: add soft-dirty into run_vmtests.sh selftests: soft-dirty: add test for mprotect mm/mprotect: fix soft-dirty check in can_change_pte_writable() mm: memcontrol: fix potential oom_lock recursion deadlock mm/gup.c: fix formatting in check_and_migrate_movable_page() xfs: fail dax mount if reflink is enabled on a partition mm/memcontrol.c: remove the redundant updating of stats_flush_threshold userfaultfd: don't fail on unrecognized features hugetlb_cgroup: fix wrong hugetlb cgroup numa stat ...	2022-08-05 16:32:45 -07:00
Linus Torvalds	353767e4aa	for-5.20-tag Merge tag 'for-5.20-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "This brings some long awaited changes, the send protocol bump, otherwise lots of small improvements and fixes. The main core part is reworking bio handling, cleaning up the submission and endio and improving error handling. There are some changes outside of btrfs adding helpers or updating API, listed at the end of the changelog. Features: - sysfs: - export chunk size, in debug mode add tunable for setting its size - show zoned among features (was only in debug mode) - show commit stats (number, last/max/total duration) - send protocol updated to 2 - new commands: - ability write larger data chunks than 64K - send raw compressed extents (uses the encoded data ioctls), ie.
no decompression on send side, no compression needed on receive side if supported - send 'otime' (inode creation time) among other timestamps - send file attributes (a.k.a file flags and xflags) - this is first version bump, backward compatibility on send and receive side is provided - there are still some known and wanted commands that will be implemented in the near future, another version bump will be needed, however we want to minimize that to avoid causing usability issues - print checksum type and implementation at mount time - don't print some messages at mount (mentioned as people asked about it), we want to print messages namely for new features so let's make some space for that - big metadata - this has been supported for a long time and is not a feature that's worth mentioning - skinny metadata - same reason, set by default by mkfs Performance improvements: - reduced amount of reserved metadata for delayed items - when inserted items can be batched into one leaf - when deleting batched directory index items - when deleting delayed items used for deletion - overall improved count of files/sec, decreased subvolume lock contention - metadata item access bounds checker micro-optimized, with a few percent of improved runtime for metadata-heavy operations - increase direct io limit for read to 256 sectors, improved throughput by 3x on sample workload Notable fixes: - raid56 - reduce parity writes, skip sectors of stripe when there are no data updates - restore reading from on-disk data instead of using stripe cache, this reduces chances to damage correct data due to RMW cycle - refuse to replay log with unknown incompat read-only feature bit set - zoned - fix page locking when COW fails in the middle of allocation - improved tracking of active zones, ZNS drives may limit the number and there are ENOSPC errors due to that limit and not actual lack of space - adjust maximum extent size for zone append so it does not cause late ENOSPC due to underreservation - 
mirror reading error messages show the mirror number - don't fallback to buffered IO for NOWAIT direct IO writes, we don't have the NOWAIT semantics for buffered io yet - send, fix sending link commands for existing file paths when there are deleted and created hardlinks for same files - repair all mirrors for profiles with more than 1 copy (raid1c34) - fix repair of compressed extents, unify where error detection and repair happen Core changes: - bio completion cleanups - don't double defer compression bios - simplify endio workqueues - add more data to btrfs_bio to avoid allocation for read requests - rework bio error handling so it's same what block layer does, the submission works and errors are consumed in endio - when asynchronous bio offload fails fall back to synchronous checksum calculation to avoid errors under writeback or memory pressure - new trace points - raid56 events - ordered extent operations - super block log_root_transid deprecated (never used) - mixed_backref and big_metadata sysfs feature files removed, they've been default for sufficiently long time, there are no known users and mixed_backref could be confused with mixed_groups Non-btrfs changes, API updates: - minor highmem API update to cover const arguments - switch all kmap/kmap_atomic to kmap_local - remove redundant flush_dcache_page() - address_space_operations::writepage callback removed - add bdev_max_segments() helper" * tag 'for-5.20-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (163 commits) btrfs: don't call btrfs_page_set_checked in finish_compressed_bio_read btrfs: fix repair of compressed extents btrfs: remove the start argument to check_data_csum and export btrfs: pass a btrfs_bio to btrfs_repair_one_sector btrfs: simplify the pending I/O counting in struct compressed_bio btrfs: repair all known bad mirrors btrfs: merge btrfs_dev_stat_print_on_error with its only caller btrfs: join running log transaction when logging new name btrfs: simplify error 
handling in btrfs_lookup_dentry btrfs: send: always use the rbtree based inode ref management infrastructure btrfs: send: fix sending link commands for existing file paths btrfs: send: introduce recorded_ref_alloc and recorded_ref_free btrfs: zoned: wait until zone is finished when allocation didn't progress btrfs: zoned: write out partially allocated region btrfs: zoned: activate necessary block group btrfs: zoned: activate metadata block group on flush_space btrfs: zoned: disable metadata overcommit for zoned btrfs: zoned: introduce space_info->active_total_bytes btrfs: zoned: finish least available block group on data bg allocation btrfs: let can_allocate_chunk return error ...	2022-08-03 14:54:52 -07:00
Linus Torvalds	5264406cdb	iov_iter work, part 1 - isolated cleanups and optimizations. One of the goals is to reduce the overhead of using ->read_iter() and ->write_iter() instead of ->read()/->write(); new_sync_{read,write}() has a surprising amount of overhead, in particular inside iocb_flags(). That's why the beginning of the series is in this pile; it's not directly iov_iter-related, but it's a part of the same work... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Merge tag 'pull-work.iov_iter-base' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs iov_iter updates from Al Viro: "Part 1 - isolated cleanups and optimizations. One of the goals is to reduce the overhead of using ->read_iter() and ->write_iter() instead of ->read()/->write(). new_sync_{read,write}() has a surprising amount of overhead, in particular inside iocb_flags(). That's why the beginning of the series is in this pile; it's not directly iov_iter-related, but it's a part of the same work..." * tag 'pull-work.iov_iter-base' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: first_iovec_segment(): just return address iov_iter: massage calling conventions for first_{iovec,bvec}_segment() iov_iter: first_{iovec,bvec}_segment() - simplify a bit iov_iter: lift dealing with maxpages out of first_{iovec,bvec}_segment() iov_iter_get_pages{,_alloc}(): cap the maxsize with MAX_RW_COUNT iov_iter_bvec_advance(): don't bother with bvec_iter copy_page_{to,from}_iter(): switch iovec variants to generic keep iocb_flags() result cached in struct file iocb: delay evaluation of IS_SYNC(...)
until we want to check IOCB_DSYNC struct file: use anonymous union member for rcuhead and llist btrfs: use IOMAP_DIO_NOSYNC teach iomap_dio_rw() to suppress dsync No need of likely/unlikely on calls of check_copy_size()	2022-08-03 13:50:22 -07:00
Linus Torvalds	f00654007f	Folio changes for 6.0 - Fix an accounting bug that made NR_FILE_DIRTY grow without limit when running xfstests - Convert more of mpage to use folios - Remove add_to_page_cache() and add_to_page_cache_locked() - Convert find_get_pages_range() to filemap_get_folios() - Improvements to the read_cache_page() family of functions - Remove a few unnecessary checks of PageError - Some straightforward filesystem conversions to use folios - Split PageMovable users out from address_space_operations into their own movable_operations - Convert aops->migratepage to aops->migrate_folio - Remove nobh support (Christoph Hellwig) -----BEGIN PGP SIGNATURE----- iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmLpViQACgkQDpNsjXcp gj5pBgf/f3+K7Hi3qw7aYQCYJQ7IA/bLyE/DLWI59kuiao6wDSve40B9YH9X++Ha mRLp55bkQS+bwS2xa4jlqrIDJzAfNoWlXaXZHUXGL1C/52ChTF6jaH2cvO9PVlDS 7fLv1hy2LwiIdzpKJkUW7T+kcQGj3QLKqtQ4x8zD0LGMg055yvt/qndHSUi41nWT /58+6W8Sk4vvRgkpeChFzF1lGLy00+FGT8y5V2kM9uRliFQ7XPCwqB2a3e5jbW6z C1NXQmRnopCrnOT1TFIhK3DyX6MDIWV5qcikNAmCKFb9fQFPmjDLPt9iSoMGjw2M Z+UVhJCaU3ISccd0DG5Ra/vzs9/O9Q== =DgUi -----END PGP SIGNATURE----- Merge tag 'folio-6.0' of git://git.infradead.org/users/willy/pagecache Pull folio updates from Matthew Wilcox: - Fix an accounting bug that made NR_FILE_DIRTY grow without limit when running xfstests - Convert more of mpage to use folios - Remove add_to_page_cache() and add_to_page_cache_locked() - Convert find_get_pages_range() to filemap_get_folios() - Improvements to the read_cache_page() family of functions - Remove a few unnecessary checks of PageError - Some straightforward filesystem conversions to use folios - Split PageMovable users out from address_space_operations into their own movable_operations - Convert aops->migratepage to aops->migrate_folio - Remove nobh support (Christoph Hellwig) * tag 'folio-6.0' of git://git.infradead.org/users/willy/pagecache: (78 commits) fs: remove the NULL get_block case in mpage_writepages fs: don't call 
->writepage from __mpage_writepage fs: remove the nobh helpers jfs: stop using the nobh helper ext2: remove nobh support ntfs3: refactor ntfs_writepages mm/folio-compat: Remove migration compatibility functions fs: Remove aops->migratepage() secretmem: Convert to migrate_folio hugetlb: Convert to migrate_folio aio: Convert to migrate_folio f2fs: Convert to filemap_migrate_folio() ubifs: Convert to filemap_migrate_folio() btrfs: Convert btrfs_migratepage to migrate_folio mm/migrate: Add filemap_migrate_folio() mm/migrate: Convert migrate_page() to migrate_folio() nfs: Convert to migrate_folio btrfs: Convert btree_migratepage to migrate_folio mm/migrate: Convert expected_page_refs() to folio_expected_refs() mm/migrate: Convert buffer_migrate_page() to buffer_migrate_folio() ...	2022-08-03 10:35:43 -07:00
Linus Torvalds	c013d0af81	for-5.20/block-2022-07-29 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmLko3gQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpmQaD/90NKFj4v8I456TUQyg1jimXEsL+e84E6o2 ALWVb6JzQvlPVQXNLnK5YKIunMWOTtTMz0nyB8sVRwVJVJO0P5d7QopAkZM8fkyU MK5OCzoryENw4DTc2wJS4in6cSbGylIuN74wMzlf7+M67JTImfoZQhbTMcjwzZfn b3OlL6sID7zMXwGcuOJPZyUJICCpDhzdSF9JXqKma5PQuG2SBmQyvFxJAcsoFBPc YetnoRIOIN6yBvsIZaPaYq7XI9MIvF0e67EQtyCEHj4tHpyVnyDWkeObVFULsISU gGEKbkYPvNUzRAU5Q1NBBHh1tTfkf/MaUxTuZwoEwZ/s04IGBGMmrZGyfvdfzYo6 M7NwSEg/TrUSNfTwn65mQi7uOXu1pGkJrqz84Flm8u9Qid9Vd7LExLG5p/ggnWdH 5th93MDEmtEg29e9DXpEAuS5d0t3TtSvosflaKpyfNNfr+P0rWCN6GM/uW62VUTK ls69SQh/AQJRbg64jU4xper6WhaYtSXK7TKEnxJycoEn9gYNyCcdot2uekth0xRH ChHGmRlteiqe/y4uFWn/2dcxWjoleiHbFjTaiRL75WVl8wIDEjw02LGuoZ61Ss9H WOV+MT7KqNjBGe6lreUY+O/PO02dzmoR6heJXN19p8zr/pBuLCTGX7UpO7rzgaBR 4N1HEozvIw== =celk -----END PGP SIGNATURE----- Merge tag 'for-5.20/block-2022-07-29' of git://git.kernel.dk/linux-block Pull block updates from Jens Axboe: - Improve the type checking of request flags (Bart) - Ensure queue mapping for a single queues always picks the right queue (Bart) - Sanitize the io priority handling (Jan) - rq-qos race fix (Jinke) - Reserved tags handling improvements (John) - Separate memory alignment from file/disk offset aligment for O_DIRECT (Keith) - Add new ublk driver, userspace block driver using io_uring for communication with the userspace backend (Ming) - Use try_cmpxchg() to cleanup the code in various spots (Uros) - Finally remove bdevname() (Christoph) - Clean up the zoned device handling (Christoph) - Clean up independent access range support (Christoph) - Clean up and improve block sysfs handling (Christoph) - Clean up and improve teardown of block devices. 
This turns the usual two step process into something that is simpler to implement and handle in block drivers (Christoph) - Clean up chunk size handling (Christoph) - Misc cleanups and fixes (Bart, Bo, Dan, GuoYong, Jason, Keith, Liu, Ming, Sebastian, Yang, Ying) * tag 'for-5.20/block-2022-07-29' of git://git.kernel.dk/linux-block: (178 commits) ublk_drv: fix double shift bug ublk_drv: make sure that correct flags(features) returned to userspace ublk_drv: fix error handling of ublk_add_dev ublk_drv: fix lockdep warning block: remove __blk_get_queue block: call blk_mq_exit_queue from disk_release for never added disks blk-mq: fix error handling in __blk_mq_alloc_disk ublk: defer disk allocation ublk: rewrite ublk_ctrl_get_queue_affinity to not rely on hctx->cpumask ublk: fold __ublk_create_dev into ublk_ctrl_add_dev ublk: cleanup ublk_ctrl_uring_cmd ublk: simplify ublk_ch_open and ublk_ch_release ublk: remove the empty open and release block device operations ublk: remove UBLK_IO_F_PREFLUSH ublk: add a MAINTAINERS entry block: don't allow the same type rq_qos add more than once mmc: fix disk/queue leak in case of adding disk failure ublk_drv: fix an IS_ERR() vs NULL check ublk: remove UBLK_IO_F_INTEGRITY ublk_drv: remove unneeded semicolon ...	2022-08-02 13:46:35 -07:00
Matthew Wilcox (Oracle)	e7a60a1787	btrfs: Convert btrfs_migratepage to migrate_folio Use filemap_migrate_folio() to do the bulk of the work, and then copy the ordered flag across if needed. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: David Sterba <dsterba@suse.com>	2022-08-02 12:34:04 -04:00
Matthew Wilcox (Oracle)	541846502f	mm/migrate: Convert migrate_page() to migrate_folio() Convert all callers to pass a folio. Most have the folio already available. Switch all users from aops->migratepage to aops->migrate_folio. Also turn the documentation into kerneldoc. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: David Sterba <dsterba@suse.com>	2022-08-02 12:34:04 -04:00
Matthew Wilcox (Oracle)	8958b55142	btrfs: Convert btree_migratepage to migrate_folio Use a folio throughout this function. migrate_page() will be converted later. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: David Sterba <dsterba@suse.com>	2022-08-02 12:34:03 -04:00
Josef Bacik	74944c8736	btrfs: reset RO counter on block group if we fail to relocate With the automatic block group reclaim code we will preemptively try to mark the block group RO before we start the relocation. We do this to make sure we should actually try to relocate the block group. However if we hit an error during the actual relocation we won't clean up our RO counter and the block group will remain RO. This was observed internally with file systems reporting less space available from df when we had failed background relocations. Fix this by doing the dec_ro in the error case. Fixes: `18bb8bbf13` ("btrfs: zoned: automatically reclaim zones") CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-27 21:23:16 +02:00
Christoph Hellwig	0b078d9db8	btrfs: don't call btrfs_page_set_checked in finish_compressed_bio_read This flag was used to communicate that the low-level compression code already did verify the checksum to the high-level I/O completion code. But it has been unused for a long time as the upper btrfs_bio for the decompressed data had a NULL csum pointer basically since that pointer existed and the code already checks for that a little later. Note that this does not affect the other use of the checked flag, which is only used for the COW fixup worker. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 19:56:16 +02:00
Christoph Hellwig	81bd9328ab	btrfs: fix repair of compressed extents Currently the checksum of compressed extents is verified based on the compressed data and the lower btrfs_bio, but the actual repair process is driven by end_bio_extent_readpage on the upper btrfs_bio for the decompressed data. This has a bunch of issues, including not being able to properly communicate the failed mirror up in case the I/O submission got preempted, a general loss of information about whether an error was an I/O error or a checksum verification failure, but most importantly that this design causes btrfs_clean_io_failure to eventually write back the uncompressed good data onto the disk sectors that are supposed to contain compressed data. Fix this by moving the repair to the lower btrfs_bio. To do so, a fair amount of code has to be reshuffled: a) the lower btrfs_bio now needs a valid csum pointer. The easiest way to achieve that is to pass NULL to btrfs_lookup_bio_sums and just use the btrfs_bio management of csums. For a compressed_bio that is split into multiple btrfs_bios this means additional memory allocations, but the code becomes a lot more regular. b) checksum verification now runs directly on the lower btrfs_bio instead of the compressed_bio. This actually nicely simplifies the end I/O processing. c) btrfs_repair_one_sector can't just look up the logical address for the file offset any more, as there are no corresponding relative offsets that apply to the file offset and the logical address for compressed extents. Instead require that the saved bvec_iter in the btrfs_bio is filled out for all read bios and use that, which again removes a fair amount of code. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 19:56:16 +02:00
Christoph Hellwig	7959bd4411	btrfs: remove the start argument to check_data_csum and export Derive the value of start from the btrfs_bio now that ->file_offset is always valid. Also export and rename the function so it's available outside of inode.c as we'll need that soon. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 19:55:32 +02:00
Christoph Hellwig	7aa51232e2	btrfs: pass a btrfs_bio to btrfs_repair_one_sector Pass the btrfs_bio instead of the plain bio to btrfs_repair_one_sector, and remove the start and failed_mirror arguments in favor of deriving them from the btrfs_bio. For this to work ensure that the file_offset field is also initialized for buffered I/O. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 19:55:19 +02:00
Christoph Hellwig	524bcd1e17	btrfs: simplify the pending I/O counting in struct compressed_bio Instead of counting the sectors just count the bios, with an extra reference held during submission. This significantly simplifies the submission side error handling. This slightly changes completion and error handling of btrfs_submit_compressed_{read,write} because with the old code the compressed_bio could have been completed in submit_compressed_{read,write} only if there was an error during submission for one of the lower bios, whilst with the new code there is a chance for this to happen even for successful submission if all the lower bios complete before the end of the function is reached. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 19:54:47 +02:00
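The counting scheme described in 524bcd1e17 can be illustrated with a toy model. This is only a sketch in Python, not the kernel code: the class `CompressedBio`, its methods, and the `submit()` helper are hypothetical names chosen for illustration. It shows why holding one extra reference during submission keeps the parent from completing before submission is done, even when every lower bio finishes early.

```python
class CompressedBio:
    """Toy stand-in for a parent I/O that counts pending lower bios."""

    def __init__(self):
        # Hypothetical field: starts at 1 for the submitter's own reference.
        self.pending_ios = 1
        self.completed = False
        self.status = 0

    def add_bio(self):
        # One reference per lower bio, taken before it is submitted.
        self.pending_ios += 1

    def end_io(self, error=0):
        # Called when a lower bio finishes; remember the first error seen.
        if error and not self.status:
            self.status = error
        self._put()

    def finish_submission(self):
        # Drop the extra reference held while submitting.
        self._put()

    def _put(self):
        self.pending_ios -= 1
        if self.pending_ios == 0:
            self.completed = True


def submit(cb, n_bios, complete_early=True):
    """Submit n_bios lower bios; optionally complete each one immediately."""
    for _ in range(n_bios):
        cb.add_bio()
        if complete_early:
            cb.end_io()
    # The submitter's reference guarantees the parent is still alive here.
    assert not cb.completed
    cb.finish_submission()
    return cb.completed
```

With `complete_early=True` the parent completes inside `submit()` itself, which matches the behaviour change the commit message points out for successful submissions.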
Christoph Hellwig	c144c63fd3	btrfs: repair all known bad mirrors When there is more than a single level of redundancy there can also be multiple bad mirrors, and the current read repair code only repairs the last bad one. Restructure btrfs_repair_one_sector so that it records the originally failed mirror and the number of copies, and then repair all known bad copies until we reach the originally failed copy in clean_io_failure. Note that this also means the read repair reads will always start from the next bad mirror and not mirror 0. This fixes btrfs/265 in xfstests. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:43 +02:00
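The mirror walk described in c144c63fd3 can be sketched as follows. This is a simplified Python model, not the kernel implementation: mirrors are numbered from 1, a copy counts as good when it equals `good_data` (a stand-in for checksum verification), and the helper names are assumptions made for illustration.

```python
def next_mirror(cur, num_copies):
    """Next mirror number, wrapping from num_copies back to 1."""
    return 1 if cur == num_copies else cur + 1


def prev_mirror(cur, num_copies):
    """Previous mirror number, wrapping from 1 back to num_copies."""
    return num_copies if cur == 1 else cur - 1


def read_with_repair(copies, failed_mirror, good_data):
    """Find a good copy after failed_mirror, then repair every bad mirror
    walking backwards until the originally failed one is reached.

    copies maps mirror number (1-based) to its on-disk data and is updated
    in place. Returns the repaired mirrors in repair order, or None when no
    good copy exists.
    """
    num_copies = len(copies)
    mirror = next_mirror(failed_mirror, num_copies)
    while mirror != failed_mirror:
        if copies[mirror] == good_data:
            break
        mirror = next_mirror(mirror, num_copies)
    else:
        # Wrapped around to the original failure: all copies are bad.
        return None

    repaired = []
    bad = mirror
    while True:
        bad = prev_mirror(bad, num_copies)
        copies[bad] = good_data
        repaired.append(bad)
        if bad == failed_mirror:
            break
    return repaired
```

For three copies where mirrors 1 and 2 are bad and the read first failed on mirror 1, the sketch repairs mirror 2 and then mirror 1, rather than only the last bad one.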
Christoph Hellwig	d28beb3e81	btrfs: merge btrfs_dev_stat_print_on_error with its only caller Fold it into the only caller. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:42 +02:00
Filipe Manana	723df2bcc9	btrfs: join running log transaction when logging new name When logging a new name, in case of a rename, we pin the log before changing it. We then either delete a directory entry from the log or insert a key range item to mark the old name for deletion on log replay. However when doing one of those log changes we may have another task that started writing out the log (at btrfs_sync_log()) and it started before we pinned the log root. So we may end up changing a log tree while its writeback is being started by another task syncing the log. This can lead to inconsistencies in a log tree and other unexpected results during log replay, because we can get some committed node pointing to a node/leaf that ends up not getting written to disk before the next log commit. The problem, conceptually, started to happen in commit `88d2beec7e` ("btrfs: avoid logging all directory changes during renames"), because there we started to update the log without joining its current transaction first. However the problem only became visible with commit `259c4b96d7` ("btrfs: stop doing unnecessary log updates during a rename"), and that is because we used to pin the log at btrfs_rename() and then before entering btrfs_log_new_name(), when unlinking the old dentry, we ended up at btrfs_del_inode_ref_in_log() and btrfs_del_dir_entries_in_log(). Both of them join the current log transaction, effectively waiting for any log transaction writeout (due to acquiring the root's log_mutex). This made it safe even after leaving the current log transaction, because we remained with the log pinned when we called btrfs_log_new_name(). 
Then in commit `259c4b96d7` ("btrfs: stop doing unnecessary log updates during a rename"), we removed the log pinning from btrfs_rename() and stopped calling btrfs_del_inode_ref_in_log() and btrfs_del_dir_entries_in_log() during the rename, and started to do all the needed work at btrfs_log_new_name(), but without joining the current log transaction, only pinning the log, which is racy because another task may have started writeout of the log tree right before we pinned the log. Both commits landed in kernel 5.18, so it doesn't make any practical difference which should be blamed, but I'm blaming the second commit only because with the first one, by chance, the problem did not happen due to the fact we joined the log transaction after pinning the log and unpinned it only after calling btrfs_log_new_name(). So make btrfs_log_new_name() join the current log transaction instead of pinning it, so that we never do log updates if its writeout is starting. Fixes: `259c4b96d7` ("btrfs: stop doing unnecessary log updates during a rename") CC: stable@vger.kernel.org # 5.18+ Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Tested-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:42 +02:00
Nikolay Borisov	fc8b235fdc	btrfs: simplify error handling in btrfs_lookup_dentry In btrfs_lookup_dentry releasing the reference of the sub_root and the running orphan cleanup should only happen if the dentry found actually represents a subvolume. This can only be true in the 'else' branch as otherwise either fixup_tree_root_location returned an ENOENT error, in which case sub_root wouldn't have been changed or if we got a different errno this means btrfs_get_fs_root couldn't have executed successfully, again meaning sub_root will be equal to root. So simplify all the branches by moving the code into the 'else'. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:42 +02:00
Filipe Manana	0d8869fb6b	btrfs: send: always use the rbtree based inode ref management infrastructure After the patch "btrfs: send: fix sending link commands for existing file paths", we now have two infrastructures to detect and eliminate duplicated inode references (due to names that got removed and re-added between the send and parent snapshots): 1) One that works on a single inode ref/extref item; 2) A new one that works across all ref/extref items for an inode, and it's also more efficient because even in the single ref/extref item case, it does not do a linear search for all the names encoded in the ref/extref item, it uses red black trees to speed up the search. There's no good reason to keep both infrastructures, we can use the new one everywhere, and it's always more efficient. So remove the old infrastructure and change all sites that are using it to use the new one. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:42 +02:00
BingJing Chang	3aa5bd367f	btrfs: send: fix sending link commands for existing file paths There is a bug sending link commands for existing file paths. When we're processing an inode, we go over all references. All the new file paths are added to the "new_refs" list. And all the deleted file paths are added to the "deleted_refs" list. In the end, when we finish processing the inode, we iterate over all the items in the "new_refs" list and send link commands for those file paths. After that, we go over all the items in the "deleted_refs" list and send unlink commands for them. If there are duplicated file paths in both lists, we will try to create them before we remove them. Then the receiver gets an -EEXIST error when trying the link operations. Example for having duplicated file paths in both list: $ btrfs subvolume create vol # create a file and 2000 hard links to the same inode $ touch vol/foo $ for i in {1..2000}; do link vol/foo vol/$i ; done # take a snapshot for a parent snapshot $ btrfs subvolume snapshot -r vol snap1 # remove 2000 hard links and re-create the last 1000 links $ for i in {1..2000}; do rm vol/$i; done; $ for i in {1001..2000}; do link vol/foo vol/$i; done # take another one for a send snapshot $ btrfs subvolume snapshot -r vol snap2 $ mkdir receive_dir $ btrfs send snap2 -p snap1 \| btrfs receive receive_dir/ At subvol snap2 link 1238 -> foo ERROR: link 1238 -> foo failed: File exists In this case, we will have the same file paths added to both lists. In the parent snapshot, reference paths {1..1237} are stored in inode references, but reference paths {1238..2000} are stored in inode extended references. In the send snapshot, all reference paths {1001..2000} are stored in inode references. During the incremental send, we process their inode references first. In record_changed_ref(), we iterate all its inode references in the send/parent snapshot. 
For every inode reference, we also use find_iref() to check whether the same file path also appears in the parent/send snapshot or not. Inode references {1238..2000} which appear in the send snapshot but not in the parent snapshot are added to the "new_refs" list. On the other hand, inode references {1..1000} which appear in the parent snapshot but not in the send snapshot are added to the "deleted_refs" list. Next, when we process their inode extended references, reference paths {1238..2000} are added to the "deleted_refs" list because all of them only appear in the parent snapshot. Now the two lists contain items as below: "new_refs" list: {1238..2000} "deleted_refs" list: {1..1000}, {1238..2000} Reference paths {1238..2000} appear in both lists. And given the processing order described above, the receiver gets an -EEXIST error when trying the link operations. To fix the bug, the idea is to process the "deleted_refs" list before the "new_refs" list. However, it's not easy to reshuffle the processing order. For one thing, if we do so, we may unlink all the existing paths first, leaving no valid path for the links. And it's inefficient because we do a bunch of unlinks followed by links for the same paths. Moreover, it makes little sense to have duplicates in both lists. A reference path cannot be both new and already seen in the past, or we would not call it a new path. However, it's also not a good idea to make find_iref() check a reference against all inode references and all inode extended references because it may result in large disk reads. So we introduce two rbtrees to make the references easier for lookups. And we also introduce record_new_ref_if_needed() and record_deleted_ref_if_needed() for changed_ref() to check and remove duplicated references early. 
Reviewed-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: BingJing Chang <bingjingc@synology.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:42 +02:00
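The duplicate elimination described in 3aa5bd367f can be replayed on the example from the commit message. The sketch below is a Python model under stated assumptions: plain sets stand in for the two rbtrees, a reference is reduced to its file name, and the function bodies are illustrative guesses at the idea (a new reference cancels a matching deleted one and vice versa), not the kernel code. Only the two function names come from the commit message.

```python
def record_new_ref_if_needed(ref, new_refs, deleted_refs):
    """Record ref as new unless it cancels out a pending deletion."""
    if ref in deleted_refs:
        deleted_refs.remove(ref)
    else:
        new_refs.add(ref)


def record_deleted_ref_if_needed(ref, new_refs, deleted_refs):
    """Record ref as deleted unless it cancels out a pending new ref."""
    if ref in new_refs:
        new_refs.remove(ref)
    else:
        deleted_refs.add(ref)


def simulate():
    """Replay the 2000-hard-link scenario from the commit message."""
    # Parent snapshot: names 1..1237 in inode refs, 1238..2000 in extrefs.
    parent_iref = {str(i) for i in range(1, 1238)}
    parent_extref = {str(i) for i in range(1238, 2001)}
    # Send snapshot: names 1001..2000, all stored in inode refs.
    send_iref = {str(i) for i in range(1001, 2001)}

    new_refs, deleted_refs = set(), set()
    # Pass 1: inode ref items are compared against each other only.
    for name in send_iref - parent_iref:
        record_new_ref_if_needed(name, new_refs, deleted_refs)
    for name in parent_iref - send_iref:
        record_deleted_ref_if_needed(name, new_refs, deleted_refs)
    # Pass 2: extended refs exist only in the parent snapshot.
    for name in parent_extref:
        record_deleted_ref_if_needed(name, new_refs, deleted_refs)
    return new_refs, deleted_refs
```

In this model the paths {1238..2000} are first recorded as new and then cancelled when the extended references are processed, so no link command is issued for a path that already exists on the receiver.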
BingJing Chang	71ecfc133b	btrfs: send: introduce recorded_ref_alloc and recorded_ref_free Introduce wrappers to allocate and free recorded_ref structures. Reviewed-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: BingJing Chang <bingjingc@synology.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:42 +02:00
Naohiro Aota	2ce543f478	btrfs: zoned: wait until zone is finished when allocation didn't progress When the allocated position doesn't progress, we cannot submit IOs to finish a block group, but there should be ongoing IOs that will finish a block group. So, in that case, we wait for a zone to be finished and retry the allocation after that. Introduce a new flag BTRFS_FS_NEED_ZONE_FINISH for fs_info->flags to indicate we need a zone finish to have proceeded. The flag is set when the allocator detected it cannot activate a new block group. And, it is cleared once a zone is finished. CC: stable@vger.kernel.org # 5.16+ Fixes: `afba2bc036` ("btrfs: zoned: implement active zone tracking") Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:42 +02:00
Naohiro Aota	898793d992	btrfs: zoned: write out partially allocated region cow_file_range() works in an all-or-nothing way: if it fails to allocate an extent for a part of the given region, it gives up the whole region including the successfully allocated parts. On cow_file_range(), run_delalloc_zoned() writes data for the region only when it successfully allocates the whole region. This all-or-nothing allocation and write-out are problematic when available space in all the block groups gets tight with the active zone restriction. btrfs_reserve_extent() tries hard to utilize the space left in the active block groups and finally gives up and fails with -ENOSPC. However, if we send IOs for the successfully allocated region, we can finish a zone and can continue on the rest of the allocation on a newly allocated block group. This patch implements the partial write-out for run_delalloc_zoned(). With this patch applied, cow_file_range() returns -EAGAIN to tell the caller to do something to progress the further allocation, and reports the successfully allocated region through done_offset. Furthermore, the zoned extent allocator returns -EAGAIN to tell cow_file_range() to go back to the caller. Actually, we still need to wait for an IO to complete to continue the allocation. The next patch implements that part. CC: stable@vger.kernel.org # 5.16+ Fixes: `afba2bc036` ("btrfs: zoned: implement active zone tracking") Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:42 +02:00
Naohiro Aota	b6a98021e4	btrfs: zoned: activate necessary block group There are two places where allocating a chunk is not enough. These two places are trying to ensure the space by allocating a chunk. To meet the condition for active_total_bytes, we also need to activate a block group there. CC: stable@vger.kernel.org # 5.16+ Fixes: `afba2bc036` ("btrfs: zoned: implement active zone tracking") Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:42 +02:00
Naohiro Aota	b093151391	btrfs: zoned: activate metadata block group on flush_space For metadata space on a zoned filesystem, reaching ALLOC_CHUNK{,_FORCE} means we don't have enough space left in the active_total_bytes. Before allocating a new chunk, we can try to activate an existing block group in this case. Also, allocating a chunk is not enough to grant a ticket for metadata space on a zoned filesystem; we need to activate the block group to increase the active_total_bytes. btrfs_zoned_activate_one_bg() implements the activation feature. It will activate a block group by (maybe) finishing a block group. It will give up activating a block group if it cannot finish any block group. CC: stable@vger.kernel.org # 5.16+ Fixes: `afba2bc036` ("btrfs: zoned: implement active zone tracking") Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:42 +02:00
Naohiro Aota	79417d040f	btrfs: zoned: disable metadata overcommit for zoned The metadata overcommit makes the space reservation flexible but it is also harmful to active zone tracking. Since we cannot finish a block group from the metadata allocation context, we might not activate a new block group and might not be able to actually write out the overcommit reservations. So, disable metadata overcommit for zoned filesystems. We will ensure the reservations are under active_total_bytes in the following patches. CC: stable@vger.kernel.org # 5.16+ Fixes: `afba2bc036` ("btrfs: zoned: implement active zone tracking") Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:42 +02:00
Naohiro Aota	6a921de589	btrfs: zoned: introduce space_info->active_total_bytes The active_total_bytes, like the total_bytes, accounts for the total bytes of active block groups in the space_info. With an introduction of active_total_bytes, we can check if the reserved bytes can be written to the block groups without activating a new block group. The check is necessary for metadata allocation on zoned filesystem. We cannot finish a block group, which may require waiting for the current transaction, from the metadata allocation context. Instead, we need to ensure the ongoing allocation (reserved bytes) fits in active block groups. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:42 +02:00
Naohiro Aota	393f646e34	btrfs: zoned: finish least available block group on data bg allocation When we run out of active zones and no block group has sufficient space left, we need to finish one block group to make room to activate a new block group. However, we cannot do this for metadata block groups because we can cause a deadlock by waiting for a running transaction commit. So, do that only for a data block group. Furthermore, the block group to be finished has two requirements. First, the block group must not have reserved bytes left. Having reserved bytes means we have an allocated region but did not yet send bios for it. If that region is allocated by the thread calling btrfs_zone_finish(), it results in a deadlock. Second, the block group to be finished must not be a SYSTEM block group. Finishing a SYSTEM block group easily breaks further chunk allocation by nullifying the SYSTEM free space. In certain cases, we cannot find any zone finish candidate or btrfs_zone_finish() may fail. In that case, we fall back to splitting the allocation bytes and filling the last spaces left in the block groups. CC: stable@vger.kernel.org # 5.16+ Fixes: `afba2bc036` ("btrfs: zoned: implement active zone tracking") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:41 +02:00
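The selection rules in 393f646e34 can be sketched as a small filter plus a minimum. This Python model is an interpretation of the commit message, not the kernel code: the `BlockGroup` fields, the `kind` strings, the `for_data` flag, and the function name `pick_finish_candidate` are all hypothetical. It assumes "least available" means the smallest remaining writable space in the zone.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BlockGroup:
    """Hypothetical, simplified view of an active zoned block group."""
    name: str
    kind: str            # "data", "metadata" or "system"
    zone_capacity: int   # usable bytes in the zone
    alloc_offset: int    # bytes already allocated
    reserved: int = 0    # allocated but bios not yet sent

    @property
    def available(self) -> int:
        return self.zone_capacity - self.alloc_offset


def pick_finish_candidate(active_bgs: List[BlockGroup],
                          for_data: bool) -> Optional[BlockGroup]:
    """Pick the block group to finish, or None if finishing is not allowed
    or no block group qualifies."""
    # Finishing from a metadata allocation context could deadlock on a
    # running transaction commit, so only data allocations may do it.
    if not for_data:
        return None
    candidates = [
        bg for bg in active_bgs
        if bg.kind != "system" and bg.reserved == 0
    ]
    if not candidates:
        # The caller would fall back to splitting the allocation.
        return None
    return min(candidates, key=lambda bg: bg.available)
```

Returning `None` corresponds to the fallback path in the commit message, where the allocation is split to fill the space still left in the block groups.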
Naohiro Aota	bb9950d3df	btrfs: let can_allocate_chunk return error For the later patch, convert the return type from bool to int and return errors. No functional changes. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:41 +02:00
Naohiro Aota	d760156601	btrfs: use fs_info->max_extent_size in get_extent_max_capacity() Use fs_info->max_extent_size also in get_extent_max_capacity() for completeness. This is only used for defrag and not really necessary to fix the metadata reservation size. But, it still suppresses unnecessary defrag operations. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:41 +02:00
Naohiro Aota	7d7672bc5d	btrfs: convert count_max_extents() to use fs_info->max_extent_size If count_max_extents() uses BTRFS_MAX_EXTENT_SIZE to calculate the number of extents needed, btrfs releases too much of the metadata reservation on its way to write out the data. Now that BTRFS_MAX_EXTENT_SIZE is replaced with fs_info->max_extent_size, convert count_max_extents() to use it instead, and fix the calculation of the metadata reservation. CC: stable@vger.kernel.org # 5.12+ Fixes: `d8e3fb106f` ("btrfs: zoned: use ZONE_APPEND write for zoned mode") Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:41 +02:00
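The effect of 7d7672bc5d on the extent estimate can be shown with a little arithmetic. The Python sketch below assumes that the estimate is a round-up division of the byte count by the maximum extent size, and that BTRFS_MAX_EXTENT_SIZE is 128 MiB; the 1 MiB zone append limit is a made-up value for illustration, since the real limit depends on the device.

```python
SZ_1M = 1024 * 1024
BTRFS_MAX_EXTENT_SIZE = 128 * SZ_1M


def count_max_extents(size, max_extent_size):
    """Upper bound on the number of extents needed for size bytes
    (round-up division)."""
    return (size + max_extent_size - 1) // max_extent_size


# 256 MiB of dirty data on a regular filesystem.
regular = count_max_extents(256 * SZ_1M, BTRFS_MAX_EXTENT_SIZE)

# The same data on a zoned filesystem whose max_extent_size follows a
# hypothetical 1 MiB max_zone_append_size.
zoned = count_max_extents(256 * SZ_1M, 1 * SZ_1M)

print(regular, zoned)
```

Under these assumptions the regular estimate is 2 extents while the zoned one is 256, which is why sizing the metadata reservation with the 128 MiB constant falls far short on a zoned filesystem.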
Naohiro Aota	f7b12a62f0	btrfs: replace BTRFS_MAX_EXTENT_SIZE with fs_info->max_extent_size On zoned filesystem, data write out is limited by max_zone_append_size, and a large ordered extent is split according the size of a bio. OTOH, the number of extents to be written is calculated using BTRFS_MAX_EXTENT_SIZE, and that estimated number is used to reserve the metadata bytes to update and/or create the metadata items. The metadata reservation is done at e.g, btrfs_buffered_write() and then released according to the estimation changes. Thus, if the number of extent increases massively, the reserved metadata can run out. The increase of the number of extents easily occurs on zoned filesystem if BTRFS_MAX_EXTENT_SIZE > max_zone_append_size. And, it causes the following warning on a small RAM environment with disabling metadata over-commit (in the following patch). [75721.498492] ------------[ cut here ]------------ [75721.505624] BTRFS: block rsv 1 returned -28 [75721.512230] WARNING: CPU: 24 PID: 2327559 at fs/btrfs/block-rsv.c:537 btrfs_use_block_rsv+0x560/0x760 [btrfs] [75721.581854] CPU: 24 PID: 2327559 Comm: kworker/u64:10 Kdump: loaded Tainted: G W 5.18.0-rc2-BTRFS-ZNS+ #109 [75721.597200] Hardware name: Supermicro Super Server/H12SSL-NT, BIOS 2.0 02/22/2021 [75721.607310] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs] [75721.616209] RIP: 0010:btrfs_use_block_rsv+0x560/0x760 [btrfs] [75721.646649] RSP: 0018:ffffc9000fbdf3e0 EFLAGS: 00010286 [75721.654126] RAX: 0000000000000000 RBX: 0000000000004000 RCX: 0000000000000000 [75721.663524] RDX: 0000000000000004 RSI: 0000000000000008 RDI: fffff52001f7be6e [75721.672921] RBP: ffffc9000fbdf420 R08: 0000000000000001 R09: ffff889f8d1fc6c7 [75721.682493] R10: ffffed13f1a3f8d8 R11: 0000000000000001 R12: ffff88980a3c0e28 [75721.692284] R13: ffff889b66590000 R14: ffff88980a3c0e40 R15: ffff88980a3c0e8a [75721.701878] FS: 0000000000000000(0000) GS:ffff889f8d000000(0000) knlGS:0000000000000000 [75721.712601] CS: 0010 DS: 
0000 ES: 0000 CR0: 0000000080050033 [75721.720726] CR2: 000055d12e05c018 CR3: 0000800193594000 CR4: 0000000000350ee0 [75721.730499] Call Trace: [75721.735166] <TASK> [75721.739886] btrfs_alloc_tree_block+0x1e1/0x1100 [btrfs] [75721.747545] ? btrfs_alloc_logged_file_extent+0x550/0x550 [btrfs] [75721.756145] ? btrfs_get_32+0xea/0x2d0 [btrfs] [75721.762852] ? btrfs_get_32+0xea/0x2d0 [btrfs] [75721.769520] ? push_leaf_left+0x420/0x620 [btrfs] [75721.776431] ? memcpy+0x4e/0x60 [75721.781931] split_leaf+0x433/0x12d0 [btrfs] [75721.788392] ? btrfs_get_token_32+0x580/0x580 [btrfs] [75721.795636] ? push_for_double_split.isra.0+0x420/0x420 [btrfs] [75721.803759] ? leaf_space_used+0x15d/0x1a0 [btrfs] [75721.811156] btrfs_search_slot+0x1bc3/0x2790 [btrfs] [75721.818300] ? lock_downgrade+0x7c0/0x7c0 [75721.824411] ? free_extent_buffer.part.0+0x107/0x200 [btrfs] [75721.832456] ? split_leaf+0x12d0/0x12d0 [btrfs] [75721.839149] ? free_extent_buffer.part.0+0x14f/0x200 [btrfs] [75721.846945] ? free_extent_buffer+0x13/0x20 [btrfs] [75721.853960] ? btrfs_release_path+0x4b/0x190 [btrfs] [75721.861429] btrfs_csum_file_blocks+0x85c/0x1500 [btrfs] [75721.869313] ? rcu_read_lock_sched_held+0x16/0x80 [75721.876085] ? lock_release+0x552/0xf80 [75721.881957] ? btrfs_del_csums+0x8c0/0x8c0 [btrfs] [75721.888886] ? __kasan_check_write+0x14/0x20 [75721.895152] ? do_raw_read_unlock+0x44/0x80 [75721.901323] ? _raw_write_lock_irq+0x60/0x80 [75721.907983] ? btrfs_global_root+0xb9/0xe0 [btrfs] [75721.915166] ? btrfs_csum_root+0x12b/0x180 [btrfs] [75721.921918] ? btrfs_get_global_root+0x820/0x820 [btrfs] [75721.929166] ? _raw_write_unlock+0x23/0x40 [75721.935116] ? unpin_extent_cache+0x1e3/0x390 [btrfs] [75721.942041] btrfs_finish_ordered_io.isra.0+0xa0c/0x1dc0 [btrfs] [75721.949906] ? try_to_wake_up+0x30/0x14a0 [75721.955700] ? btrfs_unlink_subvol+0xda0/0xda0 [btrfs] [75721.962661] ? rcu_read_lock_sched_held+0x16/0x80 [75721.969111] ? 
lock_acquire+0x41b/0x4c0 [75721.974982] finish_ordered_fn+0x15/0x20 [btrfs] [75721.981639] btrfs_work_helper+0x1af/0xa80 [btrfs] [75721.988184] ? _raw_spin_unlock_irq+0x28/0x50 [75721.994643] process_one_work+0x815/0x1460 [75722.000444] ? pwq_dec_nr_in_flight+0x250/0x250 [75722.006643] ? do_raw_spin_trylock+0xbb/0x190 [75722.013086] worker_thread+0x59a/0xeb0 [75722.018511] kthread+0x2ac/0x360 [75722.023428] ? process_one_work+0x1460/0x1460 [75722.029431] ? kthread_complete_and_exit+0x30/0x30 [75722.036044] ret_from_fork+0x22/0x30 [75722.041255] </TASK> [75722.045047] irq event stamp: 0 [75722.049703] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [75722.057610] hardirqs last disabled at (0): [<ffffffff8118a94a>] copy_process+0x1c1a/0x66b0 [75722.067533] softirqs last enabled at (0): [<ffffffff8118a989>] copy_process+0x1c59/0x66b0 [75722.077423] softirqs last disabled at (0): [<0000000000000000>] 0x0 [75722.085335] ---[ end trace 0000000000000000 ]--- To fix the estimation, we need to introduce fs_info->max_extent_size to replace BTRFS_MAX_EXTENT_SIZE, which allow setting the different size for regular vs zoned filesystem. Set fs_info->max_extent_size to BTRFS_MAX_EXTENT_SIZE by default. On zoned filesystem, it is set to fs_info->max_zone_append_size. CC: stable@vger.kernel.org # 5.12+ Fixes: `d8e3fb106f` ("btrfs: zoned: use ZONE_APPEND write for zoned mode") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:41 +02:00
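The estimation problem described in the commit above can be sketched in userspace C. This is a minimal illustration, not the kernel code: the `sketch_` names are hypothetical, the 128 MiB value for BTRFS_MAX_EXTENT_SIZE is an assumption, and the zero-size guard is added only for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: BTRFS_MAX_EXTENT_SIZE is taken to be 128 MiB in this sketch. */
#define SKETCH_BTRFS_MAX_EXTENT_SIZE (128ULL * 1024 * 1024)

/* Hypothetical stand-in for the relevant fields of fs_info. */
struct sketch_fs_info {
	int is_zoned;
	uint64_t max_zone_append_size; /* bytes, meaningful only when zoned */
	uint64_t max_extent_size;      /* bytes, chosen at setup time */
};

/* Default to the global maximum; on zoned, follow max_zone_append_size. */
static void sketch_set_max_extent_size(struct sketch_fs_info *fs_info)
{
	/* The "> 0" check is a defensive guard for this sketch only. */
	if (fs_info->is_zoned && fs_info->max_zone_append_size > 0)
		fs_info->max_extent_size = fs_info->max_zone_append_size;
	else
		fs_info->max_extent_size = SKETCH_BTRFS_MAX_EXTENT_SIZE;
}

/* Number of extents needed to cover `size` bytes, rounded up. */
static uint64_t sketch_count_max_extents(const struct sketch_fs_info *fs_info,
					 uint64_t size)
{
	assert(fs_info->max_extent_size > 0);
	return (size + fs_info->max_extent_size - 1) / fs_info->max_extent_size;
}
```

With these assumptions, a 1 GiB range counts as 8 extents on a regular filesystem but 1024 extents on a zoned one with a (hypothetical) 1 MiB append limit, which is why a metadata reservation sized from the larger constant can run out.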
Naohiro Aota	c2ae7b772e	btrfs: zoned: revive max_zone_append_bytes This patch is basically a revert of commit `5a80d1c6a2` ("btrfs: zoned: remove max_zone_append_size logic"), but without unnecessary ASSERT and check. The max_zone_append_size will be used as a hint to estimate the number of extents to cover delalloc/writeback region in the later commits. The size of a ZONE APPEND bio is also limited by queue_max_segments(), so this commit considers it to calculate max_zone_append_size. Technically, a bio can be larger than queue_max_segments() * PAGE_SIZE if the pages are contiguous. But, it is safe to consider "queue_max_segments() * PAGE_SIZE" as an upper limit of an extent size to calculate the number of extents needed to write data. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:41 +02:00
Filipe Manana	cf2404a976	btrfs: add optimized btrfs_ino() version for 64 bits systems Currently btrfs_ino() tries to use first the objectid of the inode's location key. This is to avoid truncation of the inode number on 32 bits platforms because the i_ino field of struct inode has the unsigned long type, while the objectid is a 64 bits unsigned type (u64) on every system. This logic was added in commit `33345d0152` ("Btrfs: Always use 64bit inode number"). However if we are running on a 64 bits system, we can always directly return the i_ino value from struct inode, which eliminates the need for the special if statement that tests for a location key type of BTRFS_ROOT_ITEM_KEY - in which case i_ino may not have the same value as the objectid in the inode's location key; it may have a value of BTRFS_EMPTY_SUBVOL_DIR_OBJECTID, for the case of snapshots of trees with subvolumes/snapshots inside them. So add a special version for 64 bits systems that directly returns i_ino of struct inode. This eliminates one branch and reduces the overall code size, since btrfs_ino() is an inline function that is extensively used. Before: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1617487 189240 29032 1835759 1c02ef fs/btrfs/btrfs.ko After: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1612028 189180 29032 1830240 1bed60 fs/btrfs/btrfs.ko Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:41 +02:00
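The two variants described in the commit above can be sketched with mock types. This is a userspace illustration only: the `sketch_` structs are hypothetical stand-ins for the kernel structures, the key type value is illustrative, and `ULONG_MAX` is used here in place of the kernel's own word-size check.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-in for the BTRFS_ROOT_ITEM_KEY key type. */
#define SKETCH_ROOT_ITEM_KEY 132

struct sketch_key {
	uint64_t objectid;
	uint8_t type;
};

/* Hypothetical mock: i_ino is unsigned long, as in struct inode. */
struct sketch_inode {
	unsigned long i_ino;
	struct sketch_key location;
};

#if ULONG_MAX == 0xffffffffUL
/* 32-bit: i_ino may be truncated, so prefer the 64-bit objectid. */
static uint64_t sketch_ino(const struct sketch_inode *inode)
{
	uint64_t ino;

	assert(inode != NULL);
	ino = inode->location.objectid;
	/* Special case: the location key points at a root item. */
	if (inode->location.type == SKETCH_ROOT_ITEM_KEY)
		ino = inode->i_ino;
	return ino;
}
#else
/* 64-bit: unsigned long holds the full number, so no branch is needed. */
static uint64_t sketch_ino(const struct sketch_inode *inode)
{
	assert(inode != NULL);
	return inode->i_ino;
}
#endif
```

Both variants return the same value whenever i_ino is not truncated; the 64-bit one simply drops the branch.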
Filipe Manana	adac558416	btrfs: set the objectid of the btree inode's location key We currently don't use the location key of the btree inode, its content is set to zeroes, as it's a special inode that is not persisted (it has no inode item stored in any btree). At btrfs_ino(), an inline function used extensively in btrfs, we have this special check if the given inode's location objectid is 0, and if it is, we return the value stored in the VFS' inode i_ino field instead (which is BTRFS_BTREE_INODE_OBJECTID for the btree inode). To reduce the code at btrfs_ino(), we can simply set the objectid of the btree inode to the value BTRFS_BTREE_INODE_OBJECTID. This eliminates the need to check for the special case of the objectid being zero, with the side effect of reducing the overall code size and having less code to execute, as btrfs_ino() is an inline function. Before: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1620502 189240 29032 1838774 1c0eb6 fs/btrfs/btrfs.ko After: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1617487 189240 29032 1835759 1c02ef fs/btrfs/btrfs.ko Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:41 +02:00
Fabio M. De Francesco	4cb2e5e8fe	btrfs: replace kmap_atomic() with kmap_local_page() kmap_atomic() is being deprecated in favor of kmap_local_page() where it is feasible. With kmap_local_page() mappings are per thread, CPU local, and not globally visible. The last use of kmap_atomic is in inode.c where the context is atomic [1] and can be safely replaced by kmap_local_page. Tested with xfstests on a QEMU + KVM 32-bits VM with 4GB RAM and booting a kernel with HIGHMEM64GB enabled. [1] https://lore.kernel.org/linux-btrfs/20220601132545.GM20633@twin.jikos.cz/ Suggested-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:41 +02:00
Fabio M. De Francesco	5a6e6e7ce6	btrfs: zlib: replace kmap() with kmap_local_page() in zlib_decompress_bio() The use of kmap() is being deprecated in favor of kmap_local_page(). With kmap_local_page(), the mapping is per thread, CPU local and not globally visible. Therefore, use kmap_local_page() / kunmap_local() in zlib_decompress_bio() because in this function the mappings are per thread and are not visible in other contexts. Tested with xfstests on QEMU + KVM 32-bits VM with 4GB of RAM and HIGHMEM64G enabled. This patch passes 26/26 tests of group "compress". Suggested-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:41 +02:00
Fabio M. De Francesco	718e58551a	btrfs: zlib: replace kmap() with kmap_local_page() in zlib_compress_pages() The use of kmap() is being deprecated in favor of kmap_local_page(). With kmap_local_page(), the mapping is per thread, CPU local and not globally visible. Therefore, use kmap_local_page() / kunmap_local() in zlib_compress_pages() because in this function the mappings are per thread and are not visible in other contexts. Furthermore, drop the mappings of "out_page" which is allocated within zlib_compress_pages() with alloc_page(GFP_NOFS) and use page_address(). Tested with xfstests on a QEMU + KVM 32-bits VM with 4GB of RAM booting a kernel with HIGHMEM64G enabled. This patch passes 26/26 tests of group "compress". CC: Qu Wenruo <wqu@suse.com> Suggested-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:40 +02:00
Fabio M. De Francesco	ebd23482ef	btrfs: zstd: replace kmap() with kmap_local_page() The use of kmap() is being deprecated in favor of kmap_local_page(). With kmap_local_page(), the mapping is per thread, CPU local and not globally visible. Therefore, use kmap_local_page() / kunmap_local() in zstd.c because in this file the mappings are per thread and are not visible in other contexts. In the meanwhile use plain page_address() on output pages allocated with the GFP_NOFS flag instead of calling kmap*() on them (since they are always allocated from ZONE_NORMAL). Tested with xfstests on QEMU + KVM 32 bits VM with 4GB of RAM, booting a kernel with HIGHMEM64G enabled. Suggested-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:40 +02:00
Filipe Manana	ac5e666951	btrfs: don't fallback to buffered IO for NOWAIT direct IO writes Currently, for a direct IO write, if we need to fallback to buffered IO, either to satisfy the whole write operation or just a part of it, we do it in the current context even if it's a NOWAIT context. This is not ideal because we currently don't have support for NOWAIT semantics in the buffered IO path (we can block for several reasons), so we should instead return -EAGAIN to the caller, so that it knows it should retry (the whole operation or what's left of it) in a context where blocking is acceptable. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:40 +02:00
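The decision described in the commit above can be reduced to a small sketch. It is a simplified model under stated assumptions: `sketch_direct_write` and its parameters are hypothetical, the buffered path is modeled as always completing the remainder, and any needed fallback in a NOWAIT context returns -EAGAIN.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * len:         bytes the caller asked to write
 * direct_done: bytes the direct IO path managed to write
 * nowait:      true when the caller must not block
 *
 * Returns the number of bytes written, or -EAGAIN when a buffered
 * fallback would be required but blocking is not allowed.
 */
static long sketch_direct_write(size_t len, size_t direct_done, bool nowait)
{
	assert(direct_done <= len);

	/* Direct IO satisfied the whole write: no fallback needed. */
	if (direct_done == len)
		return (long)len;

	/* The buffered path can block, so refuse it in NOWAIT context. */
	if (nowait)
		return -EAGAIN;

	/* Blocking is fine: the buffered path finishes the rest (modeled). */
	return (long)len;
}
```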
David Sterba	8bfc9b2cf4	btrfs: use enum for btrfs_block_rsv::type The number of block group reserve types BTRFS_BLOCK_RSV_* is small and fits to u8 and there's enough left in case we want to add more. For type safety use the enum but make it 8 bits in the structure to save space. The structure size is now 48 on release build, making a slight improvement in structures where it's embedded, like btrfs_fs_info or btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:40 +02:00
David Sterba	710d5921d1	btrfs: switch btrfs_block_rsv::failfast to bool Use a simple bool type for the block reserve failfast status. The field is a short to save space, where there used to be an int, but there's no reason for that. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:40 +02:00
David Sterba	c70c2c5bc9	btrfs: switch btrfs_block_rsv::full to bool Use a simple bool type for the block reserve full status. The field is a short to save space, where there used to be an int, but there's no reason for that. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:40 +02:00
Christoph Hellwig	37899117e8	btrfs: do not return errors from btrfs_submit_dio_bio Always consume the bio and call the end_io handler on error instead of returning an error and letting the caller handle it. This matches what the block layer submission and the other btrfs bio submission handlers do and avoids any confusion on who needs to handle errors. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:40 +02:00
Christoph Hellwig	ea1f0cedef	btrfs: handle allocation failure in btrfs_wq_submit_bio gracefully btrfs_wq_submit_bio is used for writeback under memory pressure. Instead of failing the I/O when we can't allocate the async_submit_bio, just punt back to the synchronous submission path. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:40 +02:00
Christoph Hellwig	82443fd55c	btrfs: simplify sync/async submission in btrfs_submit_data_write_bio btrfs_submit_data_write_bio special cases the reloc root because the checksums are preloaded, but only does so for the !sync case. The sync case can't happen for data relocation, but just handling it more generally significantly simplifies the logic. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:40 +02:00
Christoph Hellwig	b9af128d1e	btrfs: raid56: transfer the bio counter reference to the raid submission helpers Transfer the bio counter reference acquired by btrfs_submit_bio to raid56_parity_write and raid56_parity_recovery together with the bio that the reference was acquired for instead of acquiring another reference in those helpers and dropping the original one in btrfs_submit_bio. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:40 +02:00
Christoph Hellwig	6065fd95da	btrfs: do not return errors from raid56_parity_recover Always consume the bio and call the end_io handler on error instead of returning an error and letting the caller handle it. This matches what the block layer submission does and avoids any confusion on who needs to handle errors. Also use the proper bool type for the generic_io argument. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:39 +02:00
Christoph Hellwig	31683f4aae	btrfs: do not return errors from raid56_parity_write Always consume the bio and call the end_io handler on error instead of returning an error and letting the caller handle it. This matches what the block layer submission does and avoids any confusion on who needs to handle errors. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:39 +02:00
Christoph Hellwig	1a722d8f5b	btrfs: do not return errors from btrfs_map_bio Always consume the bio and call the end_io handler on error instead of returning an error and letting the caller handle it. This matches what the block layer submission does and avoids any confusion on who needs to handle errors. As this requires touching all the callers, rename the function to btrfs_submit_bio, which describes the functionality much better. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:39 +02:00
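The error-handling convention shared by the "do not return errors" commits above can be sketched as follows. This is a userspace mock, not the btrfs or block layer API: `sketch_bio`, its fields, and the injected error are all hypothetical, and completion is modeled synchronously.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical mock of a bio: a status plus a completion callback. */
struct sketch_bio {
	int status;                             /* 0 or a negative errno */
	int inject_error;                       /* simulates a failing setup step */
	void (*end_io)(struct sketch_bio *bio); /* completion handler */
	void *private_data;
};

/* Example completion handler: counts how often it was called. */
static void sketch_count_end_io(struct sketch_bio *bio)
{
	int *calls = bio->private_data;

	(*calls)++;
}

/*
 * Returns void on purpose: the submitter always consumes the bio, and
 * errors reach the caller only through end_io and bio->status.
 */
static void sketch_submit_bio(struct sketch_bio *bio)
{
	assert(bio != NULL && bio->end_io != NULL);

	if (bio->inject_error) {
		/* Error path: complete the bio here instead of returning. */
		bio->status = bio->inject_error;
		bio->end_io(bio);
		return;
	}

	/* Success path, modeled as an immediate completion. */
	bio->status = 0;
	bio->end_io(bio);
}
```

Because the function cannot return an error, there is exactly one place where failures are observed (the completion handler), so a caller cannot accidentally complete or free the bio a second time.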
Qu Wenruo	462b0b2a86	btrfs: return proper mapped length for RAID56 profiles in __btrfs_map_block() For profiles other than RAID56, __btrfs_map_block() returns @map_length as min(stripe_end, logical + *length), which is also the same result from btrfs_get_io_geometry(). But for RAID56, __btrfs_map_block() returns @map_length as stripe_len. This strange behavior is going to hurt incoming bio split at btrfs_map_bio() time, as we will use @map_length as bio split size. Fix this behavior by returning @map_length by the same calculation as for other profiles. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:39 +02:00
Christoph Hellwig	ff18a4afeb	btrfs: raid56: use fixed stripe length everywhere The raid56 code assumes a fixed stripe length BTRFS_STRIPE_LEN but there are functions passing it as arguments, this is not necessary. The fixed value has been used for a long time and though the stripe length should be configurable by super block member stripesize, this hasn't been implemented and would require more changes so we don't need to keep this code around until then. Partially based on a patch from Qu Wenruo. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> [ update changelog ] Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:39 +02:00
Filipe Manana	0201fceb9f	btrfs: remove the inode cache check at btrfs_is_free_space_inode() The inode cache feature was removed in kernel 5.11, and we no longer have any code that reads from or writes to inode caches. We may still mount a filesystem that has inode caches, but they are ignored. Remove the check for an inode cache from btrfs_is_free_space_inode(), since we no longer have code to trigger reads from an inode cache or writes to an inode cache. The check at send.c is still needed, because in case we find a filesystem with an inode cache, we must ignore it. Also leave the checks at tree-checker.c, as they are sanity checks. This eliminates a dead branch and reduces the amount of code since it's in an inline function. Before: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1620662 189240 29032 1838934 1c0f56 fs/btrfs/btrfs.ko After: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1620502 189240 29032 1838774 1c0eb6 fs/btrfs/btrfs.ko Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:39 +02:00
Nikolay Borisov	74860816e8	btrfs: sysfs: remove BIG_METADATA feature files This flag has been merged in 3.10 and is effectively always-on. Its status depends on the host page size so there's another way to guarantee compatibility with old kernels. Due to a bug introduced in `6f93e834fa` ("btrfs: fix upper limit for max_inline for page size 64K") the flag is not persisted among features in the superblock so it's not reliable. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:39 +02:00
Nikolay Borisov	0766837b0d	btrfs: sysfs: remove MIXED_BACKREF feature file This feature has been the default for about 13 years. At this point it's safe to consider it an indispensable feature of BTRFS; as such, there's no need to advertise it in sysfs. Remove the global sysfs feature file, the per-filesystem feature file has never been there. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:39 +02:00
Nikolay Borisov	49f468c938	btrfs: don't print 'has skinny extents' anymore on mount Skinny extents have been a default mkfs feature since version 3.18 (introduced in btrfs-progs commit 6715de04d9a7 ("btrfs-progs: mkfs: make skinny-metadata default")). The message really doesn't bring any value to users, so simply remove it. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:39 +02:00
Nikolay Borisov	6b769dac21	btrfs: don't print 'flagging with big metadata' anymore on mount Added in commit `727011e07c` ("Btrfs: allow metadata blocks larger than the page size") in 2010 and it's been default for mkfs since 3.12 (2013). The message doesn't really convey any useful information to users. Remove it. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:39 +02:00
David Sterba	c1867eb33e	btrfs: clean up chained assignments The chained assignments may be convenient to write, but make readability a bit worse as it's too easy to overlook that there are several values set on the same line while this is rather an exception. Making it consistent everywhere avoids surprises. The pattern where inode times are initialized reuses the first value and the order is mtime, ctime. In other blocks the assignments are expanded so the order of variables is similar to the neighboring code. Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:39 +02:00
David Sterba	ac0677348f	btrfs: merge calculations for simple striped profiles in btrfs_rmap_block Use the same expression for stripe_nr for RAID0 (map->sub_stripes is 1) and RAID10 (map->sub_stripes is 2), with equivalent results. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:38 +02:00
David Sterba	d09cb9e188	btrfs: use mask for all RAID1* profiles in btrfs_calc_avail_data_space There's a sequence of hard coded values for RAID1 profiles that are already stored in the raid_attr table that should be used instead. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:38 +02:00
Nikolay Borisov	e26b04c4c9	btrfs: properly flag filesystem with BTRFS_FEATURE_INCOMPAT_BIG_METADATA Commit `6f93e834fa` seemingly inadvertently moved the code responsible for flagging the filesystem as having BIG_METADATA to a place where setting the flag was essentially lost. This means that filesystems created with kernels containing this bug (starting with 5.15) can potentially be mounted by older (pre-3.4) kernels. In reality the chances of this happening are low because there are other incompat flags introduced in the meantime. Still, the correct behavior is to set the INCOMPAT_BIG_METADATA flag and persist this in the superblock. Fixes: `6f93e834fa` ("btrfs: fix upper limit for max_inline for page size 64K") CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:38 +02:00
David Sterba	c8a5f8ca9a	btrfs: print checksum type and implementation at mount time Per user request, print the checksum type and implementation at mount time among the messages. The checksum is user configurable and the actual crypto implementation is useful to see for performance reasons. The same information is also available after mount in /sys/fs/FSID/checksum file. Example: [25.323662] BTRFS info (device vdb): using sha256 (sha256-generic) checksum algorithm Link: https://github.com/kdave/btrfs-progs/issues/483 Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:38 +02:00
Josef Bacik	1314ca78b2	btrfs: reset block group chunk force if we have to wait If you try to force a chunk allocation, but you race with another chunk allocation, you will end up waiting on the chunk allocation that just occurred and then allocate another chunk. If you have many threads all doing this at once you can way over-allocate chunks. Fix this by resetting force to NO_FORCE, that way if we think we need to allocate we can, otherwise we don't force another chunk allocation if one is already happening. Reviewed-by: Filipe Manana <fdmanana@suse.com> CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:38 +02:00
David Sterba	4824735918	btrfs: send: add new command FILEATTR for file attributes There are file attributes inherited from previous ext2 SETFLAGS/GETFLAGS and later from XFLAGS interfaces, now commonly found under the 'fileattr' API. This corresponds to the individual inode bits and that's part of the on-disk format, so this is suitable for the protocol. The other interfaces contain a lot of cruft or bits that btrfs does not support yet. Currently the value is u64 and matches btrfs_inode_item. Not all the bits can be set by ioctls (like NODATASUM or READONLY), but we can send them over the protocol and leave it up to the receiving side what and how to apply. As some of the flags, eg. IMMUTABLE, can prevent any further changes, the receiving side needs to understand that and apply the changes in the right order, or possibly with some intermediate steps. This should be easier, future proof and simpler on the protocol layer than implementing in kernel. Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:38 +02:00
David Sterba	22a5b2abb7	btrfs: send: add OTIME as utimes attribute for proto 2+ by default When send v1 was introduced the otime (inode creation time) was not available, however the attribute in btrfs send protocol exists. Though it would be possible to add it for v1 too as the attribute would be ignored by v1 receive, let's not change the layout of v1 and only add that to v2+. The otime cannot be changed and is only informative. Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:38 +02:00
Qu Wenruo	8f0ed7d4e7	btrfs: output mirror number for bad metadata When handling a real world transid mismatch image, it's hard to know which copy is corrupted, as the error messages just look like this: BTRFS warning (device dm-3): checksum verify failed on 30408704 wanted 0xcdcdcdcd found 0x3c0adc8e level 0 BTRFS warning (device dm-3): checksum verify failed on 30408704 wanted 0xcdcdcdcd found 0x3c0adc8e level 0 BTRFS warning (device dm-3): checksum verify failed on 30408704 wanted 0xcdcdcdcd found 0x3c0adc8e level 0 BTRFS warning (device dm-3): checksum verify failed on 30408704 wanted 0xcdcdcdcd found 0x3c0adc8e level 0 We don't even know if the retry is caused by btrfs or the VFS retry. To make things a little easier to read, add mirror number for all related tree block read errors. So the above messages would look like this: BTRFS warning (device dm-3): checksum verify failed on logical 30408704 mirror 1 wanted 0xcdcdcdcd found 0x3c0adc8e level 0 BTRFS warning (device dm-3): checksum verify failed on logical 30408704 mirror 2 wanted 0xcdcdcdcd found 0x3c0adc8e level 0 BTRFS warning (device dm-3): checksum verify failed on logical 30408704 mirror 1 wanted 0xcdcdcdcd found 0x3c0adc8e level 0 BTRFS warning (device dm-3): checksum verify failed on logical 30408704 mirror 2 wanted 0xcdcdcdcd found 0x3c0adc8e level 0 Signed-off-by: Qu Wenruo <wqu@suse.com> [ update messages, add "logical" ] Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:38 +02:00
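The new message layout from the commit above can be reproduced with a small formatter. This is a sketch only: `sketch_format_csum_warning` is a hypothetical helper using standard `snprintf`, not the kernel's logging function, and the format string is reconstructed from the example messages in the changelog.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Format a checksum failure warning that names both the logical address
 * and the mirror number, following the changelog's example output.
 * Returns what snprintf returns (length that would have been written).
 */
static int sketch_format_csum_warning(char *buf, size_t size, const char *dev,
				      uint64_t logical, int mirror,
				      uint32_t wanted, uint32_t found, int level)
{
	assert(buf != NULL && size > 0);
	return snprintf(buf, size,
			"BTRFS warning (device %s): checksum verify failed on "
			"logical %" PRIu64 " mirror %d wanted 0x%08" PRIx32
			" found 0x%08" PRIx32 " level %d",
			dev, logical, mirror, wanted, found, level);
}
```

With the mirror number in the message, two failures at the same logical address can be told apart: one line per copy rather than indistinguishable repeats.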
Naohiro Aota	aaafa1ebd6	btrfs: replace unnecessary goto with direct return at cow_file_range() The 'goto out' statements in the exit block of cow_file_range() are not necessary, as they only jump back. Replace them with return, while still keeping 'goto out' in the main code. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> [ keep goto in the main code, update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:38 +02:00
Naohiro Aota	71aa147b4d	btrfs: fix error handling of fallback uncompress write When cow_file_range() fails in the middle of the allocation loop, it unlocks the pages but leaves the ordered extents intact. Thus, we need to call btrfs_cleanup_ordered_extents() to finish the created ordered extents. Also, we need to call end_extent_writepage() if locked_page is available because btrfs_cleanup_ordered_extents() never processes the region on the locked_page. Furthermore, we need to set the mapping as error if locked_page is unavailable before unlocking the pages, so that the errno is properly propagated to the user space. CC: stable@vger.kernel.org # 5.18+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:38 +02:00
Naohiro Aota	99826e4cab	btrfs: extend btrfs_cleanup_ordered_extents for NULL locked_page btrfs_cleanup_ordered_extents() assumes locked_page to be non-NULL, so it is not usable for submit_uncompressed_range() which can have NULL locked_page. Add support for the locked_page == NULL case. Also, rewrite the redundant "page_offset(locked_page)". Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:38 +02:00
Naohiro Aota	9ce7466f37	btrfs: ensure pages are unlocked on cow_file_range() failure There is a hung_task report on zoned btrfs like below. https://github.com/naota/linux/issues/59 [726.328648] INFO: task rocksdb:high0:11085 blocked for more than 241 seconds. [726.329839] Not tainted 5.16.0-rc1+ #1 [726.330484] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [726.331603] task:rocksdb:high0 state:D stack: 0 pid:11085 ppid: 11082 flags:0x00000000 [726.331608] Call Trace: [726.331611] <TASK> [726.331614] __schedule+0x2e5/0x9d0 [726.331622] schedule+0x58/0xd0 [726.331626] io_schedule+0x3f/0x70 [726.331629] __folio_lock+0x125/0x200 [726.331634] ? find_get_entries+0x1bc/0x240 [726.331638] ? filemap_invalidate_unlock_two+0x40/0x40 [726.331642] truncate_inode_pages_range+0x5b2/0x770 [726.331649] truncate_inode_pages_final+0x44/0x50 [726.331653] btrfs_evict_inode+0x67/0x480 [726.331658] evict+0xd0/0x180 [726.331661] iput+0x13f/0x200 [726.331664] do_unlinkat+0x1c0/0x2b0 [726.331668] __x64_sys_unlink+0x23/0x30 [726.331670] do_syscall_64+0x3b/0xc0 [726.331674] entry_SYSCALL_64_after_hwframe+0x44/0xae [726.331677] RIP: 0033:0x7fb9490a171b [726.331681] RSP: 002b:00007fb943ffac68 EFLAGS: 00000246 ORIG_RAX: 0000000000000057 [726.331684] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb9490a171b [726.331686] RDX: 00007fb943ffb040 RSI: 000055a6bbe6ec20 RDI: 00007fb94400d300 [726.331687] RBP: 00007fb943ffad00 R08: 0000000000000000 R09: 0000000000000000 [726.331688] R10: 0000000000000031 R11: 0000000000000246 R12: 00007fb943ffb000 [726.331690] R13: 00007fb943ffb040 R14: 0000000000000000 R15: 00007fb943ffd260 [726.331693] </TASK> While we debug the issue, we found running fstests generic/551 on 5GB non-zoned null_blk device in the emulated zoned mode also had a similar hung issue. Also, we can reproduce the same symptom with an error injected cow_file_range() setup. The hang occurs when cow_file_range() fails in the middle of allocation. 
cow_file_range() called from do_allocation_zoned() can split the given region ([start, end]) for allocation depending on current block group usages. When btrfs can allocate bytes for one part of the split regions but fails for the other region (e.g. because of -ENOSPC), we return the error leaving the pages in the succeeded regions locked. Technically, this occurs only when @unlock == 0. Otherwise, we unlock the pages in an allocated region after creating an ordered extent. Considering the callers of cow_file_range(unlock=0) won't write out the pages, we can unlock the pages on error exit from cow_file_range(). So, we can ensure all the pages except @locked_page are unlocked in the error case. In summary, cow_file_range now behaves like this: - page_started == 1 (return value) - All the pages are unlocked. IO is started. - unlock == 1 - All the pages except @locked_page are unlocked in any case - unlock == 0 - On success, all the pages are locked for writing them out - On failure, all the pages except @locked_page are unlocked Fixes: `42c0110009` ("btrfs: zoned: introduce dedicated data write path for zoned filesystems") CC: stable@vger.kernel.org # 5.12+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:38 +02:00
Ioannis Angelakopoulos	140a8ff765	btrfs: sysfs: export commit stats Export commit stats in file /sys/fs/btrfs/UUID/commit_stats with example output like: commits 123 last_commit_ms 11 max_commit_ms 150 total_commit_ms 2000 The values are in one file so reading them at a single time will give a more consistent view. The stats are internally tracked in nanoseconds so the cumulative values should not suffer from rounding errors. Writing 0 to the file 'commit_stats' will reset max_commit_ms. Initial values are set at first mount of the filesystem. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com> [ update changelog ] Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:37 +02:00
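The remark in the commit above about tracking in nanoseconds can be illustrated with a toy model. This is a minimal sketch, not kernel code: the `CommitStats` class and its methods are hypothetical, the one-value-per-line output layout and the truncating division are assumptions, and only the four field names come from the commit message.

```python
NSEC_PER_MSEC = 1_000_000


class CommitStats:
    """Toy model of the counters; durations are kept in nanoseconds."""

    def __init__(self) -> None:
        self.commits = 0
        self.last_ns = 0
        self.max_ns = 0
        self.total_ns = 0

    def record(self, duration_ns: int) -> None:
        """Account for one finished commit."""
        self.commits += 1
        self.last_ns = duration_ns
        self.max_ns = max(self.max_ns, duration_ns)
        self.total_ns += duration_ns

    def reset_max(self) -> None:
        """Models writing 0 to the stats file, which resets the maximum."""
        self.max_ns = 0

    def show(self) -> str:
        """Convert to milliseconds only at output time (assumed truncation)."""
        return (
            f"commits {self.commits}\n"
            f"last_commit_ms {self.last_ns // NSEC_PER_MSEC}\n"
            f"max_commit_ms {self.max_ns // NSEC_PER_MSEC}\n"
            f"total_commit_ms {self.total_ns // NSEC_PER_MSEC}\n"
        )


durations_ns = [1_600_000, 1_600_000, 1_600_000]
stats = CommitStats()
for d in durations_ns:
    stats.record(d)

# Summing values that were already rounded to milliseconds drops the
# sub-millisecond part of every commit.
naive_total_ms = sum(d // NSEC_PER_MSEC for d in durations_ns)

print(stats.show(), end="")
print("naive_total_ms", naive_total_ms)
```

With three commits of 1.6 ms each, the nanosecond total converts to 4 ms, while adding up per-commit millisecond values would report 3 ms.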
Ioannis Angelakopoulos	e55958c8a0	btrfs: collect commit stats, count, duration Track several stats about transaction commit, to be later exported via sysfs: - number of commits so far - duration of the last commit in ns - maximum commit duration seen so far in ns - total duration for all commits so far in ns The update of the commit stats occurs after the commit thread has gone through all the logic that checks if there is another thread committing at the same time. This means that we only account for actual commit work in the commit stats we report and not the time the thread spends waiting until it is ready to do the commit work. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:37 +02:00
Christoph Hellwig	f3e90c1ca9	btrfs: remove extent writepage address space operation Same as in commit `21b4ee7029` ("xfs: drop ->writepage completely"): we can remove the callback as it's only used in one place - single page writeback from memory reclaim and is not called for cgroup writeback at all. We only allow such writeback from kswapd, not from direct memory reclaim, and so it is rarely used. When it comes from kswapd, it is effectively random dirty page shoot-down, which is horrible for IO patterns. We can rely on background writeback to clean all dirty pages in an efficient way and not let it be interrupted by kswapd. Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:37 +02:00
David Sterba	9555e1f188	btrfs: send: use boolean types for current inode status The new, new_gen and deleted indicate a status, use boolean type instead of int. Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:37 +02:00
David Sterba	cec3dad943	btrfs: send: remove old TODO regarding ERESTARTSYS The whole send operation is restartable and handling properly a buffer write may not be easy. We can't know what caused that and if a short delay and retry will fix it or how many retries should be performed in case it's a temporary condition. The error value is returned to the ioctl caller so in case it's transient problem, the user would be notified about the reason. Remove the TODO note as there's no plan to handle ERESTARTSYS. Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:37 +02:00
David Sterba	8234d3f658	btrfs: send: simplify includes We don't need the whole ctree.h in send.h, none of the data types defined there are used. Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:37 +02:00
David Sterba	e3b4b9040b	btrfs: send: drop __KERNEL__ ifdef from send.h We don't need this ifdef as the header file is not shared, the protocol definition used by userspace should be from libbtrfs or libbtrfsutil. Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:37 +02:00
Christoph Hellwig	ee5b46a353	btrfs: increase direct io read size limit to 256 sectors Btrfs currently limits direct I/O reads to a single sector, which goes back to commit `c329861da4` ("Btrfs: don't allocate a separate csums array for direct reads") from Josef. That commit changes the direct I/O code to ".. use the private part of the io_tree for our csums.", but ten years later that isn't how checksums for direct reads work, instead they use a csums allocation on a per-btrfs_dio_private basis (which has its own performance problem for small I/O, but that will be addressed later). There is no fundamental limit in btrfs itself to limit the I/O size except for the size of the checksum array that scales linearly with the number of sectors in an I/O. Pick a somewhat arbitrary limit of 256 sectors, which matches what buffered reads typically see as the upper limit, as the limit for direct I/O as well. This significantly improves direct read performance. For example a fio run doing 1 MiB aio reads with a queue depth of 1 roughly triples the throughput: Baseline: READ: bw=65.3MiB/s (68.5MB/s), 65.3MiB/s-65.3MiB/s (68.5MB/s-68.5MB/s), io=19.1GiB (20.6GB), run=300013-300013msec With this patch: READ: bw=196MiB/s (206MB/s), 196MiB/s-196MiB/s (206MB/s-206MB/s), io=57.5GiB (61.7GB), run=300006-300006msec Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:37 +02:00
Qu Wenruo	f6065f8ede	btrfs: raid56: don't trust any cached sector in __raid56_parity_recover() [BUG] There is a small workload which will always fail with recent kernel: (A simplified version from btrfs/125 test case) mkfs.btrfs -f -m raid5 -d raid5 -b 1G $dev1 $dev2 $dev3 mount $dev1 $mnt xfs_io -f -c "pwrite -S 0xee 0 1M" $mnt/file1 sync umount $mnt btrfs dev scan -u $dev3 mount -o degraded $dev1 $mnt xfs_io -f -c "pwrite -S 0xff 0 128M" $mnt/file2 umount $mnt btrfs dev scan mount $dev1 $mnt btrfs balance start --full-balance $mnt umount $mnt The failure is always failed to read some tree blocks: BTRFS info (device dm-4): relocating block group 217710592 flags data\|raid5 BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7 BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7 ... [CAUSE] With the recently added debug output, we can see all RAID56 operations related to full stripe 38928384: 56.1183: raid56_read_partial: full_stripe=38928384 devid=2 type=DATA1 offset=0 opf=0x0 physical=9502720 len=65536 56.1185: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=16384 opf=0x0 physical=9519104 len=16384 56.1185: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x0 physical=9551872 len=16384 56.1187: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=0 opf=0x1 physical=9502720 len=16384 56.1188: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=32768 opf=0x1 physical=9535488 len=16384 56.1188: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=0 opf=0x1 physical=30474240 len=16384 56.1189: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=32768 opf=0x1 physical=30507008 len=16384 56.1218: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x1 physical=9551872 len=16384 56.1219: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=49152 opf=0x1 
physical=30523392 len=16384 56.2721: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2 56.2723: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2 56.2724: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2 Before we enter raid56_parity_recover(), we have triggered some metadata write for the full stripe 38928384, this leads to us to read all the sectors from disk. Furthermore, btrfs raid56 write will cache its calculated P/Q sectors to avoid unnecessary read. This means, for that full stripe, after any partial write, we will have stale data, along with P/Q calculated using that stale data. Thankfully due to patch "btrfs: only write the sectors in the vertical stripe which has data stripes" we haven't submitted all the corrupted P/Q to disk. When we really need to recover certain range, aka in raid56_parity_recover(), we will use the cached rbio, along with its cached sectors (the full stripe is all cached). This explains why we have no event raid56_scrub_read_recover() triggered. Since we have the cached P/Q which is calculated using the stale data, the recovered one will just be stale. In our particular test case, it will always return the same incorrect metadata, thus causing the same error message "parent transid verify failed on 39010304 wanted 9 found 7" again and again. [BTRFS DESTRUCTIVE RMW PROBLEM] Test case btrfs/125 (and above workload) always has its trouble with the destructive read-modify-write (RMW) cycle: 0 32K 64K Data1: \| Good \| Good \| Data2: \| Bad \| Bad \| Parity: \| Good \| Good \| In above case, if we trigger any write into Data1, we will use the bad data in Data2 to re-generate parity, killing the only chance to recovery Data2, thus Data2 is lost forever. This destructive RMW cycle is not specific to btrfs RAID56, but there are some btrfs specific behaviors making the case even worse: - Btrfs will cache sectors for unrelated vertical stripes. 
In above example, if we're only writing into 0~32K range, btrfs will still read data range (32K ~ 64K) of Data1, and (64K~128K) of Data2. This behavior is to cache sectors for later update. Incidentally commit `d4e28d9b5f` ("btrfs: raid56: make steal_rbio() subpage compatible") has a bug which makes RAID56 to never trust the cached sectors, thus slightly improve the situation for recovery. Unfortunately, follow up fix "btrfs: update stripe_sectors::uptodate in steal_rbio" will revert the behavior back to the old one. - Btrfs raid56 partial write will update all P/Q sectors and cache them This means, even if data at (64K ~ 96K) of Data2 is free space, and only (96K ~ 128K) of Data2 is really stale data. And we write into that (96K ~ 128K), we will update all the parity sectors for the full stripe. This unnecessary behavior will completely kill the chance of recovery. Thankfully, an unrelated optimization "btrfs: only write the sectors in the vertical stripe which has data stripes" will prevent submitting the write bio for untouched vertical sectors. That optimization will keep the on-disk P/Q untouched for a chance for later recovery. [FIX] Although we have no good way to completely fix the destructive RMW (unless we go full scrub for each partial write), we can still limit the damage. With patch "btrfs: only write the sectors in the vertical stripe which has data stripes" now we won't really submit the P/Q of unrelated vertical stripes, so the on-disk P/Q should still be fine. Now we really need to do is just drop all the cached sectors when doing recovery. By this, we have a chance to read the original P/Q from disk, and have a chance to recover the stale data, while still keep the cache to speed up regular write path. In fact, just dropping all the cache for recovery path is good enough to allow the test case btrfs/125 along with the small script to pass reliably. The lack of metadata write after the degraded mount, and forced metadata COW is saving us this time. 
So this patch will fix the behavior by not trusting any cache in __raid56_parity_recover(), to solve the problem while still keeping the cache useful. But please note that this test pass DOES NOT mean we have solved the destructive RMW problem, we just do damage control a little better. Related patches: - btrfs: only write the sectors in the vertical stripe - `d4e28d9b5f` ("btrfs: raid56: make steal_rbio() subpage compatible") - btrfs: update stripe_sectors::uptodate in steal_rbio Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:37 +02:00
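The destructive read-modify-write cycle described in the commit above can be reproduced with a toy model. This is a minimal sketch and not btrfs code: it assumes a three-device RAID5 layout (Data1, Data2, Parity) with plain XOR parity, and the helper name `xor_bytes` and the byte patterns are made up for illustration.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equally sized buffers (RAID5 parity is a plain XOR)."""
    assert len(a) == len(b)
    return bytes(x ^ y for x, y in zip(a, b))


# Consistent state: parity was computed from good Data1 and good Data2.
data1 = b"\xee" * 4
data2_good = b"\x11" * 4
parity = xor_bytes(data1, data2_good)

# Data2 turns stale on its device (for example, it missed writes).
data2_stale = b"\x00" * 4

# As long as the original parity is used, Data2 can be rebuilt.
recovered = xor_bytes(data1, parity)

# A partial write to Data1 regenerates parity from the stale Data2 ...
data1_new = b"\xff" * 4
parity_rmw = xor_bytes(data1_new, data2_stale)

# ... so rebuilding Data2 from that parity only reproduces the stale bytes.
recovered_after_rmw = xor_bytes(data1_new, parity_rmw)

print(recovered == data2_good)             # True
print(recovered_after_rmw == data2_stale)  # True
```

The same reasoning applies to parity that is cached in memory instead of written to disk: recovery from a parity computed over stale data returns the stale data, which is why the recovery path in the commit drops the cache and rereads from disk.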
Christoph Hellwig	711f447b4f	btrfs: remove the finish_func argument to btrfs_mark_ordered_io_finished finish_func is always set to finish_ordered_fn, so remove it and also the now pointless and somewhat confusingly named __endio_write_update_ordered wrapper. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:37 +02:00
Nikolay Borisov	1f4f639fe7	btrfs: batch up release of reserved metadata for delayed items used for deletion With Filipe's recent rework of the delayed inode code one aspect which isn't batched is the release of the reserved metadata of delayed inode's delete items. With this patch on top of Filipe's rework and running the same test as provided in the description of a patch titled "btrfs: improve batch deletion of delayed dir index items" I observe the following change of the number of calls to btrfs_block_rsv_release: Before this change: - block_rsv_release: 1004 - btrfs_delete_delayed_items_total_time: 14602 - delete_batches: 505 After: - block_rsv_release: 510 - btrfs_delete_delayed_items_total_time: 13643 - delete_batches: 507 Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:37 +02:00
Qu Wenruo	3613249a1b	btrfs: warn about dev extents that are inside the reserved range Btrfs on-disk format has reserved the first 1MiB for the primary super block (at 64KiB offset) and bootloaders may also use this space. This behavior is only introduced since v4.1 btrfs-progs release, although kernel can ensure we never touch the reserved range of super blocks, it's better to inform the end users, and a balance will resolve the problem. Signed-off-by: Qu Wenruo <wqu@suse.com> [ update changelog and message ] Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:36 +02:00
Qu Wenruo	37f85ec320	btrfs: use named constant for reserved device space There's a reserved space on each device of size 1MiB that can be used by bootloaders or to avoid accidental overwrite. Use a symbolic constant with the explaining comment instead of hard coding the value and multiple comments. Note: since btrfs-progs v4.1, mkfs.btrfs will reserve the first 1MiB for the primary super block (at offset 64KiB), until then the range could have been used by mistake. Kernel has been always respecting the 1MiB range for writes. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:36 +02:00
David Sterba	bfceac7fd3	btrfs: remove unused typedefs get_extent_t and btrfs_work_func_t Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:36 +02:00
David Sterba	e3059ec06b	btrfs: sink iterator parameter to btrfs_ioctl_logical_to_ino There's only one function we pass to iterate_inodes_from_logical as iterator, so we can drop the indirection and call it directly, after moving the function to backref.c Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:36 +02:00
David Sterba	875d1daa7b	btrfs: simplify parameters of backref iterators The inode reference iterator interface takes parameters that are derived from the context parameter, but as it's a void* type the values are passed individually. Change the ctx type to inode_fs_path as it's the only thing we pass and drop any parameters that are derived from that. Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:36 +02:00
David Sterba	ad6240f662	btrfs: call inode_to_path directly and drop indirection The functions for iterating inode reference take a function parameter but there's only one value, inode_to_path(). Remove the indirection and call the function. As paths_from_inode would become just an alias for iterate_irefs(), merge the two into one function. Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:36 +02:00
Qu Wenruo	6d322b4839	btrfs: use ncopies from btrfs_raid_array in btrfs_num_copies() For all non-RAID56 profiles, we can use btrfs_raid_array[].ncopies directly, only for RAID5 and RAID6 we need some extra handling as there's no table value for that. For RAID10 there's a change from sub_stripes to ncopies. The values are the same but semantically we want to use number of copies, as this is what btrfs_num_copies does. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:36 +02:00
Qu Wenruo	0b30f71945	btrfs: use btrfs_raid_array to calculate number of parity stripes Use the raid table instead of hard coded values and rename the helper as it is exported. This could make later extension on RAID56 based profiles easier. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:36 +02:00
Qu Wenruo	6dead96c1a	btrfs: use btrfs_chunk_max_errors() to replace tolerance calculation In __btrfs_map_block() we have an assignment to @max_errors using nr_parity_stripes(). Although it works for RAID56 it's confusing. Replace it with btrfs_chunk_max_errors(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:36 +02:00
Qu Wenruo	bc88b486d5	btrfs: remove parameter dev_extent_len from scrub_stripe() For scrub_stripe() we can easily calculate the dev extent length as we have the full info of the chunk. Thus there is no need to pass @dev_extent_len from the caller, and we introduce a helper, btrfs_calc_stripe_length(), to do the calculation from extent_map structure. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:36 +02:00
David Sterba	9db33891c7	btrfs: unify tree search helper returning prev and next nodes Simplify helper to return only next and prev pointers, we don't need all the node/parent/prev/next pointers of __etree_search as there are now other specialized helpers. Rename parameters so they follow the naming. Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:36 +02:00
David Sterba	ec60c76f53	btrfs: make tree search for insert more generic and use it for tree_search With a slight extension of tree_search_for_insert (fill the return node and parent return parameters) we can avoid calling __etree_search from tree_search, that could be removed eventually in followup patches. Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:35 +02:00
David Sterba	bebb22c13d	btrfs: open code inexact rbtree search in tree_search The call chain from tree_search -> tree_search_for_insert -> __etree_search can be open coded and allow further simplifications, here we need a tree search with fallback to the next node in case it's not found. This is represented as __etree_search parameters next_ret=valid, prev_ret=NULL. Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:35 +02:00
David Sterba	c367602a78	btrfs: remove node and parent parameters from insert_state There's no caller left that would pass valid pointers to insert_state so we can drop them. Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:35 +02:00
David Sterba	fb8f07d2d8	btrfs: add fast path for extent_state insertion In two cases the exact location where to insert the extent state is known at the call time so we don't need to pass it to insert_state that takes the fast path. Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:35 +02:00
David Sterba	6d92b304ec	btrfs: pass bits by value not by pointer for extent_state helpers The bits are passed to all extent state helpers for no apparent reason, the value is only read and never updated so remove the indirection and pass it directly. Also unify the type to u32 where needed. Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:35 +02:00
David Sterba	cee5126825	btrfs: lift start and end parameters to callers of insert_state Let callers of insert_state set up the extent state to allow further simplifications of the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:35 +02:00
David Sterba	c7e118cf98	btrfs: open code rbtree search in insert_state The rbtree search is a known pattern and can be open coded, allowing to remove the tree_insert and further cleanups. Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:35 +02:00
David Sterba	12c9cdda62	btrfs: open code rbtree search in split_state Preparatory work to remove tree_insert from extent_io.c, the rbtree search loop is known and simple so it can be open coded. Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:35 +02:00
Qu Wenruo	1c10702e7c	btrfs: raid56: avoid double for loop inside raid56_parity_scrub_stripe() Originally it's iterating all the sectors which have a dbitmap sector for the vertical stripe. It can be easily converted to sector bytenr iteration with a test_bit() call. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:35 +02:00
Qu Wenruo	550cdeb3e0	btrfs: raid56: avoid double for loop inside raid56_rmw_stripe() This function doesn't even utilize full stripe skip, just iterating over all the data sectors is definitely enough. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:35 +02:00
Qu Wenruo	aee35e4bcc	btrfs: raid56: avoid double for loop inside alloc_rbio_essential_pages() The double loop is just checking if the page for the vertical stripe is allocated. We can easily convert it to single loop and get rid of @stripe variable. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:35 +02:00
Qu Wenruo	ef340fccbe	btrfs: raid56: avoid double for loop inside __raid56_parity_recover() The double for loop can be easily converted to single for loop as we're really iterating the sectors in their bytenr order. The only exception is the full stripe skip, however that can also easily be done inside the loop. Add an ASSERT() along with a comment for that specific case. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:35 +02:00
Qu Wenruo	3692004465	btrfs: raid56: avoid double for loop inside finish_rmw() We can easily calculate the stripe number and sector number inside the stripe. Thus there is not much need for a double for loop. For the only case we want to skip the whole stripe, we can manually increase @total_sector_nr. This is not a recommended behavior, thus every time the iterator gets modified there will be a comment along with an ASSERT() for it. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:34 +02:00
Josef Bacik	f31f09f6be	btrfs: tree-log: make the return value for log syncing consistent Currently we will return 1 or -EAGAIN if we decide we need to commit the transaction rather than sync the log. In practice this doesn't really matter, we interpret any !0 and !BTRFS_NO_LOG_SYNC as needing to commit the transaction. However this makes it hard to figure out what the correct thing to do is. Fix this up by defining BTRFS_LOG_FORCE_COMMIT and using this in all the places where we want to force the transaction to be committed. CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:34 +02:00
Johannes Thumshirn	5bea250881	btrfs: add tracepoints for ordered extents When debugging a reference counting issue with ordered extents, I've found we're lacking a lot of tracepoint coverage in the ordered extent code. Close these gaps by adding tracepoints after every refcount_inc() in the ordered extent code. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:34 +02:00
David Sterba	15dcccdb8b	btrfs: sysfs: advertise zoned support among features We've hidden the zoned support in sysfs under debug config for the first releases but now the stability is reasonable, though not all features have been implemented. Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:34 +02:00
Christoph Hellwig	a4012f06f1	btrfs: split discard handling out of btrfs_map_block Mapping block for discard doesn't really share any code with the regular block mapping case. Split it out into an entirely separate helper that just returns an array of btrfs_discard_stripe structures and the number of stripes. This removes the need for the length field in the btrfs_io_context structure, so remove that. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:34 +02:00
Christoph Hellwig	5eecef7108	btrfs: stop looking at btrfs_bio->iter in index_one_bio All the bios that index_one_bio operates on are the bios submitted by the upper layer. These are never resubmitted to an actual device by the raid56 code, and thus the iter never changes from the initial state. Thus we can always just use bi_iter directly as it will be the same as the saved copy. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:34 +02:00
Qu Wenruo	dc4d316849	btrfs: reject log replay if there is unsupported RO compat flag [BUG] If we have a btrfs image with dirty log, along with an unsupported RO compatible flag: log_root 30474240 ... compat_flags 0x0 compat_ro_flags 0x40000003 ( FREE_SPACE_TREE \| FREE_SPACE_TREE_VALID \| unknown flag: 0x40000000 ) Then even if we can only mount it RO, we will still cause metadata update for log replay: BTRFS info (device dm-1): flagging fs with big metadata feature BTRFS info (device dm-1): using free space tree BTRFS info (device dm-1): has skinny extents BTRFS info (device dm-1): start tree-log replay This is definitely against the RO compat flag requirement. [CAUSE] The RO compat flag only forces us to do an RO mount, but we will still do log replay for a plain RO mount. Thus this results in log replay and metadata updates. This can be very problematic for a new RO compat flag, for example an older kernel cannot understand v2 cache, and if we allow metadata updates on RO mount we can invalidate/corrupt the v2 cache. [FIX] Just reject the mount unless rescue=nologreplay is provided: BTRFS error (device dm-1): cannot replay dirty log with unsupport optional features (0x40000000), try rescue=nologreplay instead We don't want to set rescue=nologreplay directly, as this would make the end user read the old data, and cause confusion. Since such a case is really rare, we're mostly fine to just reject the mount with an error message, which also includes the proper workaround. CC: stable@vger.kernel.org #4.9+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:34 +02:00
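The decision described in the commit above can be modelled in a few lines. This is a minimal sketch under stated assumptions, not the kernel implementation: `check_mount` and `SUPPORTED_COMPAT_RO` are hypothetical names, the supported set is reduced to the two flags shown in the dump, and the model ignores the separate rule that unsupported RO compat flags force a read-only mount.

```python
# Bit values consistent with the dump above: 0x40000003 decodes to
# FREE_SPACE_TREE | FREE_SPACE_TREE_VALID | unknown flag 0x40000000.
FREE_SPACE_TREE = 1 << 0
FREE_SPACE_TREE_VALID = 1 << 1

# Assumption: this toy "kernel" understands only the two flags above.
SUPPORTED_COMPAT_RO = FREE_SPACE_TREE | FREE_SPACE_TREE_VALID


def check_mount(compat_ro_flags: int, has_dirty_log: bool,
                nologreplay: bool) -> tuple[bool, str]:
    """Return (allowed, reason) for a read-only mount attempt."""
    unsupported = compat_ro_flags & ~SUPPORTED_COMPAT_RO
    if unsupported and has_dirty_log and not nologreplay:
        # Replaying the log would update metadata this kernel cannot
        # keep consistent, so the mount is refused.
        return False, f"unsupported RO compat flags {unsupported:#x}"
    return True, "ok"


print(check_mount(0x40000003, has_dirty_log=True, nologreplay=False))
print(check_mount(0x40000003, has_dirty_log=True, nologreplay=True))
```

The first call models the rejected mount from the commit message, the second the documented workaround of passing rescue=nologreplay.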
Qu Wenruo	97f09d55f1	btrfs: make btrfs_super_block::log_root_transid deprecated When using "btrfs inspect-internal dump-super" to inspect an fs with dirty log, it always shows the log_root_transid as 0: log_root 30474240 log_root_transid 0 <<< log_root_level 0 It turns out that, btrfs_super_block::log_root_transid is never really utilized (even no read for it). This can date back to the introduction of btrfs into upstream kernel. In fact, when reading log tree root, we always use btrfs_super_block::generation + 1 as the expected generation. So here we're completely safe to mark this member deprecated. In theory we can easily reuse this member for other purposes, but to be extra safe, here we follow the leafsize way, by adding "__unused_" for log_root_transid. And we can safely remove the accessors, since there is no such callers from the very beginning. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:34 +02:00
Christoph Hellwig	722c82ac9e	btrfs: pass the btrfs_bio_ctrl to submit_one_bio submit_one_bio always works on the bio and compression flags from a btrfs_bio_ctrl structure. Pass it explicitly and clean up the calling conventions by handling a NULL bio in submit_one_bio, and using the btrfs_bio_ctrl to pass the mirror number as well. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:34 +02:00
Christoph Hellwig	9845e5ddcb	btrfs: merge end_write_bio and flush_write_bio Merge end_write_bio and flush_write_bio into a single submit_write_bio helper, that either submits the bio or ends it if a negative errno was passed in. This consolidates a lot of duplicated checks in the callers. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:34 +02:00
Christoph Hellwig	2d5ac130fa	btrfs: don't use bio->bi_private to pass the inode to submit_one_bio submit_one_bio is only used for page cache I/O, so the inode can be trivially derived from the first page in the bio. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:34 +02:00
David Sterba	234fdd2815	btrfs: remove redundant check in up check_setget_bounds There are two separate checks in the bounds checker, the first one being a special case of the second. As this function is performance critical due to checking access to any eb member, reducing the size can slightly improve performance. On a release build on x86_64 the helper is completely inlined so the function call overhead is also gone. There was a report of 5% performance drop on metadata heavy workload, that disappeared after disabling asserts. The most significant part of that is the bounds checker. https://lore.kernel.org/linux-btrfs/20200724164147.39925-1-josef@toxicpanda.com/ After the analysis, the optimized code removes the worst overhead which is the function call and the performance was restored. https://lore.kernel.org/linux-btrfs/20200730110943.GE3703@twin.jikos.cz/ 1. baseline, asserts on, setget check on run time: 46s run time with perf: 48s 2. asserts on, comment out setget check run time: 44s run time with perf: 47s So this confirms the 5% difference 3. asserts on, optimized setget check run time: 44s run time with perf: 47s The optimizations are reducing the number of ifs to 1 and inlining the hot path. Low-level stuff, gets the performance back. Patch below. 4. asserts off, no setget check run time: 44s run time with perf: 45s This verifies that asserts other than the setget check have negligible impact on performance and it's not harmful to keep them on. Analysis where the performance is lost: * check_setget_bounds is a short function, but it's still a function call, changing the flow of instructions and given how many times it's called the overhead adds up * there are two conditions, one to check if the range is completely outside (member_offset > eb->len) or partially inside (member_offset + size > eb->len) Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:33 +02:00
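The claim in the commit above that the first bounds condition is a special case of the second can be checked by brute force. This is a minimal sketch, not the kernel helper: the function names are hypothetical, sizes are assumed to be non-negative, and Python integers do not wrap, so the integer overflow that C code would have to consider is outside this model.

```python
def out_of_bounds_two_checks(eb_len: int, member_offset: int, size: int) -> bool:
    """Original shape: range completely outside, or partially inside."""
    if member_offset > eb_len:
        return True
    if member_offset + size > eb_len:
        return True
    return False


def out_of_bounds_one_check(eb_len: int, member_offset: int, size: int) -> bool:
    """Reduced shape: for size >= 0 the second condition covers the first."""
    return member_offset + size > eb_len


EB_LEN = 16  # arbitrary small buffer length for the exhaustive comparison

mismatches = [
    (offset, size)
    for offset in range(0, 2 * EB_LEN)
    for size in range(0, 9)
    if out_of_bounds_two_checks(EB_LEN, offset, size)
    != out_of_bounds_one_check(EB_LEN, offset, size)
]

print(len(mismatches))  # prints 0
```

Whenever member_offset is already past the end of the buffer, adding a non-negative size keeps it past the end, so dropping the first comparison does not change the result.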
Fabio M. De Francesco	51c0674a56	btrfs: replace kmap() with kmap_local_page() in lzo.c The use of kmap() is being deprecated in favor of kmap_local_page() where it is feasible. With kmap_local_page(), the mapping is per thread, CPU local and not globally visible. Therefore, use kmap_local_page() / kunmap_local() in lzo.c wherever the mappings are per thread and not globally visible. Tested on QEMU + KVM 32 bits VM with 4GB of RAM and HIGHMEM64G enabled. Suggested-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:33 +02:00
Fabio M. De Francesco	70826b6bd5	btrfs: replace kmap() with kmap_local_page() in inode.c The use of kmap() is being deprecated in favor of kmap_local_page() where it is feasible. With kmap_local_page(), the mapping is per thread, CPU local and not globally visible. Therefore, use kmap_local_page() / kunmap_local() in inode.c wherever the mappings are per thread and not globally visible. Tested on QEMU + KVM 32 bits VM with 4GB of RAM and HIGHMEM64G enabled. Suggested-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:33 +02:00
Christoph Hellwig	9ff7ddd3c7	btrfs: do not allocate a btrfs_bio for low-level bios The bios submitted from btrfs_map_bio don't really interact with the rest of btrfs and the only btrfs_bio member actually used in the low-level bios is the pointer to the btrfs_io_context used for endio handler. Use a union in struct btrfs_io_stripe that allows the endio handler to find the btrfs_io_context and remove the spurious ->device assignment so that a plain fs_bio_set bio can be used for the low-level bios allocated inside btrfs_map_bio. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:33 +02:00
Christoph Hellwig	a316a25991	btrfs: factor stripe submission logic out of btrfs_map_bio Move all per-stripe handling into submit_stripe_bio and use a label to cleanup instead of duplicating the logic. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:33 +02:00
Christoph Hellwig	d7b9416fe5	btrfs: remove btrfs_end_io_wq All read bios that go through btrfs_map_bio need to be completed in user context. And read I/Os are the most common and timing critical in almost any file system workload. Embed a work_struct into struct btrfs_bio and use it to complete all read bios submitted through btrfs_map, using the REQ_META flag to decide which workqueue they are placed on. This removes the need for a separate 128 byte allocation (typically rounded up to 192 bytes by slab) for all reads with a size increase of 24 bytes for struct btrfs_bio. Future patches will reorganize struct btrfs_bio to make use of this extra space for writes as well. (All sizes are based on a typical 64-bit non-debug build) Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:33 +02:00
Christoph Hellwig	08a6f46434	btrfs: centralize setting REQ_META Set REQ_META in btrfs_submit_metadata_bio instead of the various callers. We'll start relying on this flag inside of btrfs in a bit, and this ensures it is always set correctly. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:33 +02:00
Christoph Hellwig	fed8a72df1	btrfs: don't use btrfs_bio_wq_end_io for compressed writes Compressed write bio completion is the only user of btrfs_bio_wq_end_io for writes, and the use of btrfs_bio_wq_end_io is a little suboptimal here as we only really need user context for the final completion of a compressed_bio structure, and not every single bio completion. Add a work_struct to struct compressed_bio instead and use that to call finish_compressed_bio_write. This allows removing all handling of write bios in the btrfs_bio_wq_end_io infrastructure. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:33 +02:00
Christoph Hellwig	02bb5b7247	btrfs: don't double-defer bio completions for compressed reads The bio completion handler of the bio used for the compressed data is already run in a workqueue using btrfs_bio_wq_end_io, so don't schedule the completion of the original bio to the same workqueue again but just execute it directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:33 +02:00
Christoph Hellwig	d34e123de1	btrfs: defer I/O completion based on the btrfs_raid_bio Instead of attaching an extra allocation and an indirect call to each low-level bio issued by the RAID code, add a work_struct to struct btrfs_raid_bio and only defer the per-rbio completion action. The per-bio actions for all the I/Os are trivial and can be safely done from interrupt context. As a nice side effect this also allows sharing the boilerplate code for the per-bio completions. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:33 +02:00
Christoph Hellwig	c93104e758	btrfs: split btrfs_submit_data_bio to read and write parts Split btrfs_submit_data_bio into one helper for reads and one for writes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:33 +02:00
Christoph Hellwig	e6484bd488	btrfs: simplify code flow in btrfs_submit_dio_bio There is no exit block and cleanup and the function is reasonably short so we can use inline return and not the goto. This makes the function more straight forward. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:33 +02:00
Christoph Hellwig	b4c46bdea9	btrfs: move more work into btrfs_end_bioc Assign ->mirror_num and ->bi_status in btrfs_end_bioc instead of duplicating the logic in the callers. Also remove the bio argument as it always must be bioc->orig_bio and the now pointless bioc_error that did nothing but assign bi_sector to the same value just sampled in the caller. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:32 +02:00
Omar Sandoval	d681559280	btrfs: send: enable support for stream v2 and compressed writes Now that the new support is implemented, allow the ioctl to accept v2 and the compressed flag, and update the version in sysfs. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:32 +02:00
Omar Sandoval	3ea4dc5bf0	btrfs: send: send compressed extents with encoded writes Now that all of the pieces are in place, we can use the ENCODED_WRITE command to send compressed extents when appropriate. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:32 +02:00
Omar Sandoval	a4b333f227	btrfs: send: get send buffer pages for protocol v2 For encoded writes in send v2, we will get the encoded data with btrfs_encoded_read_regular_fill_pages(), which expects a list of raw pages. To avoid extra buffers and copies, we should read directly into the send buffer. Therefore, we need the raw pages for the send buffer. We currently allocate the send buffer with kvmalloc(), which may return a kmalloc'd buffer or a vmalloc'd buffer. For vmalloc, we can get the pages with vmalloc_to_page(). For kmalloc, we could use virt_to_page(). However, the buffer size we use (144K) is not a power of two, which in theory is not guaranteed to return a page-aligned buffer, and in practice would waste a lot of memory due to rounding up to the next power of two. 144K is large enough that it usually gets allocated with vmalloc(), anyways. So, for send v2, replace kvmalloc() with vmalloc() and save the pages in an array. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:32 +02:00
Omar Sandoval	356bbbb66b	btrfs: send: write larger chunks when using stream v2 The length field of the send stream TLV header is 16 bits. This means that the maximum amount of data that can be sent for one write is 64K minus one. However, encoded writes must be able to send the maximum compressed extent (128K) in one command, or more. To support this, send stream version 2 encodes the DATA attribute differently: it has no length field, and the length is implicitly up to the end of containing command (which has a 32bit length field). Although this is necessary for encoded writes, normal writes can benefit from it, too. Also add a check to enforce that the DATA attribute is last. It is only strictly necessary for v2, but we might as well make v1 consistent with it. For v2, let's bump up the send buffer to the maximum compressed extent size plus 16K for the other metadata (144K total). Since this will most likely be vmalloc'd (and always will be after the next commit), we round it up to the next page since we might as well use the rest of the page on systems with >16K pages. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:32 +02:00
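The size arithmetic in this message can be checked with a small sketch. The constant and function names here are hypothetical; only the figures (16-bit length field, 128K maximum compressed extent, 16K of extra metadata space, rounding up to a page) come from the commit message.

```python
SZ_1K = 1024
MAX_COMPRESSED_EXTENT = 128 * SZ_1K   # maximum compressed extent, per the message
TLV_LEN_MAX = 2**16 - 1               # largest value a 16-bit length field can hold


def fits_in_v1_tlv(data_len: int) -> bool:
    """True if a DATA attribute of this length fits a 16-bit TLV length field."""
    return data_len <= TLV_LEN_MAX


def send_buf_size_v2(page_size: int) -> int:
    """Compressed extent plus 16K for other metadata, rounded up to a whole page."""
    unrounded = MAX_COMPRESSED_EXTENT + 16 * SZ_1K   # 144K
    pages = -(-unrounded // page_size)               # ceiling division
    return pages * page_size
```

With 4K or 16K pages, 144K is already page aligned, so the rounding only changes the result on systems with pages larger than 16K, for example 64K pages.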
Omar Sandoval	b7c14f23fb	btrfs: send: add stream v2 definitions This adds the definitions of the new commands for send stream version 2 and their respective attributes: fallocate, FS_IOC_SETFLAGS (a.k.a. chattr), and encoded writes. It also documents two changes to the send stream format in v2: the receiver shouldn't assume a maximum command size, and the DATA attribute is encoded differently to allow for writes larger than 64k. These will be implemented in subsequent changes, and then the ioctl will accept the new version and flag. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:32 +02:00
Omar Sandoval	54cab6aff8	btrfs: send: explicitly number commands and attributes Commit `e77fbf9903` ("btrfs: send: prepare for v2 protocol") added _BTRFS_SEND_C_MAX_V* macros equal to the maximum command number for the version plus 1, but as written this creates gaps in the number space. The maximum command number is currently 22, and __BTRFS_SEND_C_MAX_V1 is accordingly 23. But then __BTRFS_SEND_C_MAX_V2 is 24, suggesting that v2 has a command numbered 23, and __BTRFS_SEND_C_MAX is 25, suggesting that 23 and 24 are valid commands. Instead, let's explicitly number all of the commands, attributes, and sentinel MAX constants. Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:32 +02:00
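The numbering gap described here can be reproduced with a short sketch. The command names are placeholders (`CMD_0` to `CMD_22`), and the implicit enum is modelled with a hypothetical helper; only the sentinel names and the values 22, 23, 24 and 25 come from the commit message.

```python
def implicit_enum(names):
    """Number names the way a C enum without explicit values would: 0, 1, 2, ..."""
    return {name: number for number, name in enumerate(names)}


# Placeholder command names covering command numbers 0..22.
commands = [f"CMD_{n}" for n in range(23)]
sentinels = ["__BTRFS_SEND_C_MAX_V1", "__BTRFS_SEND_C_MAX_V2", "__BTRFS_SEND_C_MAX"]
numbering = implicit_enum(commands + sentinels)


def implied_commands(low_sentinel: str, high_sentinel: str) -> list:
    """Command numbers a reader would infer between two 'max plus 1' sentinels."""
    return list(range(numbering[low_sentinel], numbering[high_sentinel]))
```

Each sentinel consumes a number of its own, so the sentinels appear to bracket commands that were never defined. Numbering every entry explicitly avoids that.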
Omar Sandoval	ca182acc53	btrfs: send: remove unused send_ctx::{total,cmd}_send_size We collect these statistics but have never exposed them in any way. I also didn't find any patches that ever attempted to make use of them. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:32 +02:00
Stefan Roesch	22c55e3bbb	btrfs: sysfs: add force_chunk_alloc trigger to force allocation Adds write-only trigger to force new chunk allocation for a given block group type. It is at /sys/fs/btrfs/<uuid>/allocation/<type>/force_chunk_alloc Note: this is now only for debugging and testing and is enabled with the CONFIG_BTRFS_DEBUG configuration option. The transaction is started from sysfs context and can be problematic in some cases. Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> [ Changes from the original submission: - update changelog - drop unnecessary error messages - switch value to bool and use kstrtobool - move BTRFS_ATTR_W definition - add comment for using transaction ] Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:32 +02:00
Stefan Roesch	19fc516a51	btrfs: sysfs: export chunk size in space infos Add new sysfs knob /sys/fs/btrfs/<uuid>/allocation/<type>/chunk_size. This allows querying and setting the chunk size. Constraints: - can be changed by root only - system chunk size can't be set - maximum chunk size is 10% of the filesystem size - final value is rounded down to a multiple of 256M - cannot be set on zoned filesystem Note that rounding and the 10% clamp will result in a different value on filesystems smaller than 10G, typically 768M. Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> [ Changes to original submission: - document setting constraints - drop read-only requirement - drop unnecessary error messages - fix return values of _store callback - use memparse for the value - fix rounding down to 256M ] Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:32 +02:00
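A minimal sketch of the 10% clamp and the 256M rounding, under the assumption that the clamp is applied before the rounding. The function name is hypothetical and the root-only, system-chunk and zoned checks are left out.

```python
SZ_256M = 256 * 1024 * 1024
SZ_1G = 1024 * 1024 * 1024


def clamp_chunk_size(requested: int, fs_size: int) -> int:
    """Apply the documented constraints: at most 10% of the filesystem size,
    then round down to a multiple of 256M (ordering is an assumption)."""
    limited = min(requested, fs_size // 10)
    return (limited // SZ_256M) * SZ_256M
```

For example, requesting 1G on an 8G filesystem is clamped to roughly 819M and then rounded down to 768M, which matches the value the message calls typical for filesystems smaller than 10G.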
Stefan Roesch	f6fca3917b	btrfs: store chunk size in space-info struct The chunk size is stored in the btrfs_space_info structure. It is initialized at the start and is then used. A new API is added to update the current chunk size. This API is used to be able to expose the chunk_size as a sysfs setting. Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> [ rename and merge helpers, switch atomic type to u64, style fixes ] Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:32 +02:00
Josef Bacik	71b68e9e35	btrfs: do not batch insert non-consecutive dir indexes during log replay While running generic/475 in a loop I got the following error BTRFS critical (device dm-11): corrupt leaf: root=5 block=31096832 slot=69, bad key order, prev (263 96 531) current (263 96 524) <snip> item 65 key (263 96 517) itemoff 14132 itemsize 33 item 66 key (263 96 523) itemoff 14099 itemsize 33 item 67 key (263 96 525) itemoff 14066 itemsize 33 item 68 key (263 96 531) itemoff 14033 itemsize 33 item 69 key (263 96 524) itemoff 14000 itemsize 33 As you can see here we have 3 dir index keys with the dir index value of 523, 524, and 525 inserted between 517 and 524. This occurs because our dir index insertion code will bulk insert all dir index items on the node regardless of their actual key value. This makes sense on a normally running system, because if there's a gap in between the items there was a deletion before the item was inserted, so there's not going to be an overlap of the dir index items that need to be inserted and what exists on disk. However during log replay this isn't necessarily true, we could have any number of dir indexes in the tree already. Fix this by seeing if we're replaying the log, and if we are simply skip batching if there's a gap in the key space. This file system was left broken from the fstest, I tested this patch against the broken fs to make sure it replayed the log properly, and then btrfs checked the file system after the log replay to verify everything was ok. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:45:14 +02:00
Filipe Manana	763748b238	btrfs: reduce amount of reserved metadata for delayed item insertion Whenever we want to create a new dir index item (when creating an inode, create a hard link, rename a file) we reserve 1 unit of metadata space for it in a transaction (that's 256K for a node/leaf size of 16K), and then create a delayed insertion item for it to be added later to the subvolume's tree. That unit of metadata is kept until the delayed item is inserted into the subvolume tree, which may take a while to happen (in the worst case, it's done only when the transaction commits). If we have multiple dir index items to insert for the same directory, say N index items, and they all fit in a single leaf of metadata, then we are holding N units of reserved metadata space when all we need is 1 unit. This change addresses that, whenever a new delayed dir index item is added, we release the unit of metadata the caller has reserved when it started the transaction if adding that new dir index item does not result in touching one more metadata leaf, otherwise the reservation is kept by transferring it from the transaction block reserve to the delayed items block reserve, just like before. Given that with a leaf size of 16K we can have a few hundred dir index items in a single leaf (the exact value depends on file name lengths), this reduces pressure on metadata reservation by releasing unnecessary space much sooner. 
The following fs_mark test showed some improvement when creating many files in parallel on a machine running a non debug kernel (debian's default kernel config) with 12 cores: $ cat test.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 MOUNT_OPTIONS="-o ssd" FILES=100000 THREADS=$(nproc --all) echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor mkfs.btrfs -f $DEV mount $MOUNT_OPTIONS $DEV $MNT OPTS="-S 0 -L 10 -n $FILES -s 0 -t $THREADS -k" for ((i = 1; i <= $THREADS; i++)); do OPTS="$OPTS -d $MNT/d$i" done fs_mark $OPTS umount $MNT Before: FSUse% Count Size Files/sec App Overhead 2 1200000 0 225991.3 5465891 4 2400000 0 345728.1 5512106 4 3600000 0 346959.5 5557653 8 4800000 0 329643.0 5587548 8 6000000 0 312657.4 5606717 8 7200000 0 281707.5 5727985 12 8400000 0 88309.8 5020422 12 9600000 0 85835.9 5207496 16 10800000 0 81039.2 5404964 16 12000000 0 58548.6 5842468 After: FSUse% Count Size Files/sec App Overhead 2 1200000 0 230604.5 5778375 4 2400000 0 348908.3 5508072 4 3600000 0 357028.7 5484337 6 4800000 0 342898.3 5565703 6 6000000 0 314670.8 5751555 8 7200000 0 282548.2 5778177 12 8400000 0 90844.9 5306819 12 9600000 0 86963.1 5304689 16 10800000 0 89113.2 5455248 16 12000000 0 86693.5 5518933 The "after" results are after applying this patch and all the other patches in the same patchset, which is comprised of the following changes: btrfs: balance btree dirty pages and delayed items after a rename btrfs: free the path earlier when creating a new inode btrfs: balance btree dirty pages and delayed items after clone and dedupe btrfs: add assertions when deleting batches of delayed items btrfs: deal with deletion errors when deleting delayed items btrfs: refactor the delayed item deletion entry point btrfs: improve batch deletion of delayed dir index items btrfs: assert that delayed item is a dir index item when adding it btrfs: improve batch insertion of delayed dir index items btrfs: do not BUG_ON() on failure to reserve metadata for delayed item btrfs: set delayed item type when initializing it btrfs: reduce amount of reserved metadata for delayed item insertion Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:36 +02:00
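The reservation saving described in this commit can be put into numbers with a rough sketch. Several things are assumptions rather than statements from the message: the unit formula (node size times a maximum tree level of 8 times 2, which reproduces the quoted 256K for 16K nodes), the function names, and the simplification that a fixed number of dir index items fits in one leaf.

```python
import math

BTRFS_MAX_LEVEL = 8  # assumption: chosen so that a 16K node size gives 256K per unit


def metadata_unit(nodesize: int) -> int:
    """Size of one unit of reserved metadata space (assumed formula)."""
    return nodesize * BTRFS_MAX_LEVEL * 2


def reserved_one_unit_per_item(n_items: int, nodesize: int) -> int:
    """Behaviour before the change: every delayed dir index item holds a unit."""
    return n_items * metadata_unit(nodesize)


def reserved_one_unit_per_leaf(n_items: int, items_per_leaf: int, nodesize: int) -> int:
    """Behaviour after the change, simplified: a unit is kept only when an
    item would touch one more leaf; items_per_leaf is a hypothetical constant."""
    leaves = math.ceil(n_items / items_per_leaf)
    return leaves * metadata_unit(nodesize)
```

With 300 items that all fit in one 16K leaf, the first function holds 300 units of 256K while the second holds a single unit.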
Filipe Manana	c9d02ab4b4	btrfs: set delayed item type when initializing it Currently we set the type of a delayed item only after successfully inserting it into its respective rbtree. This is fine, as the type is not used anywhere before that point, but for the next patch in the series, there will be the need to check the type of a delayed item before inserting it into a rbtree. So set the type of a delayed item immediately after allocating it. This also makes the trivial wrappers for adding insertion and deletion useless, so it removes them as well. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:36 +02:00
Filipe Manana	3bae13e9d4	btrfs: do not BUG_ON() on failure to reserve metadata for delayed item At btrfs_insert_delayed_dir_index(), we don't expect the metadata reservation for the delayed dir index item insertion to fail, because the caller is supposed to have reserved 1 unit of metadata space for that. All callers are able to deal with an error in case that happens, so there is no need for something so drastic as a BUG_ON() in case of failure. Instead just emit a warning, so that's easily noticed during development (fstests in particular), and return the error to the caller. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:35 +02:00
Filipe Manana	06ac264f3f	btrfs: improve batch insertion of delayed dir index items Currently we group delayed dir index items for insertion as a single batch (a single btree operation) as long as their keys are sequential in the key space. For example we have delayed index items for the following index keys: 10, 11, 12, 15, 16, 20, 21 We end up building three batches: 1) First one for index keys 10, 11 and 12; 2) Second one for index keys 15 and 16; 3) Third one for index keys 20 and 21. However, since the dir index numbers come from a monotonically increasing counter and are never reused, we could group all these items into a single batch. The existence of holes in the sequence happens only when we had delayed dir index items for insertion that got deleted before they were flushed to the subvolume's tree. The delayed items are stored in a rbtree based on their key order, so we can just group items into a batch as long as they all fit in a leaf, and ignore if there's a gap (key offset, index number) between two consecutive items. This is more efficient and reduces the amount of time spent when running delayed items if there are gaps between dir index items. For example running the following test script: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj mkfs.btrfs -f $DEV mount $DEV $MNT NUM_FILES=100 mkdir $MNT/testdir for ((i = 1; i <= $NUM_FILES; i++)); do echo -n > $MNT/testdir/file_$i done # Now delete every other file, to create gaps in the dir index keys. for ((i = 1; i <= $NUM_FILES; i += 2)); do rm -f $MNT/testdir/file_$i done start=$(date +%s%N) sync end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo -e "\nsync took $dur milliseconds" umount $MNT While having the following bpftrace script running in another shell: $ cat bpf-delayed-items-inserts.sh #!/usr/bin/bpftrace /* Must add 'noinline' to btrfs_insert_delayed_items(). 
*/ k:btrfs_insert_delayed_items { @start_insert_delayed_items[tid] = nsecs; } k:btrfs_insert_empty_items /@start_insert_delayed_items[tid]/ { @insert_batches = count(); } kr:btrfs_insert_delayed_items /@start_insert_delayed_items[tid]/ { $dur = (nsecs - @start_insert_delayed_items[tid]) / 1000; @btrfs_insert_delayed_items_total_time = sum($dur); delete(@start_insert_delayed_items[tid]); } Before this change: @btrfs_insert_delayed_items_total_time: 576 @insert_batches: 51 After this change: @btrfs_insert_delayed_items_total_time: 174 @insert_batches: 2 Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:35 +02:00
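The difference between the two batching strategies can be sketched on the index keys from the message. This is a simplified model, not the kernel logic: both function names are hypothetical, and the leaf capacity is reduced to a fixed item count (`max_items_per_leaf`), whereas the real limit depends on item sizes.

```python
def batches_sequential_only(keys):
    """Old grouping: start a new batch whenever the next key is not previous + 1."""
    batches = []
    for key in keys:
        if batches and key == batches[-1][-1] + 1:
            batches[-1].append(key)
        else:
            batches.append([key])
    return batches


def batches_ignoring_gaps(keys, max_items_per_leaf):
    """New grouping: keep adding sorted keys while the batch still fits in a leaf."""
    batches = []
    for key in keys:
        if batches and len(batches[-1]) < max_items_per_leaf:
            batches[-1].append(key)
        else:
            batches.append([key])
    return batches
```

For the keys 10, 11, 12, 15, 16, 20, 21 the first function yields three batches and the second yields one, provided the leaf can hold all seven items.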
Filipe Manana	a176affe54	btrfs: assert that delayed item is a dir index item when adding it All delayed items are for dir index items, we don't support any other item types at the moment. So simplify __btrfs_add_delayed_item() and add an assertion for checking the item's key type. This also allows the next change to be simpler and avoid to check key types. In case we add support for different item types in the future, then we'll hit the assertion during development and be able to adjust any code that is assuming delayed items are always associated to dir index items. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:35 +02:00
Filipe Manana	4bd02d9012	btrfs: improve batch deletion of delayed dir index items Currently we group delayed dir index items for deletion in a single batch (single btree operation) as long as they all exist in the same leaf and as long as their keys are sequential in the key space. For example if we have a leaf that has dir index items with offsets: 2, 3, 4, 6, 7, 10 And we have delayed dir index items for deleting all these indexes, and no delayed items for any other index keys in between, then we end up deleting in 3 batches: 1) First batch for indexes 2, 3 and 4; 2) Second batch for indexes 6 and 7; 3) Third batch for index 10. This is a waste because we can delete all the index keys in a single batch. What matters is that each consecutive delayed index key matches each consecutive dir index key in a leaf. So update the logic at btrfs_batch_delete_items() to check only for a key match between delayed dir index items and dir index items in a leaf. Also avoid the useless first iteration on comparing the key of the first slot to delete with the key of the first delayed item, as it's silly since they always match, as the delayed item's key was used for the btree search that gave us the path we have. This is more efficient and reduces runtime of running delayed items, as well as lock contention on the subvolume's tree. For example, the following test script: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj mkfs.btrfs -f $DEV mount $DEV $MNT NUM_FILES=1000 mkdir $MNT/testdir for ((i = 1; i <= $NUM_FILES; i++)); do echo -n > $MNT/testdir/file_$i done # Now delete every other file, to create gaps in the dir index keys. for ((i = 1; i <= $NUM_FILES; i += 2)); do rm -f $MNT/testdir/file_$i done # Sync to force any delayed items to be flushed to the tree. 
sync start=$(date +%s%N) rm -fr $MNT/testdir end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo -e "\nrm -fr took $dur milliseconds" umount $MNT Running that test script while having the following bpftrace script running in another shell: $ cat bpf-measure.sh #!/usr/bin/bpftrace /* Add 'noinline' to btrfs_delete_delayed_items()'s definition. */ k:btrfs_delete_delayed_items { @start_delete_delayed_items[tid] = nsecs; } k:btrfs_del_items /@start_delete_delayed_items[tid]/ { @delete_batches = count(); } kr:btrfs_delete_delayed_items /@start_delete_delayed_items[tid]/ { $dur = (nsecs - @start_delete_delayed_items[tid]) / 1000; @btrfs_delete_delayed_items_total_time = sum($dur); delete(@start_delete_delayed_items[tid]); } Before this change: @btrfs_delete_delayed_items_total_time: 9563 @delete_batches: 1001 After this change: @btrfs_delete_delayed_items_total_time: 7328 @delete_batches: 509 Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:35 +02:00
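The matching rule for deletion batches can be sketched as follows. It is an illustration under assumptions: the function names are hypothetical, and the leaf and the delayed items are reduced to sorted lists of index offsets.

```python
def batch_len_sequential(leaf_keys, slot, delayed_keys):
    """Old rule: extend the batch only while keys match and are consecutive numbers."""
    n = 0
    while (slot + n < len(leaf_keys) and n < len(delayed_keys)
           and leaf_keys[slot + n] == delayed_keys[n]
           and (n == 0 or delayed_keys[n] == delayed_keys[n - 1] + 1)):
        n += 1
    return n


def batch_len_matching(leaf_keys, slot, delayed_keys):
    """New rule: extend the batch while each consecutive delayed key equals
    the key in the next leaf slot, regardless of numeric gaps."""
    n = 0
    while (slot + n < len(leaf_keys) and n < len(delayed_keys)
           and leaf_keys[slot + n] == delayed_keys[n]):
        n += 1
    return n
```

With leaf offsets 2, 3, 4, 6, 7, 10 and delayed deletions for all of them, the old rule stops after three items while the new rule covers all six in one batch. If the leaf holds a key with no matching delayed item, both rules stop at that slot.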
Filipe Manana	36baa2c751	btrfs: refactor the delayed item deletion entry point The delayed item deletion entry point, btrfs_delete_delayed_items(), is a bit convoluted for a few reasons: 1) It's really a loop disguised with labels and goto statements; 2) There's a 'delete_fail' label which isn't only for error cases, we can jump to that label even if no error happened, if we simply don't have more delayed items to delete; 3) Unnecessarily keeps track of the current and previous items for no good reason, as after getting the next item and releasing the current one, it just jumps to the 'again' label just to look again for the first delayed item; 4) When a delayed item is not in the tree (because it was already deleted before), it releases the item while holding a path locked, which is not necessary and adds more contention to the tree, specially taking into account that the path came from a deletion search, meaning we have write locks for nodes at levels 2, 1 and 0. And releasing the item is not computationally trivial (rb tree deletion, a kfree() and some trivial things). So refactor it to use a while loop and add some comments to make it more obvious why we can have delayed items without a matching item in the tree as well as why not keep the delayed node locked all the time when running all its deletion items. This is also a preparation for some upcoming work involving delayed items. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:35 +02:00
Filipe Manana	2b1d260de1	btrfs: deal with deletion errors when deleting delayed items Currently, btrfs_delete_delayed_items() ignores any errors returned from btrfs_batch_delete_items(). This looks fishy but it's not a problem at the moment because: 1) Two of the errors returned from btrfs_batch_delete_items() are for impossible cases, cases where a delayed item does not match any item in the leaf the path points to - btrfs_delete_delayed_items() always calls btrfs_batch_delete_items() with a path that points to a leaf that contains an item matching a delayed item; 2) btrfs_batch_delete_items() may return an error from btrfs_del_items(), in which case it does not release the delayed items of the batch. At the moment this is harmless because btrfs_del_items() actually is always able to delete items, even if it returns an error - when it returns an error it's because it ended up with a leaf mostly empty (less than 1/3 full) and failed to migrate items from that leaf into its neighbour leaves - this is not critical, as all the items were deleted, we just left the tree a bit unbalanced, but it's still a valid tree and causes no harm, and future operations on the tree will eventually balance it. So even if we get an error from btrfs_del_items(), the delayed items will not be released but the next time we run delayed items we will find out, at btrfs_delete_delayed_items(), that they are not present in the tree anymore and then release them. This is all a bit subtle, and it's certainly prone to be a disaster in case btrfs_del_items() changes one day and may return errors before being able to delete all the requested items, in which case we could leave the filesystem in an inconsistent state as we would commit a transaction despite a failure from deleting items from the tree. So make btrfs_delete_delayed_items() check for any errors from the call to btrfs_batch_delete_items(). 
Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:35 +02:00
Filipe Manana	659192e668	btrfs: add assertions when deleting batches of delayed items There are a few impossible cases that btrfs_batch_delete_items() tries to deal with: 1) Getting a path pointing to a NULL leaf; 2) The leaf slot is pointing beyond the last item in the leaf; 3) We can't find a single item to delete. The first case is impossible because the given path was returned by a successful call to btrfs_search_slot(). Replace the BUG_ON() with an ASSERT for this. The second case is impossible because we are always called when a delayed item matches an item in the given leaf. So add an ASSERT() for that and if that condition is not satisfied, trigger a warning and return an error. The third case is impossible exactly because of the same reason as the second case. The given delayed item matches one item in the leaf, so we know that our batch always has at least one item. Add an ASSERT to check that, trigger a warning if that expectation fails and return an error. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:35 +02:00
Filipe Manana	6fe81a3a3a	btrfs: balance btree dirty pages and delayed items after clone and dedupe When reflinking extents (clone and deduplication), we need to touch the btree of the destination inode's subvolume, as well as potentially create a delayed inode for the destination inode (if it was not created before). However we are neither balancing the btree dirty pages nor the delayed items after such operations, so if we have a task that is doing a long series of clone or deduplication operations, it can result in accumulation of too many btree dirty pages and delayed items. So just call btrfs_btree_balance_dirty() after clone and deduplication, just like we do for every other system call that results in modifying a btree and adding delayed items. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:35 +02:00
Filipe Manana	814e77182b	btrfs: free the path earlier when creating a new inode When creating an inode, through btrfs_create_new_inode(), we release the path we allocated before once we don't need it anymore. But we keep it allocated until we return from that function, which is wasteful because after we release the path we do several things that can allocate yet another path: inheriting properties, setting the xattrs used by ACLs and security modules, adding an orphan item (O_TMPFILE case) or adding a dir item (for the non-O_TMPFILE case). So instead of releasing the path once we don't need it anymore, free it instead. This way we avoid having two paths allocated until we return from btrfs_create_new_inode(). Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:35 +02:00
Filipe Manana	ca6dee6b79	btrfs: balance btree dirty pages and delayed items after a rename A rename operation modifies a subvolume's btree, to remove the old dir item, add the new dir item, remove an inode ref and add a new inode ref. It can also create the delayed inode for the inodes involved in the operation, and it creates two delayed dir index items, one to delete the old name and another one to add the new name. However we are neither balancing the btree dirty pages nor the delayed items after a rename, which can result in accumulation of too many btree dirty pages and delayed items, especially if a task is doing a series of rename operations (for example it can happen for package installations/upgrades through the zypper tool). So just call btrfs_btree_balance_dirty() after a rename, just like we do for every other system call that results in modifying a btree and adding delayed items. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:35 +02:00
Qu Wenruo	b8bea09a45	btrfs: add trace event for submitted RAID56 bio Add tracepoint for better insight to how the RAID56 data are submitted. The output looks like this: (trace event header and UUID skipped) raid56_read_partial: full_stripe=389152768 devid=3 type=DATA1 offset=32768 opf=0x0 physical=323059712 len=32768 raid56_read_partial: full_stripe=389152768 devid=1 type=DATA2 offset=0 opf=0x0 physical=67174400 len=65536 raid56_write_stripe: full_stripe=389152768 devid=3 type=DATA1 offset=0 opf=0x1 physical=323026944 len=32768 raid56_write_stripe: full_stripe=389152768 devid=2 type=PQ1 offset=0 opf=0x1 physical=323026944 len=32768 The above debug output is from a 32K data write into an empty RAID56 data chunk. Some explanation on the event output: full_stripe: the logical bytenr of the full stripe devid: btrfs devid type: raid stripe type. DATA1: the first data stripe DATA2: the second data stripe PQ1: the P stripe PQ2: the Q stripe offset: the offset inside the stripe. opf: the bio op type physical: the physical offset the bio is for len: the length of the bio The first two lines are from partial RMW read, which is reading the remaining data stripes from disks. The last two lines are for full stripe RMW write, which is writing the involved two 16K stripes (one for DATA1 stripe, one for P stripe). The stripe for DATA2 doesn't need to be written. There are 5 types of trace events: - raid56_read_partial Read remaining data for regular read/write path. - raid56_write_stripe Write the modified stripes for regular read/write path. - raid56_scrub_read_recover Read remaining data for scrub recovery path. - raid56_scrub_write_stripe Write the modified stripes for scrub path. - raid56_scrub_read Read remaining data for scrub path. Also, since the trace events are included at super.c, we have to export needed structure definitions to 'raid56.h' and include the header in super.c, or we're unable to access those members. 
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ reformat comments ] Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:34 +02:00
Qu Wenruo	4d10046613	btrfs: update stripe_sectors::uptodate in steal_rbio [BUG] With added debugging, it turns out the following write sequence would cause extra read which is unnecessary: # xfs_io -f -s -c "pwrite -b 32k 0 32k" -c "pwrite -b 32k 32k 32k" \ -c "pwrite -b 32k 64k 32k" -c "pwrite -b 32k 96k 32k" \ $mnt/file The debug message looks like this (btrfs header skipped): partial rmw, full stripe=389152768 opf=0x0 devid=3 type=1 offset=32768 physical=323059712 len=32768 partial rmw, full stripe=389152768 opf=0x0 devid=1 type=2 offset=0 physical=67174400 len=65536 full stripe rmw, full stripe=389152768 opf=0x1 devid=3 type=1 offset=0 physical=323026944 len=32768 full stripe rmw, full stripe=389152768 opf=0x1 devid=2 type=-1 offset=0 physical=323026944 len=32768 partial rmw, full stripe=298844160 opf=0x0 devid=1 type=1 offset=32768 physical=22052864 len=32768 partial rmw, full stripe=298844160 opf=0x0 devid=2 type=2 offset=0 physical=277872640 len=65536 full stripe rmw, full stripe=298844160 opf=0x1 devid=1 type=1 offset=0 physical=22020096 len=32768 full stripe rmw, full stripe=298844160 opf=0x1 devid=3 type=-1 offset=0 physical=277872640 len=32768 partial rmw, full stripe=389152768 opf=0x0 devid=3 type=1 offset=0 physical=323026944 len=32768 partial rmw, full stripe=389152768 opf=0x0 devid=1 type=2 offset=0 physical=67174400 len=65536 ^^^^ Still partial read, even 389152768 is already cached by the first. write. full stripe rmw, full stripe=389152768 opf=0x1 devid=3 type=1 offset=32768 physical=323059712 len=32768 full stripe rmw, full stripe=389152768 opf=0x1 devid=2 type=-1 offset=32768 physical=323059712 len=32768 partial rmw, full stripe=298844160 opf=0x0 devid=1 type=1 offset=0 physical=22020096 len=32768 partial rmw, full stripe=298844160 opf=0x0 devid=2 type=2 offset=0 physical=277872640 len=65536 ^^^^ Still partial read for 298844160. 
full stripe rmw, full stripe=298844160 opf=0x1 devid=1 type=1 offset=32768 physical=22052864 len=32768 full stripe rmw, full stripe=298844160 opf=0x1 devid=3 type=-1 offset=32768 physical=277905408 len=32768 This means every 32K write, even if it is in the same full stripe, still triggers a read for previously cached data. This would cause extra RAID56 IO, making the btrfs raid56 cache useless. [CAUSE] Commit `d4e28d9b5f` ("btrfs: raid56: make steal_rbio() subpage compatible") tries to make steal_rbio() subpage compatible, but during that conversion, there is one thing missing. We no longer rely on PageUptodate(rbio->stripe_pages[i]), but rbio->stripe_nsectors[i].uptodate to determine if a sector is uptodate. This means, previously if we switch the pointer, everything is done, as the PageUptodate flag is still bound to that page. But now we have to manually mark the involved sectors uptodate, or later raid56_rmw_stripe() will find the stolen sector is not uptodate, and assemble the read bio for it, wasting IO. [FIX] We can easily fix the bug by also updating rbio->stripe_sectors[].uptodate in steal_rbio(). 
With this fixed, now the same write pattern no longer leads to the same unnecessary read: partial rmw, full stripe=389152768 opf=0x0 devid=3 type=1 offset=32768 physical=323059712 len=32768 partial rmw, full stripe=389152768 opf=0x0 devid=1 type=2 offset=0 physical=67174400 len=65536 full stripe rmw, full stripe=389152768 opf=0x1 devid=3 type=1 offset=0 physical=323026944 len=32768 full stripe rmw, full stripe=389152768 opf=0x1 devid=2 type=-1 offset=0 physical=323026944 len=32768 partial rmw, full stripe=298844160 opf=0x0 devid=1 type=1 offset=32768 physical=22052864 len=32768 partial rmw, full stripe=298844160 opf=0x0 devid=2 type=2 offset=0 physical=277872640 len=65536 full stripe rmw, full stripe=298844160 opf=0x1 devid=1 type=1 offset=0 physical=22020096 len=32768 full stripe rmw, full stripe=298844160 opf=0x1 devid=3 type=-1 offset=0 physical=277872640 len=32768 ^^^ No more partial read, directly into the write path. full stripe rmw, full stripe=389152768 opf=0x1 devid=3 type=1 offset=32768 physical=323059712 len=32768 full stripe rmw, full stripe=389152768 opf=0x1 devid=2 type=-1 offset=32768 physical=323059712 len=32768 full stripe rmw, full stripe=298844160 opf=0x1 devid=1 type=1 offset=32768 physical=22052864 len=32768 full stripe rmw, full stripe=298844160 opf=0x1 devid=3 type=-1 offset=32768 physical=277905408 len=32768 Fixes: `d4e28d9b5f` ("btrfs: raid56: make steal_rbio() subpage compatible") Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:34 +02:00
David Sterba	21a8935ead	btrfs: remove redundant calls to flush_dcache_page Both memzero_page and memcpy_to_page already call flush_dcache_page so we can remove the calls from btrfs code. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:34 +02:00
Qu Wenruo	bd8f7e6277	btrfs: only write the sectors in the vertical stripe which has data stripes If we have only 8K partial write at the beginning of a full RAID56 stripe, we will write the following contents: 0 8K 32K 64K Disk 1 (data): \|XX\| \| \| Disk 2 (data): \| \| \| Disk 3 (parity): \|XXXXXXXXXXXXXXX\|XXXXXXXXXXXXXXX\| \|X\| means the sector will be written back to disk. Note that, although we won't write any sectors from disk 2, but we will write the full 64KiB of parity to disk. This behavior is fine for now, but not for the future (especially for RAID56J, as we waste quite some space to journal the unused parity stripes). So here we will also utilize the btrfs_raid_bio::dbitmap, anytime we queue a higher level bio into an rbio, we will update rbio::dbitmap to indicate which vertical stripes we need to writeback. And at finish_rmw(), we also check dbitmap to see if we need to write any sector in the vertical stripe. So after the patch, above example will only lead to the following writeback pattern: 0 8K 32K 64K Disk 1 (data): \|XX\| \| \| Disk 2 (data): \| \| \| Disk 3 (parity): \|XX\| \| \| Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:34 +02:00
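The dbitmap idea in the commit above can be illustrated with a minimal C sketch. It assumes 4 KiB sectors and the fixed 64 KiB stripe length; the helper names `demo_mark_write_range()` and `demo_parity_sectors_to_write()` are hypothetical and are not the kernel's functions. Each bit stands for one vertical stripe (one sector column across all disks), and only columns whose bit is set need their parity sector written back.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_STRIPE_LEN      (64u * 1024u)  /* fixed stripe length */
#define DEMO_SECTOR_SIZE     (4u * 1024u)   /* assumed sector size */
#define DEMO_STRIPE_NSECTORS (DEMO_STRIPE_LEN / DEMO_SECTOR_SIZE)

/*
 * Hypothetical helper: set one bit per vertical stripe touched by a write
 * of @len bytes starting at @offset inside a data stripe.
 */
static unsigned long demo_mark_write_range(unsigned long dbitmap,
                                           uint32_t offset, uint32_t len)
{
    uint32_t first, last, i;

    assert(len > 0 && offset + len <= DEMO_STRIPE_LEN);
    first = offset / DEMO_SECTOR_SIZE;
    last = (offset + len - 1) / DEMO_SECTOR_SIZE;

    for (i = first; i <= last; i++)
        dbitmap |= 1UL << i;
    return dbitmap;
}

/* Count the parity sectors that need writeback: one per set bit. */
static unsigned int demo_parity_sectors_to_write(unsigned long dbitmap)
{
    unsigned int count = 0;
    uint32_t i;

    for (i = 0; i < DEMO_STRIPE_NSECTORS; i++)
        if (dbitmap & (1UL << i))
            count++;
    return count;
}
```

For the 8K write at offset 0 from the example, `demo_mark_write_range(0, 0, 8192)` sets bits 0 and 1, so this sketch reports two parity sectors to write instead of all sixteen.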
Qu Wenruo	381b9b4c9c	btrfs: use integrated bitmaps for scrub_parity::dbitmap and ebitmap Previously we use "unsigned long *" for those two bitmaps. But since we only support fixed stripe length (64KiB, already checked in tree-checker), "unsigned long *" is really a waste of memory, while we can just use "unsigned long". This saves us 8 bytes in total for scrub_parity. To be extra safe, add an ASSERT() making sure calculated @nsectors is always smaller than BITS_PER_LONG. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:34 +02:00
Qu Wenruo	c67c68eb57	btrfs: use integrated bitmaps for btrfs_raid_bio::dbitmap and finish_pbitmap Previously we use "unsigned long *" for those two bitmaps. But since we only support fixed stripe length (64KiB, already checked in tree-checker), "unsigned long *" is really a waste of memory, while we can just use "unsigned long". This saves us 8 bytes in total for btrfs_raid_bio. To be extra safe, add an ASSERT() making sure calculated @stripe_nsectors is always smaller than BITS_PER_LONG. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:34 +02:00
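As a rough illustration of the two "integrated bitmap" commits above, the sketch below keeps both bitmaps as plain `unsigned long` members and checks that the sector count of one stripe fits into a single word. The struct `demo_rbio` and its helpers are hypothetical stand-ins, not the kernel's `btrfs_raid_bio`, and the check here only requires that the count fits into one word.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

#define DEMO_BITS_PER_LONG ((unsigned int)(sizeof(unsigned long) * CHAR_BIT))
#define DEMO_STRIPE_LEN    (64u * 1024u)    /* fixed stripe length */

/* Hypothetical stand-in: the bitmaps live inside the structure itself. */
struct demo_rbio {
    unsigned long dbitmap;
    unsigned long finish_pbitmap;
    unsigned int stripe_nsectors;
};

static void demo_rbio_init(struct demo_rbio *rbio, uint32_t sectorsize)
{
    assert(sectorsize > 0 && DEMO_STRIPE_LEN % sectorsize == 0);
    rbio->stripe_nsectors = DEMO_STRIPE_LEN / sectorsize;
    /* Every sector of one stripe must map to a bit of a single long. */
    assert(rbio->stripe_nsectors <= DEMO_BITS_PER_LONG);
    rbio->dbitmap = 0;
    rbio->finish_pbitmap = 0;
}

static void demo_set_sector(unsigned long *bitmap, unsigned int nr)
{
    assert(nr < DEMO_BITS_PER_LONG);
    *bitmap |= 1UL << nr;
}

static int demo_test_sector(unsigned long bitmap, unsigned int nr)
{
    assert(nr < DEMO_BITS_PER_LONG);
    return (int)((bitmap >> nr) & 1UL);
}
```

With a 4 KiB sector size the stripe has 16 sectors, which fits even where `unsigned long` is 32 bits wide, so no separate bitmap allocation is needed in this sketch.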
Nikolay Borisov	099aa97213	btrfs: use btrfs_try_lock_balance in btrfs_ioctl_balance This eliminates 2 labels and makes the code generally more streamlined. Also rename the 'out_bargs' label to 'out_unlock' since bargs is going to be freed under the 'out' label. This also fixes a memory leak since bargs wasn't correctly freed in one of the condition which are now moved in btrfs_try_lock_balance. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:34 +02:00
Nikolay Borisov	7fb10ed89e	btrfs: introduce btrfs_try_lock_balance This function contains the factored out locking sequence of btrfs_ioctl_balance. Having this piece of code separate helps to simplify btrfs_ioctl_balance which has become too complicated. This will be used in the next patch to streamline the logic in btrfs_ioctl_balance. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:34 +02:00
Christoph Hellwig	1e87770cb3	btrfs: use btrfs_bio_for_each_sector in btrfs_check_read_dio_bio Use the new btrfs_bio_for_each_sector iterator to simplify btrfs_check_read_dio_bio. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:34 +02:00
Qu Wenruo	261d812b04	btrfs: add a helper to iterate through a btrfs_bio with sector sized chunks Add a helper that works similar to __bio_for_each_segment, but instead of iterating over PAGE_SIZE chunks it iterates over each sector. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> [hch: split from a larger patch, and iterate over the offset instead of the offset bits] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> [ add parameter comments ] Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:34 +02:00
Christoph Hellwig	a89ce08ce6	btrfs: factor out a btrfs_csum_ptr helper Add a helper to find the csum for a byte offset into the csum buffer. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:34 +02:00
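A helper of the kind described above could look roughly like the following C sketch. It assumes the byte offset is relative to the start of the data covered by the checksum buffer and that checksums are stored back to back, one per sector; `struct demo_fs_info` and `demo_csum_ptr()` are hypothetical names, not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the filesystem parameters the lookup needs. */
struct demo_fs_info {
    uint32_t sectorsize_bits;   /* log2 of the sector size, e.g. 12 for 4 KiB */
    uint32_t csum_size;         /* bytes per checksum, e.g. 4 */
};

/*
 * Return a pointer to the checksum of the sector that contains @offset,
 * assuming one checksum per sector packed contiguously in @csums.
 */
static uint8_t *demo_csum_ptr(const struct demo_fs_info *fs_info,
                              uint8_t *csums, uint64_t offset)
{
    uint64_t sector_index;

    assert(fs_info->csum_size > 0);
    sector_index = offset >> fs_info->sectorsize_bits;
    return csums + sector_index * fs_info->csum_size;
}
```

With 4 KiB sectors and 4-byte checksums, any offset inside the third sector (8192 to 12287) maps to byte 8 of the buffer in this sketch.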
Christoph Hellwig	97861cd166	btrfs: refactor end_bio_extent_readpage code flow Untangle the goto and move the code it jumps to so it goes in the order of the most likely states first. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:34 +02:00
Christoph Hellwig	a5aa7ab6e7	btrfs: factor out a helper to end a single sector buffer I/O Add a helper to end I/O on a single sector, which will come in handy with the new read repair code. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:33 +02:00
Qu Wenruo	fd5a6f63cb	btrfs: remove duplicated parameters from submit_data_read_repair() The function submit_data_read_repair() is only called for buffered data read path, thus those members can be calculated using bvec directly: - start start = page_offset(bvec->bv_page) + bvec->bv_offset; - end end = start + bvec->bv_len - 1; - page page = bvec->bv_page; - pgoff pgoff = bvec->bv_offset; Thus we can safely replace those 4 parameters with just one bio_vec. Also remove the unused return value. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> [hch: also remove the return value] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:33 +02:00
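The four formulas listed in the commit above can be checked with a small stand-alone C sketch. The types `demo_page`, `demo_bio_vec` and `demo_range`, the 4 KiB page size, and `demo_page_offset()` are simplified, hypothetical stand-ins for the kernel's `struct page`, `struct bio_vec` and `page_offset()`; only the arithmetic is taken from the commit message.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096u    /* assumed page size */

struct demo_page {
    uint64_t index;             /* page index within the file mapping */
};

struct demo_bio_vec {
    struct demo_page *bv_page;
    uint32_t bv_len;
    uint32_t bv_offset;
};

struct demo_range {
    uint64_t start;             /* first file offset covered */
    uint64_t end;               /* last file offset covered (inclusive) */
    struct demo_page *page;
    uint32_t pgoff;
};

/* Simplified model of page_offset(): file offset of the page start. */
static uint64_t demo_page_offset(const struct demo_page *page)
{
    return page->index * DEMO_PAGE_SIZE;
}

/* Derive the four former parameters from a single bio_vec. */
static struct demo_range demo_range_from_bvec(const struct demo_bio_vec *bvec)
{
    struct demo_range r;

    assert(bvec->bv_len > 0);
    r.start = demo_page_offset(bvec->bv_page) + bvec->bv_offset;
    r.end = r.start + bvec->bv_len - 1;
    r.page = bvec->bv_page;
    r.pgoff = bvec->bv_offset;
    return r;
}
```

Since every value follows from the `bio_vec` alone, passing the four parameters separately carries no extra information, which is the motivation the commit gives for dropping them.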
Qu Wenruo	ae643a74eb	btrfs: introduce a data checksum checking helper Although we have several pieces of data csum verification code, we have never had a function that just verifies the checksum for one sector. Function check_data_csum() does extra work for error reporting, thus it requires a lot of extra things like file offset, bio_offset etc. Function btrfs_verify_data_csum() is even worse, it will utilize the page checked flag, which means it can not be utilized for direct IO pages. Here we introduce a new helper, btrfs_check_sector_csum(), which really only accepts a sector in a page and an expected checksum pointer. We use this function to implement check_data_csum(), and export it for an incoming patch. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> [hch: keep passing the csum array as an argument, as the callers want to print it, rename per request] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:33 +02:00
Qu Wenruo	b036f47996	btrfs: quit early if the fs has no RAID56 support for raid56 related checks The following functions do special handling for RAID56 chunks: - btrfs_is_parity_mirror() Check if the range is in RAID56 chunks. - btrfs_full_stripe_len() Either return sectorsize for non-RAID56 profiles or full stripe length for RAID56 chunks. But if a filesystem has no RAID56 chunks, it will not have the RAID56 incompat flags, and we can skip the chunk tree lookup completely. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:33 +02:00
Fanjun Kong	1280d2d165	btrfs: use PAGE_ALIGNED instead of IS_ALIGNED The <linux/mm.h> already provides the PAGE_ALIGNED macro. Let's use it instead of IS_ALIGNED and passing PAGE_SIZE directly. Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Fanjun Kong <bh1scw@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:33 +02:00
Pankaj Raghav	31f3726980	btrfs: zoned: fix comment description for sb_write_pointer logic Fix the comment to represent the actual logic used for sb_write_pointer - Empty[0] && In use[1] should be an invalid state instead of returning zone 0 wp - Empty[0] && Full[1] should be returning zone 0 wp instead of zone 1 wp - In use[0] && Empty[1] should be returning zone 0 wp instead of being an invalid state - In use[0] && Full[1] should be returning zone 0 wp instead of returning zone 1 wp - Full[0] && Empty[1] should be returning zone 1 wp instead of returning zone 0 wp - Full[0] && In use[1] should be returning zone 1 wp instead of returning zone 0 wp Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:33 +02:00
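The six corrected cases listed in the commit above can be written down as a small decision function. This is only a sketch of the documented table: the enum and function names are hypothetical, and the three combinations the commit message does not mention (both zones in the same state) are deliberately reported as "not covered" rather than guessed.

```c
#include <assert.h>

enum demo_zone_state { DEMO_EMPTY, DEMO_IN_USE, DEMO_FULL };

enum demo_wp_choice {
    DEMO_WP_ZONE0,          /* use the write pointer of zone 0 */
    DEMO_WP_ZONE1,          /* use the write pointer of zone 1 */
    DEMO_WP_INVALID,        /* invalid state */
    DEMO_WP_NOT_COVERED     /* combination not listed in the commit message */
};

/* Map the state of the two superblock log zones to the documented choice. */
static enum demo_wp_choice demo_sb_wp_choice(enum demo_zone_state z0,
                                             enum demo_zone_state z1)
{
    assert(z0 <= DEMO_FULL && z1 <= DEMO_FULL);

    if (z0 == DEMO_EMPTY && z1 == DEMO_IN_USE)
        return DEMO_WP_INVALID;
    if (z0 == DEMO_EMPTY && z1 == DEMO_FULL)
        return DEMO_WP_ZONE0;
    if (z0 == DEMO_IN_USE && z1 == DEMO_EMPTY)
        return DEMO_WP_ZONE0;
    if (z0 == DEMO_IN_USE && z1 == DEMO_FULL)
        return DEMO_WP_ZONE0;
    if (z0 == DEMO_FULL && z1 == DEMO_EMPTY)
        return DEMO_WP_ZONE1;
    if (z0 == DEMO_FULL && z1 == DEMO_IN_USE)
        return DEMO_WP_ZONE1;
    return DEMO_WP_NOT_COVERED;
}
```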
David Sterba	143823cf4d	btrfs: fix typos in comments Codespell has found a few typos. Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-25 17:44:33 +02:00
Linus Torvalds	972a278fe6	for-5.19-rc7-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmLRpPgACgkQxWXV+ddt WDtu/BAAnfx7CXKIfWKpz6FZEio9Qb3mUHVOglyKzqR0qB72OdrC1dQMvEWPJc6h N65di6+8tTNmRIlaFBMU0MDHODR2aDRpDtlR9eUzUuidTc4iOp1fi31uBwl31r7b k8mCZBc/IAdfH13lBtcfkb2HGid7rik5ZC6Kx/glMcqh647QkSMAleupUsIYHKsK IgcUWuN3wFIUK2WVgsja7+ljlwIHBHKRp9yrEYw+ef/B0NCNKvOnrIOPJzO7nxMP 1FbqJ6F7u7HjoMFcMwn5rbV/BoIwSSvXyKRqOW+EhGeQR/imVmkH9jXJ7wXdblSz IvSqaZ0DaWWSvivdMpwbr8Z0Cu4iIYhVY6PSA0hukR63qB5GwKKJ6j1L0zoYoz8C IDWJPW03FNRIu5ZOduvUQ3qG7jcJQZ3WPCCfrDST1cO2xHT/7f65Tjz4k0hvp4za edITetC1mEv310CHeGsJaLxGYPNrRe38VZYPxgJ7yFpteGYjh0ZwsuyUHb4MH1no JWwgElNW+m1BatdWSUBYk6xhqod1s2LOFPNqo7jNlv8I27hPViqCbBA2i9FlkXf+ FwL5kWyJXs69gfjUIj59381Z0U1VdA1tvU8GP2m2+JvIDS6ooAcZj7yEQ69mCZxi 2RFJIU0NFbnc/5j2ARSzOTGs9glDD0yffgXJM+cK+TWsQ3AC31I= =/47A -----END PGP SIGNATURE----- Merge tag 'for-5.19-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs reverts from David Sterba: "Due to a recent report [1] we need to revert the radix tree to xarray conversion patches. There's a problem with sleeping under spinlock, when xa_insert could allocate memory under pressure. We use GFP_NOFS so this is a real problem that we unfortunately did not discover during review. I'm sorry to do such change at rc6 time but the revert is IMO the safer option, there are patches to use mutex instead of the spin locks but that would need more testing. The revert branch has been tested on a few setups, all seem ok. 
The conversion to xarray will be revisited in the future" Link: https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/ [1] * tag 'for-5.19-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Revert "btrfs: turn delayed_nodes_tree into an XArray" Revert "btrfs: turn name_cache radix tree into XArray in send_ctx" Revert "btrfs: turn fs_info member buffer_radix into XArray" Revert "btrfs: turn fs_roots_radix in btrfs_fs_info into an XArray"	2022-07-16 13:48:55 -07:00
David Sterba	088aea3b97	Revert "btrfs: turn delayed_nodes_tree into an XArray" This reverts commit `253bf57555`. Revert the xarray conversion, there's a problem with potential sleep-inside-spinlock [1] when calling xa_insert that triggers GFP_NOFS allocation. The radix tree used the preloading mechanism to avoid sleeping but this is not available in xarray. Conversion from spin lock to mutex is possible but at time of rc6 is riskier than a clean revert. [1] https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/ Reported-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-15 19:15:19 +02:00
David Sterba	5b8418b843	Revert "btrfs: turn name_cache radix tree into XArray in send_ctx" This reverts commit `4076942021`. Revert the xarray conversion, there's a problem with potential sleep-inside-spinlock [1] when calling xa_insert that triggers GFP_NOFS allocation. The radix tree used the preloading mechanism to avoid sleeping but this is not available in xarray. Conversion from spin lock to mutex is possible but at time of rc6 is riskier than a clean revert. [1] https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/ Reported-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-15 19:14:58 +02:00
David Sterba	01cd390903	Revert "btrfs: turn fs_info member buffer_radix into XArray" This reverts commit `8ee922689d`. Revert the xarray conversion, there's a problem with potential sleep-inside-spinlock [1] when calling xa_insert that triggers GFP_NOFS allocation. The radix tree used the preloading mechanism to avoid sleeping but this is not available in xarray. Conversion from spin lock to mutex is possible but at time of rc6 is riskier than a clean revert. [1] https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/ Reported-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-15 19:14:33 +02:00
David Sterba	fc7cbcd489	Revert "btrfs: turn fs_roots_radix in btrfs_fs_info into an XArray" This reverts commit `48b36a602a`. Revert the xarray conversion, there's a problem with potential sleep-inside-spinlock [1] when calling xa_insert that triggers GFP_NOFS allocation. The radix tree used the preloading mechanism to avoid sleeping but this is not available in xarray. Conversion from spin lock to mutex is possible but at time of rc6 is riskier than a clean revert. [1] https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/ Reported-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-15 19:14:28 +02:00
Bart Van Assche	bf9486d6dd	fs/btrfs: Use the enum req_op and blk_opf_t types Improve static type checking by using the enum req_op type for variables that represent a request operation and the new blk_opf_t type for variables that represent request flags. Acked-by: David Sterba <dsterba@suse.com> Cc: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-51-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-14 12:14:32 -06:00
Linus Torvalds	5a29232d87	for-5.19-rc6-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmLMiQQACgkQxWXV+ddt WDvBQg/+I1ebfW2DFY8kBwy7c1qKZWIhNx1VVk2AegIXvrW/Tos7wp5O6fi7p/jL d6k8zO/zFLlfiI4Ckmz3gt7cxaMTNXxr6+GQpNNm1b92Wdcy1a+3gquzcehT9Q10 ZB4ecPWzEDXgORvdBYG2eD2Z8PrsF0Wu88XRDiiJOBQLjZ+k2sVp8QvJlOllLDoC m7rPoq98jC6VpZwFJ+fGk2jC7y4+1QXrOuQMy7LRTe59Thp6wUFDDPtkKfr5scDC UxkctlUdInD7A6DVvPzwaBFNoT8UeEByGHcMd3KjjrTdmqSWW6k8FiF4ckZwA3zJ oPdJVzdC5a2W7t6BHw+t7VNmkKd+swnr2sVSGQ8eIzF7z3/JSqyYVwziOD1YzAdU QUmawWm4/SFvsbO8aoLrEKNbUiTgQwVbKzJh4Dhu9VJ43jeCwCX7pa/uZI4evgyG T0tuwm58bWCk4y1o1fcFYgf4JcVgK23F2vKckUFZeHoV3Q8R0DnPCCGTqs1qT5vY irZ9AIawmaR09JptMjjsAEjDA9qb16Ut/J6/anukyCgL610EyYZG7zb1WH1cUD1o zNXY6O/iKyNdiXj7V1fTMiG/M8hGDcFu4pOpBk3hFjHEXX9BefoVC0J5YzvCecPz isqboD5Lt1I4mrzac1X+serMYfVbFH6+tsEPBQBZf/o/a0u43jI= =Cxvn -----END PGP SIGNATURE----- Merge tag 'for-5.19-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A more fixes that seem to me to be important enough to get merged before release: - in zoned mode, fix leak of a structure when reading zone info, this happens on normal path so this can be significant - in zoned mode, revert an optimization added in 5.19-rc1 to finish a zone when the capacity is full, but this is not reliable in all cases - try to avoid short reads for compressed data or inline files when it's a NOWAIT read, applications should handle that but there are two, qemu and mariadb, that are affected" * tag 'for-5.19-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: zoned: drop optimization of zone finish btrfs: zoned: fix a leaked bioc in read_zone_info btrfs: return -EAGAIN for NOWAIT dio reads/writes on compressed and inline extents	2022-07-11 14:41:44 -07:00
Naohiro Aota	b3a3b02557	btrfs: zoned: drop optimization of zone finish We have an optimization in do_zone_finish() to send REQ_OP_ZONE_FINISH only when necessary, i.e. we don't send REQ_OP_ZONE_FINISH when we assume we wrote fully into the zone. The assumption is determined by "alloc_offset == capacity". This condition won't work if the last ordered extent is canceled due to some errors. In that case, we consider the zone is deactivated without sending the finish command while it's still active. This inconsistency results in activating another block group while we cannot really activate the underlying zone, which causes active zones exceeded errors like the ones below. BTRFS error (device nvme3n2): allocation failed flags 1, wanted 520192 tree-log 0, relocation: 0 nvme3n2: I/O Cmd(0x7d) @ LBA 160432128, 127 blocks, I/O Error (sct 0x1 / sc 0xbd) MORE DNR active zones exceeded error, dev nvme3n2, sector 0 op 0xd:(ZONE_APPEND) flags 0x4800 phys_seg 1 prio class 0 nvme3n2: I/O Cmd(0x7d) @ LBA 160432128, 127 blocks, I/O Error (sct 0x1 / sc 0xbd) MORE DNR active zones exceeded error, dev nvme3n2, sector 0 op 0xd:(ZONE_APPEND) flags 0x4800 phys_seg 1 prio class 0 Fix the issue by removing the optimization for now. Fixes: `8376d9e1ed` ("btrfs: zoned: finish superblock zone once no space left for new SB") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-08 19:18:00 +02:00
Christoph Hellwig	2963457829	btrfs: zoned: fix a leaked bioc in read_zone_info The bioc would leak on the normal completion path and also on the RAID56 check (but that one won't happen in practice due to the invalid combination with zoned mode). Fixes: `7db1c5d14d` ("btrfs: zoned: support dev-replace in zoned filesystems") CC: stable@vger.kernel.org # 5.16+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> [ update changelog ] Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-08 19:13:32 +02:00
Filipe Manana	a4527e1853	btrfs: return -EAGAIN for NOWAIT dio reads/writes on compressed and inline extents When doing a direct IO read or write, we always return -ENOTBLK when we find a compressed extent (or an inline extent) so that we fall back to buffered IO. This however is not ideal in case we are in a NOWAIT context (io_uring for example), because buffered IO can block and we currently have no support for NOWAIT semantics for buffered IO, so if we need to fall back to buffered IO we should first signal the caller that we may need to block by returning -EAGAIN instead. This behaviour can also result in short reads being returned to user space, which, although not incorrect since user space should be able to deal with partial reads, is somewhat surprising, and even some popular applications like QEMU (Link tag #1) and MariaDB (Link tag #2) don't deal with short reads properly (or at all). The short read case happens when we try to read from a range that has a non-compressed and non-inline extent followed by a compressed extent. After having read the first extent, when we find the compressed extent we return -ENOTBLK from btrfs_dio_iomap_begin(), which results in iomap treating the request as a short read, returning 0 (success) and waiting for previously submitted bios to complete (this happens at fs/iomap/direct-io.c:__iomap_dio_rw()). After that, and while at btrfs_file_read_iter(), we call filemap_read() to use buffered IO to read the remaining data, and pass it the number of bytes we were able to read with direct IO. Then at filemap_read() if we get a page fault error when accessing the read buffer, we return a partial read instead of an -EFAULT error, because the number of bytes previously read is greater than zero. So fix this by returning -EAGAIN for NOWAIT direct IO when we find a compressed or an inline extent. 
Reported-by: Dominique MARTINET <dominique.martinet@atmark-techno.com> Link: https://lore.kernel.org/linux-btrfs/YrrFGO4A1jS0GI0G@atmark-techno.com/ Link: https://jira.mariadb.org/browse/MDEV-27900?focusedCommentId=216582&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-216582 Tested-by: Dominique MARTINET <dominique.martinet@atmark-techno.com> CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-08 19:13:22 +02:00
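The decision described in the commit above can be summarized in a few lines of C. This is a simplified sketch, not the kernel's btrfs_dio_iomap_begin(): the enum and `demo_dio_extent_check()` are hypothetical, and the `ENOTBLK` fallback value is only there for platforms whose `<errno.h>` lacks it.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#ifndef ENOTBLK
#define ENOTBLK 15  /* assumed fallback value for platforms without ENOTBLK */
#endif

enum demo_extent_kind { DEMO_REGULAR, DEMO_COMPRESSED, DEMO_INLINE };

/*
 * Return 0 if direct IO can proceed on this extent. Otherwise return
 * -EAGAIN for NOWAIT callers (they must retry in a context that may
 * block), or -ENOTBLK to request the fallback to buffered IO.
 */
static int demo_dio_extent_check(enum demo_extent_kind kind, bool nowait)
{
    assert(kind <= DEMO_INLINE);

    if (kind == DEMO_REGULAR)
        return 0;
    return nowait ? -EAGAIN : -ENOTBLK;
}
```

A NOWAIT caller such as io_uring would see -EAGAIN before any bytes are read in this model, so it never gets a partial result caused by the buffered fallback.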
Roman Gushchin	e33c267ab7	mm: shrinkers: provide shrinkers with names Currently shrinkers are anonymous objects. For debugging purposes they can be identified by count/scan function names, but it's not always useful: e.g. for superblock's shrinkers it's nice to have at least an idea of to which superblock the shrinker belongs. This commit adds names to shrinkers. register_shrinker() and prealloc_shrinker() functions are extended to take a format and arguments to master a name. In some cases it's not possible to determine a good name at the time when a shrinker is allocated. For such cases shrinker_debugfs_rename() is provided. The expected format is: <subsystem>-<shrinker_type>[:<instance>]-<id> For some shrinkers an instance can be encoded as (MAJOR:MINOR) pair. After this change the shrinker debugfs directory looks like: $ cd /sys/kernel/debug/shrinker/ $ ls dquota-cache-16 sb-devpts-28 sb-proc-47 sb-tmpfs-42 mm-shadow-18 sb-devtmpfs-5 sb-proc-48 sb-tmpfs-43 mm-zspool:zram0-34 sb-hugetlbfs-17 sb-pstore-31 sb-tmpfs-44 rcu-kfree-0 sb-hugetlbfs-33 sb-rootfs-2 sb-tmpfs-49 sb-aio-20 sb-iomem-12 sb-securityfs-6 sb-tracefs-13 sb-anon_inodefs-15 sb-mqueue-21 sb-selinuxfs-22 sb-xfs:vda1-36 sb-bdev-3 sb-nsfs-4 sb-sockfs-8 sb-zsmalloc-19 sb-bpf-32 sb-pipefs-14 sb-sysfs-26 thp-deferred_split-10 sb-btrfs:vda2-24 sb-proc-25 sb-tmpfs-1 thp-zero-9 sb-cgroup2-30 sb-proc-39 sb-tmpfs-27 xfs-buf:vda1-37 sb-configfs-23 sb-proc-41 sb-tmpfs-29 xfs-inodegc:vda1-38 sb-dax-11 sb-proc-45 sb-tmpfs-35 sb-debugfs-7 sb-proc-46 sb-tmpfs-40 [roman.gushchin@linux.dev: fix build warnings] Link: https://lkml.kernel.org/r/Yr+ZTnLb9lJk6fJO@castle Reported-by: kernel test robot <lkp@intel.com> Link: https://lkml.kernel.org/r/20220601032227.4076670-4-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Cc: Dave Chinner <dchinner@redhat.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Kent Overstreet 
<kent.overstreet@gmail.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:40 -07:00
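The naming scheme described in the commit above, `<subsystem>-<shrinker_type>[:<instance>]-<id>`, can be illustrated with a small userspace sketch. This is not the kernel implementation: `format_shrinker_name()` and its parameters are hypothetical names chosen for illustration, and the split of a name such as `sb-btrfs:vda2-24` into subsystem, type, instance and id is an assumption read off the listing in the commit message.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical userspace helper (not a kernel API): build a name of the form
 *   <subsystem>-<shrinker_type>[:<instance>]-<id>
 * The ":<instance>" part is emitted only when an instance string is given.
 * Returns 0 on success, -1 if the name does not fit into buf.
 */
static int format_shrinker_name(char *buf, size_t len, const char *subsystem,
                                const char *type, const char *instance, int id)
{
    int n;

    assert(buf != NULL && subsystem != NULL && type != NULL);

    if (instance != NULL && instance[0] != '\0')
        n = snprintf(buf, len, "%s-%s:%s-%d", subsystem, type, instance, id);
    else
        n = snprintf(buf, len, "%s-%s-%d", subsystem, type, id);

    /* snprintf() reports the length it wanted; treat truncation as failure. */
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Under these assumptions, calling the helper with `"sb"`, `"btrfs"`, `"vda2"` and `24` yields `sb-btrfs:vda2-24`, and calling it with `"rcu"`, `"kfree"`, no instance and `0` yields `rcu-kfree-0`, matching two entries of the directory listing quoted in the commit message.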
Linus Torvalds	82708bb1eb	for-5.19-rc3-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmK4dV4ACgkQxWXV+ddt WDs4uQ/7B0XqPK05NJntJfwnuIoT/yOreKf47wt/6DyFV3CDMFte/qzaZwthwu6P F0GMpSYAlVszLlML5elvF9VXymlV+e+QROtbD6QCNLNW1IwHA7ZiF5fV/a1Rj930 XSuaDyVFPAK7892RR6yMQ20IeMBuvqiAhXWEzaIJ2tIcAHn+fP+VkY8Nc0aZj3iC mI+ep4n93karDxmnHVGUxJTxAe0l/uNopx+fYBWQDj7HuoMLo0Cu+rAdv0gRIxi2 RWUBkR4e4PBwV1OFScwNCsljjt6bHdUHrtdB3fo5Hzu9cO5hHdL7NEsKB1K2w7rV bgNuNqfj6Y4xUBchAfQO5CCJ9ISci5KoJ4RBpk6EprZR3QN40kN8GPlhi2519K7w F3d8jolDDHlkqxIsqoe47MYOcSepNEadVNsiYKb0rM6doilfxyXiu6dtTFMrC8Vy K2HDCdTyuIgw+TnwqT1puaUwxiIL8DFJf1CVyjwGuQ4UgaIEkHXKIsCssyyJ76Jh QkWX1aeRldbfkVArJWHQWqDQopx9pFBz1gjlws0YjAsU5YijOOXva464P9Rxg+Gq 4pRlgnO48joQam9bRirP2Z6yhqa4O6jkzKDOXSYduAUYD7IMfpsYnz09wKS95jj+ QCrR7VmKnpQdsXg5a/mqyacfIH30ph002VywRxPiFM89Syd25yo= =rUrf -----END PGP SIGNATURE----- Merge tag 'for-5.19-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - zoned relocation fixes: - fix critical section end for extent writeback, this could lead to out of order write - prevent writing to previous data relocation block group if space gets low - reflink fixes: - fix race between reflinking and ordered extent completion - proper error handling when block reserve migration fails - add missing inode iversion/mtime/ctime updates on each iteration when replacing extents - fix deadlock when running fsync/fiemap/commit at the same time - fix false-positive KCSAN report regarding pid tracking for read locks and data race - minor documentation update and link to new site * tag 'for-5.19-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Documentation: update btrfs list of features and link to readthedocs.io btrfs: fix deadlock with fsync+fiemap+transaction commit btrfs: don't set lock_owner when locking extent buffer for reading btrfs: zoned: fix critical section of relocation inode writeback btrfs: zoned: prevent allocation 
from previous data relocation BG btrfs: do not BUG_ON() on failure to migrate space when replacing extents btrfs: add missing inode updates on each iteration when replacing extents btrfs: fix race between reflinking and ordered extent completion	2022-06-26 10:11:36 -07:00
Linus Torvalds	ff872b76b3	for-5.19-rc3-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmKxvkkACgkQxWXV+ddt WDsQYhAAofZGaOdBwSDvGA4srB2ieDIFoMeNb1NYp2P5vafPo3Q5AAvgGAeKhp5x g2C7W/8q2GMJ+B9SjyiBkVufuQmCWbFKxStQM3QysYoj/EyKyp7SXtO4YMWHz2T3 nfMMlPo2aNpr7Z2s+tcjhthq/hIvVFi6kweRFNvacM2bb/17IxgAdqLpQBqK5xe9 /IGSUTw75jSd2sZSyzBqrqshKDonmJ7u4qCV2X5hTPi8w4AUDERJrm0bOnikNXHx 4LnNDmSIA0BEXybHwEAShoK0ge66z1kP1UspQNB7pKriJcyroNPjgm/fMZJiRKIc zEYEMSzTYQa5eDwhXCz5PCaPqY4y/ovfYCsmySVXt1a7wgplVl+vsOaesE2NFVCX FE36d58L+4I8iTJhpVCNmEU9N/spfvAr3mBAcKCkbp9WKyGJ9/2yJpRThkV8Pw2Y bzhFIYRs1CJvkK7P4Cp+FSfzJx6tvYAqblvE97VUt83PuqS1Fb49lKdr5DZnbplV vDkewmvXSmHH9Ic5xBeTJXJZ+yeibk/0LSNEKczWva6f60h0ubF0OI6BzmS+NZbN HyitKerX0ZyFi5VUOZ+PKzXfR3ZlX3SmjAcHrl9BjZjFOJkpxAx6yWBzdnkitb+O fYyT68H4IetxwkghPVBv8qFCkuNy/i9NsEILcAAXd8CHGQlfwDA= =eORM -----END PGP SIGNATURE----- Merge tag 'for-5.19-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - print more error messages for invalid mount option values - prevent remount with v1 space cache for subpage filesystem - fix hang during unmount when block group reclaim task is running * tag 'for-5.19-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: add error messages to all unrecognized mount options btrfs: prevent remounting to v1 space cache for subpage mount btrfs: fix hang during unmount when block group reclaim task is running	2022-06-21 12:06:04 -05:00
Josef Bacik	bf7ba8ee75	btrfs: fix deadlock with fsync+fiemap+transaction commit We are hitting the following deadlock in production occasionally Task 1 Task 2 Task 3 Task 4 Task 5 fsync(A) start trans start commit falloc(A) lock 5m-10m start trans wait for commit fiemap(A) lock 0-10m wait for 5m-10m (have 0-5m locked) have btrfs_need_log_full_commit !full_sync wait_ordered_extents finish_ordered_io(A) lock 0-5m DEADLOCK We have an existing dependency of file extent lock -> transaction. However in fsync if we tried to do the fast logging, but then had to fall back to committing the transaction, we will be forced to call btrfs_wait_ordered_range() to make sure all of our extents are updated. This creates a dependency of transaction -> file extent lock, because btrfs_finish_ordered_io() will need to take the file extent lock in order to run the ordered extents. Fix this by stopping the transaction if we have to do the full commit and we attempted to do the fast logging. Then attach to the transaction and commit it if we need to. CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-06-21 14:47:08 +02:00
Zygo Blaxell	97e86631bc	btrfs: don't set lock_owner when locking extent buffer for reading In `196d59ab9c` "btrfs: switch extent buffer tree lock to rw_semaphore" the functions for tree read locking were rewritten, and in the process the read lock functions started setting eb->lock_owner = current->pid. Previously lock_owner was only set in tree write lock functions. Read locks are shared, so they don't have exclusive ownership of the underlying object, so setting lock_owner to any single value for a read lock makes no sense. It's mostly harmless because write locks and read locks are mutually exclusive, and none of the existing code in btrfs (btrfs_init_new_buffer and print_eb_refs_lock) cares what nonsense is written in lock_owner when no writer is holding the lock. KCSAN does care, and will complain about the data race incessantly. Remove the assignments in the read lock functions because they're useless noise. Fixes: `196d59ab9c` ("btrfs: switch extent buffer tree lock to rw_semaphore") CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Signed-off-by: David Sterba <dsterba@suse.com>	2022-06-21 14:46:56 +02:00
Naohiro Aota	19ab78ca86	btrfs: zoned: fix critical section of relocation inode writeback We use btrfs_zoned_data_reloc_{lock,unlock} to allow only one process to write out to the relocation inode. That critical section must include all the IO submission for the inode. However, flush_write_bio() in extent_writepages() is out of the critical section, causing an IO submission outside of the lock. This leads to out-of-order IO submission and fails the relocation process. Fix it by extending the critical section. Fixes: `35156d8527` ("btrfs: zoned: only allow one process to add pages to a relocation inode") CC: stable@vger.kernel.org # 5.16+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-06-21 14:46:30 +02:00
Naohiro Aota	343d8a3085	btrfs: zoned: prevent allocation from previous data relocation BG After commit `5f0addf7b8` ("btrfs: zoned: use dedicated lock for data relocation"), we observe IO errors on e.g, btrfs/232 like below. [09.0][T4038707] WARNING: CPU: 3 PID: 4038707 at fs/btrfs/extent-tree.c:2381 btrfs_cross_ref_exist+0xfc/0x120 [btrfs] <snip> [09.9][T4038707] Call Trace: [09.5][T4038707] <TASK> [09.3][T4038707] run_delalloc_nocow+0x7f1/0x11a0 [btrfs] [09.6][T4038707] ? test_range_bit+0x174/0x320 [btrfs] [09.2][T4038707] ? fallback_to_cow+0x980/0x980 [btrfs] [09.3][T4038707] ? find_lock_delalloc_range+0x33e/0x3e0 [btrfs] [09.5][T4038707] btrfs_run_delalloc_range+0x445/0x1320 [btrfs] [09.2][T4038707] ? test_range_bit+0x320/0x320 [btrfs] [09.4][T4038707] ? lock_downgrade+0x6a0/0x6a0 [09.2][T4038707] ? orc_find.part.0+0x1ed/0x300 [09.5][T4038707] ? __module_address.part.0+0x25/0x300 [09.0][T4038707] writepage_delalloc+0x159/0x310 [btrfs] <snip> [09.4][ C3] sd 10:0:1:0: [sde] tag#2620 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s [09.5][ C3] sd 10:0:1:0: [sde] tag#2620 Sense Key : Illegal Request [current] [09.9][ C3] sd 10:0:1:0: [sde] tag#2620 Add. Sense: Unaligned write command [09.5][ C3] sd 10:0:1:0: [sde] tag#2620 CDB: Write(16) 8a 00 00 00 00 00 02 f3 63 87 00 00 00 2c 00 00 [09.4][ C3] critical target error, dev sde, sector 396041272 op 0x1:(WRITE) flags 0x800 phys_seg 3 prio class 0 [09.9][ C3] BTRFS error (device dm-1): bdev /dev/mapper/dml_102_2 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0 The IO errors occur when we allocate a regular extent in previous data relocation block group. On zoned btrfs, we use a dedicated block group to relocate a data extent. Thus, we allocate relocating data extents (pre-alloc) only from the dedicated block group and vice versa. Once the free space in the dedicated block group gets tight, a relocating extent may not fit into the block group. 
In that case, we need to switch the dedicated block group to the next one. Then, the previous one is now freed up for allocating a regular extent. The BG no longer has enough space for the relocating extent, but there is still room to allocate a smaller extent. Now the problem happens. By allocating a regular extent while nocow IOs for the relocation are still ongoing, we will issue WRITE IOs (for relocation) and ZONE APPEND IOs (for the regular writes) at the same time. These mixed IOs confuse the write pointer and cause the unaligned write errors. This commit introduces a new bit 'zoned_data_reloc_ongoing' to the btrfs_block_group. We set this bit before releasing the dedicated block group, and no extents are allocated from a block group having this bit set. This bit is similar to setting block_group->ro, but differs from it by allowing nocow writes to start. Once all the nocow IO for relocation is done (hooked from btrfs_finish_ordered_io), we reset the bit to release the block group for further allocation. Fixes: `c2707a2556` ("btrfs: zoned: add a dedicated data relocation block group") CC: stable@vger.kernel.org # 5.16+ Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-06-21 14:43:48 +02:00
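The effect of the 'zoned_data_reloc_ongoing' bit on allocation can be sketched roughly as follows. This is a simplified userspace model, not the btrfs allocator: `struct block_group`, its fields, and `pick_block_group()` are hypothetical stand-ins, and only the rule "do not allocate from a block group while the bit is set" is taken from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, much-simplified stand-in for a btrfs block group. */
struct block_group {
    unsigned long long free_bytes;
    bool reloc_ongoing;   /* models the 'zoned_data_reloc_ongoing' bit */
};

/*
 * Return the index of the first block group that has room for `bytes`
 * and is not flagged as having relocation IO in flight, or -1 if none.
 */
static int pick_block_group(const struct block_group *bgs, size_t n,
                            unsigned long long bytes)
{
    assert(bgs != NULL);

    for (size_t i = 0; i < n; i++) {
        if (bgs[i].reloc_ongoing)
            continue;   /* skip: nocow relocation writes may still be pending */
        if (bgs[i].free_bytes >= bytes)
            return (int)i;
    }
    return -1;
}
```

In this model, clearing `reloc_ongoing` corresponds to the point where, per the commit message, all nocow IO for relocation has finished and the block group is released for further allocation.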
Filipe Manana	650c9caba3	btrfs: do not BUG_ON() on failure to migrate space when replacing extents At btrfs_replace_file_extents(), if we fail to migrate reserved metadata space from the transaction block reserve into the local block reserve, we trigger a BUG_ON(). This is because it should not be possible to have a failure here, as we reserved more space when we started the transaction than the space we want to migrate. However having a BUG_ON() is way too drastic, we can perfectly handle the failure and return the error to the caller. So just do that instead, and add a WARN_ON() to make it easier to notice the failure if it ever happens (which is particularly useful for fstests, and the warning will trigger a failure of a test case). Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-06-21 14:43:27 +02:00
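The change described above, replacing a fatal BUG_ON() with a WARN_ON() plus an error return, follows a common pattern that can be sketched in userspace. Everything here is illustrative: `struct reserve`, `warn_on()`, `migrate()` and `replace_step()` are hypothetical stand-ins rather than btrfs or kernel functions, and `-ENOSPC` is an assumed error code for the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-in for a block reserve; not the btrfs structure. */
struct reserve {
    unsigned long long reserved;
};

/* Userspace stand-in for WARN_ON(): log when cond is true, return cond. */
static int warn_on(int cond, const char *what)
{
    if (cond)
        fprintf(stderr, "WARNING: %s\n", what);
    return cond;
}

/* Move `bytes` from src to dst; fail with -ENOSPC if src holds less. */
static int migrate(struct reserve *src, struct reserve *dst,
                   unsigned long long bytes)
{
    assert(src != NULL && dst != NULL);

    if (src->reserved < bytes)
        return -ENOSPC;
    src->reserved -= bytes;
    dst->reserved += bytes;
    return 0;
}

/* Instead of aborting on failure, warn loudly and hand the error back. */
static int replace_step(struct reserve *trans_rsv, struct reserve *local_rsv,
                        unsigned long long bytes)
{
    int ret = migrate(trans_rsv, local_rsv, bytes);

    if (warn_on(ret != 0, "failed to migrate reserved space"))
        return ret;
    return 0;
}
```

The warning keeps the unexpected failure visible, which is the property the commit message highlights for fstests, while the caller still receives an error it can handle.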
Filipe Manana	983d8209c6	btrfs: add missing inode updates on each iteration when replacing extents When replacing file extents, called during fallocate, hole punching, clone and deduplication, we may not be able to replace/drop all the target file extent items with a single transaction handle. We may get -ENOSPC while doing it, in which case we release the transaction handle, balance the dirty pages of the btree inode, flush delayed items and get a new transaction handle to operate on what's left of the target range. By dropping and replacing file extent items we have effectively modified the inode, so we should bump its iversion and update its mtime/ctime before we update the inode item. This is because if the transaction we used for partially modifying the inode gets committed by someone after we release it and before we finish the rest of the range, and a power failure then happens, after mounting the filesystem our inode has an outdated iversion and mtime/ctime, corresponding to the values it had before we changed it. So add the missing iversion and mtime/ctime updates. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-06-21 14:43:21 +02:00


11395 Commits