mirror of
https://mirrors.bfsu.edu.cn/git/linux.git
synced 2024-11-13 14:24:11 +08:00
626dfed5fa
710 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
Josef Bacik
|
c4e54a6571 |
btrfs: replace clearing extent buffer dirty bit with btrfs_clean_block
Now that we're passing in the trans into btrfs_clean_tree_block, we can easily roll in the handling of the !trans case and replace all occurrences of if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)) clear_extent_buffer_dirty(eb); with btrfs_tree_lock(eb); btrfs_clean_tree_block(eb); btrfs_tree_unlock(eb); We need the lock because if we are actually dirty we need to make sure we aren't racing with anything that's starting writeout currently. This also makes sure that we're accounting fs_info->dirty_metadata_bytes appropriately. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Josef Bacik
|
ed25dab3a0 |
btrfs: add trans argument to btrfs_clean_tree_block
We check the header generation in the extent buffer against the current running transaction id to see if it's safe to clear DIRTY on this buffer. Generally speaking if we're clearing the buffer dirty we're holding the transaction open, but in the case of cleaning up an aborted transaction we don't, so we have extra checks in that path to check the transid. To allow for a future cleanup go ahead and pass in the trans handle so we don't have to rely on ->running_transaction being set. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
235e1c7b87 |
btrfs: use a single variable to track return value for log_dir_items()
We currently use 'ret' and 'err' to track the return value for log_dir_items(), which is confusing and likely the cause for previous bugs where log_dir_items() did not return an error when it should, fixed in previous patches. So change this and use only a single variable, 'ret', to track the return value. This is simpler and makes it similar to most of the existing code. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
5cce1780dc |
btrfs: use a negative value for BTRFS_LOG_FORCE_COMMIT
Currently we use the value 1 for BTRFS_LOG_FORCE_COMMIT, but that value has a few inconveniences: 1) If it's ever used by btrfs_log_inode(), or any function down the call chain, we have to remember to btrfs_set_log_full_commit(), which is repetitive and has a chance to be forgotten in future use cases. btrfs_log_inode_parent() only calls btrfs_set_log_full_commit() when it gets a negative value from btrfs_log_inode(); 2) Down the call chain of btrfs_log_inode(), we may have functions that need to force a log commit, but can return either an error (negative value), false (0) or true (1). So they are forced to return some random negative to force a log commit - using BTRFS_LOG_FORCE_COMMIT would make the intention more clear. Currently the only example is flush_dir_items_batch(). So turn BTRFS_LOG_FORCE_COMMIT into a negative value. The chosen value is -(MAX_ERRNO + 1), so that it does not overlap any errno value and makes it easier to debug. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
6afaed53cc |
btrfs: simplify update of last_dir_index_offset when logging a directory
When logging a directory, we always set the inode's last_dir_index_offset to the offset of the last dir index item we found. This is using an extra field in the log context structure, and it makes more sense to update it only after we insert dir index items, and we could directly update the inode's last_dir_index_offset field instead. So make this simpler by updating the inode's last_dir_index_offset only when we actually insert dir index keys in the log tree, and getting rid of the last_dir_item_offset field in the log context structure. Reported-by: David Arendt <admin@prnet.org> Link: https://lore.kernel.org/linux-btrfs/ae169fc6-f504-28f0-a098-6fa6a4dfb612@leemhuis.info/ Reported-by: Maxim Mikityanskiy <maxtram95@gmail.com> Link: https://lore.kernel.org/linux-btrfs/Y8voyTXdnPDz8xwY@mail.gmail.com/ Reported-by: Hunter Wardlaw <wardlawhunter@gmail.com> Link: https://bugzilla.suse.com/show_bug.cgi?id=1207231 Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=216851 CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
09e44868f1 |
btrfs: do not abort transaction on failure to update log root
When syncing a log, if we fail to update a log root in the log root tree, we are aborting the transaction if the failure was not -ENOSPC. This is excessive because there is a chance that a transaction commit can succeed, and therefore avoid to turn the filesystem into RO mode. All we need to be careful about is to mark the log for a full commit, which we already do, to make sure no one commits a super block pointing to an outdated log root tree. So don't abort the transaction if we fail to update a log root in the log root tree, and log an error if the failure is not -ENOSPC, so that it does not go completely unnoticed. CC: stable@vger.kernel.org # 6.0+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
16199ad9eb |
btrfs: do not abort transaction on failure to write log tree when syncing log
When syncing the log, if we fail to write log tree extent buffers, we mark the log for a full commit and abort the transaction. However we don't need to abort the transaction, all we really need to do is to make sure no one can commit a superblock pointing to new log tree roots. Just because we got a failure writing extent buffers for a log tree, it does not mean we will also fail to do a transaction commit. One particular case is if due to a bug somewhere, when writing log tree extent buffers, the tree checker detects some corruption and the writeout fails because of that. Aborting the transaction can be very disruptive for a user, specially if the issue happened on a root filesystem. One example is the scenario in the Link tag below, where an isolated corruption on log tree leaves was causing transaction aborts when syncing the log. Link: https://lore.kernel.org/linux-btrfs/ae169fc6-f504-28f0-a098-6fa6a4dfb612@leemhuis.info/ CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
94cd63ae67 |
btrfs: add missing setup of log for full commit at add_conflicting_inode()
When logging conflicting inodes, if we reach the maximum limit of inodes,
we return BTRFS_LOG_FORCE_COMMIT to force a transaction commit. However
we don't mark the log for full commit (with btrfs_set_log_full_commit()),
which means that once we leave the log transaction and before we commit
the transaction, some other task may sync the log, which is incomplete
as we have not logged all conflicting inodes, leading to some inconsistent
in case that log ends up being replayed.
So also call btrfs_set_log_full_commit() at add_conflicting_inode().
Fixes:
|
||
Filipe Manana
|
8bb6898da6 |
btrfs: fix directory logging due to race with concurrent index key deletion
Sometimes we log a directory without holding its VFS lock, so while we logging it, dir index entries may be added or removed. This typically happens when logging a dentry from a parent directory that points to a new directory, through log_new_dir_dentries(), or when while logging some other inode we also need to log its parent directories (through btrfs_log_all_parents()). This means that while we are at log_dir_items(), we may not find a dir index key we found before, because it was deleted in the meanwhile, so a call to btrfs_search_slot() may return 1 (key not found). In that case we return from log_dir_items() with a success value (the variable 'err' has a value of 0). This can lead to a few problems, specially in the case where the variable 'last_offset' has a value of (u64)-1 (and it's initialized to that when it was declared): 1) By returning from log_dir_items() with success (0) and a value of (u64)-1 for '*last_offset_ret', we end up not logging any other dir index keys that follow the missing, just deleted, index key. The (u64)-1 value makes log_directory_changes() not call log_dir_items() again; 2) Before returning with success (0), log_dir_items(), will log a dir index range item covering a range from the last old dentry index (stored in the variable 'last_old_dentry_offset') to the value of 'last_offset'. If 'last_offset' has a value of (u64)-1, then it means if the log is persisted and replayed after a power failure, it will cause deletion of all the directory entries that have an index number between last_old_dentry_offset + 1 and (u64)-1; 3) We can end up returning from log_dir_items() with ctx->last_dir_item_offset having a lower value than inode->last_dir_index_offset, because the former is set to the current key we are processing at process_dir_items_leaf(), and at the end of log_directory_changes() we set inode->last_dir_index_offset to the current value of ctx->last_dir_item_offset. So if for example a deletion of a lower dir index key happened, we set ctx->last_dir_item_offset to that index value, then if we return from log_dir_items() because btrfs_search_slot() returned 1, we end up returning from log_dir_items() with success (0) and then log_directory_changes() sets inode->last_dir_index_offset to a lower value than it had before. This can result in unpredictable and unexpected behaviour when we need to log again the directory in the same transaction, and can result in ending up with a log tree leaf that has duplicated keys, as we do batch insertions of dir index keys into a log tree. So fix this by making log_dir_items() move on to the next dir index key if it does not find the one it was looking for. Reported-by: David Arendt <admin@prnet.org> Link: https://lore.kernel.org/linux-btrfs/ae169fc6-f504-28f0-a098-6fa6a4dfb612@leemhuis.info/ CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
6d3d970b27 |
btrfs: fix missing error handling when logging directory items
When logging a directory, at log_dir_items(), if we get an error when
attempting to search the subvolume tree for a dir index item, we end up
returning 0 (success) from log_dir_items() because 'err' is left with a
value of 0.
This can lead to a few problems, specially in the case the variable
'last_offset' has a value of (u64)-1 (and it's initialized to that when
it was declared):
1) By returning from log_dir_items() with success (0) and a value of
(u64)-1 for '*last_offset_ret', we end up not logging any other dir
index keys that follow the missing, just deleted, index key. The
(u64)-1 value makes log_directory_changes() not call log_dir_items()
again;
2) Before returning with success (0), log_dir_items(), will log a dir
index range item covering a range from the last old dentry index
(stored in the variable 'last_old_dentry_offset') to the value of
'last_offset'. If 'last_offset' has a value of (u64)-1, then it means
if the log is persisted and replayed after a power failure, it will
cause deletion of all the directory entries that have an index number
between last_old_dentry_offset + 1 and (u64)-1;
3) We can end up returning from log_dir_items() with
ctx->last_dir_item_offset having a lower value than
inode->last_dir_index_offset, because the former is set to the current
key we are processing at process_dir_items_leaf(), and at the end of
log_directory_changes() we set inode->last_dir_index_offset to the
current value of ctx->last_dir_item_offset. So if for example a
deletion of a lower dir index key happened, we set
ctx->last_dir_item_offset to that index value, then if we return from
log_dir_items() because btrfs_search_slot() returned an error, we end up
returning without any error from log_dir_items() and then
log_directory_changes() sets inode->last_dir_index_offset to a lower
value than it had before.
This can result in unpredictable and unexpected behaviour when we
need to log again the directory in the same transaction, and can result
in ending up with a log tree leaf that has duplicated keys, as we do
batch insertions of dir index keys into a log tree.
Fix this by setting 'err' to the value of 'ret' in case
btrfs_search_slot() or btrfs_previous_item() returned an error. That will
result in falling back to a full transaction commit.
Reported-by: David Arendt <admin@prnet.org>
Link: https://lore.kernel.org/linux-btrfs/ae169fc6-f504-28f0-a098-6fa6a4dfb612@leemhuis.info/
Fixes:
|
||
Filipe Manana
|
fee4c19937 |
btrfs: fix fscrypt name leak after failure to join log transaction
When logging a new name, we don't expect to fail joining a log transaction
since we know at least one of the inodes was logged before in the current
transaction. However if we fail for some unexpected reason, we end up not
freeing the fscrypt name we previously allocated. So fix that by freeing
the name in case we failed to join a log transaction.
Fixes:
|
||
Filipe Manana
|
3eb4234424 |
btrfs: remove outdated logic from overwrite_item() and add assertion
As of commit
|
||
Filipe Manana
|
3a8d1db341 |
btrfs: unify overwrite_item() and do_overwrite_item()
After commit |
||
Christoph Hellwig
|
103c19723c |
btrfs: split the bio submission path into a separate file
The code used by btrfs_submit_bio only interacts with the rest of volumes.c through __btrfs_map_block (which itself is a more generic version of two exported helpers) and does not really have anything to do with volumes.c. Create a new bio.c file and a bio.h header going along with it for the btrfs_bio-based storage layer, which will grow even more going forward. Also update the file with my copyright notice given that a large part of the moved code was written or rewritten by me. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Qu Wenruo
|
97e3823933 |
btrfs: introduce a bitmap based csum range search function
Although we have an existing function, btrfs_lookup_csums_range(), to find all data checksums for a range, it's based on a btrfs_ordered_sum list. For the incoming RAID56 data checksum verification at RMW time, we don't want to waste time by allocating temporary memory. So this patch will introduce a new helper, btrfs_lookup_csums_bitmap(). It will use bitmap based result, which will be a perfect fit for later RAID56 usage. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Qu Wenruo
|
789d6a3a87 |
btrfs: concentrate all tree block parentness check parameters into one structure
There are several different tree block parentness check parameters used across several helpers: - level Mandatory - transid Under most cases it's mandatory, but there are several backref cases which skips this check. - owner_root - first_key Utilized by most top-down tree search routine. Otherwise can be skipped. Those four members are not always mandatory checks, and some of them are the same u64, which means if some arguments got swapped compiler will not catch it. Furthermore if we're going to further expand the parentness check, we need to modify quite some helpers just to add one more parameter. This patch will concentrate all these members into a structure called btrfs_tree_parent_check, and pass that structure for the following helpers: - btrfs_read_extent_buffer() - read_tree_block() Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
David Sterba
|
e55cf7ca85 |
btrfs: pass btrfs_inode to btrfs_add_delayed_iput
The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Josef Bacik
|
aa5d3003dd |
btrfs: move orphan prototypes into orphan.h
Move these out of ctree.h into orphan.h to cut down on code in ctree.h. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Josef Bacik
|
af142b6f44 |
btrfs: move file prototypes to file.h
Move these out of ctree.h into file.h to cut down on code in ctree.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Josef Bacik
|
7c8ede1628 |
btrfs: move file-item prototypes into their own header
Move these prototypes out of ctree.h and into file-item.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Josef Bacik
|
f2b39277b8 |
btrfs: move dir-item prototypes into dir-item.h
Move these prototypes out of ctree.h and into their own header file. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
David Sterba
|
43dd529abe |
btrfs: update function comments
Update, reformat or reword function comments. This also removes the kdoc marker so we don't get reports when the function name is missing. Changes made: - remove kdoc markers - reformat the brief description to be a proper sentence - reword to imperative voice - align parameter list - fix typos Signed-off-by: David Sterba <dsterba@suse.com> |
||
Josef Bacik
|
45c40c8f95 |
btrfs: move root tree prototypes to their own header
Move all the root-tree.c prototypes to root-tree.h, and then update all the necessary files to include the new header. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Josef Bacik
|
a0231804af |
btrfs: move extent-tree helpers into their own header file
Move all the extent tree related prototypes to extent-tree.h out of ctree.h, and then go include it everywhere needed so everything compiles. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Omar Sandoval
|
94a48aef49 |
btrfs: extend btrfs_dir_item type to store encryption status
For directories with encrypted files/filenames, we need to store a flag indicating this fact. There's no room in other fields, so we'll need to borrow a bit from dir_type. Since it's now a combination of type and flags, we rename it to dir_flags to reflect its new usage. The new flag, FT_ENCRYPTED, indicates a directory containing encrypted data, which is orthogonal to file type; therefore, add the new flag, and make conversion from directory type to file type strip the flag. As the file types almost never change we can afford to use the bits. Actual usage will be guarded behind an incompat bit, this patch only adds the support for later use by fscrypt. Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Sweet Tea Dorminy
|
6db7531882 |
btrfs: use struct fscrypt_str instead of struct qstr
While struct qstr is more natural without fscrypt, since it's provided by dentries, struct fscrypt_str is provided by the fscrypt handlers processing dentries, and is thus more natural in the fscrypt world. Replace all of the struct qstr uses with struct fscrypt_str. Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Sweet Tea Dorminy
|
ab3c5c18e8 |
btrfs: setup qstr from dentrys using fscrypt helper
Most places where we get a struct qstr, we are doing so from a dentry. With fscrypt, the dentry's name may be encrypted on-disk, so fscrypt provides a helper to convert a dentry name to the appropriate disk name if necessary. Convert each of the dentry name accesses to use fscrypt_setup_filename(), then convert the resulting fscrypt_name back to an unencrypted qstr. This does not work for nokey names, but the specific locations that could spawn nokey names are noted. At present, since there are no encrypted directories, nothing goes down the filename encryption paths. Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Sweet Tea Dorminy
|
e43eec81c5 |
btrfs: use struct qstr instead of name and namelen pairs
Many functions throughout btrfs take name buffer and name length arguments. Most of these functions at the highest level are usually called with these arguments extracted from a supplied dentry's name. But the entire name can be passed instead, making each function a little more elegant. Each function whose arguments are currently the name and length extracted from a dentry is herein converted to instead take a pointer to the name in the dentry. The couple of calls to these calls without a struct dentry are converted to create an appropriate qstr to pass in. Additionally, every function which is only called with a name/len extracted directly from a qstr is also converted. This change has positive effect on stack consumption, frame of many functions is reduced but this will be used in the future for fscrypt related structures. Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
David Sterba
|
e2896e7910 |
btrfs: sink gfp_t parameter to btrfs_qgroup_trace_extent
All callers pass GFP_NOFS, we can drop the parameter and use it directly. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Josef Bacik
|
ad1ac5012c |
btrfs: move btrfs_map_token to accessors
This is specific to the item-accessor code, move it out of ctree.h into accessor.h/.c and then update the users to include the new header file. This un-inlines btrfs_init_map_token, however this is only called once per function so it's not critical to be inlined. This also saves 904 bytes of code on a release build. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Josef Bacik
|
c7f13d428e |
btrfs: move fs wide helpers out of ctree.h
We have several fs wide related helpers in ctree.h. The bulk of these are the incompat flag test helpers, but there are things such as btrfs_fs_closing() and the read only helpers that also aren't directly related to the ctree code. Move these into a fs.h header, which will serve as the location for file system wide related helpers. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
796787c978 |
btrfs: do not modify log tree while holding a leaf from fs tree locked
When logging an inode in full mode, or when logging xattrs or when logging the dir index items of a directory, we are modifying the log tree while holding a read lock on a leaf from the fs/subvolume tree. This can lead to a deadlock in rare circumstances, but it is a real possibility, and it was recently reported by syzbot with the following trace from lockdep: WARNING: possible circular locking dependency detected 6.1.0-rc5-next-20221116-syzkaller #0 Not tainted ------------------------------------------------------ syz-executor.1/16154 is trying to acquire lock: ffff88807e3084a0 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0xa1/0xf30 fs/btrfs/delayed-inode.c:256 but task is already holding lock: ffff88807df33078 (btrfs-log-00){++++}-{3:3}, at: __btrfs_tree_lock+0x32/0x3d0 fs/btrfs/locking.c:197 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (btrfs-log-00){++++}-{3:3}: down_read_nested+0x9e/0x450 kernel/locking/rwsem.c:1634 __btrfs_tree_read_lock+0x32/0x350 fs/btrfs/locking.c:135 btrfs_tree_read_lock fs/btrfs/locking.c:141 [inline] btrfs_read_lock_root_node+0x82/0x3a0 fs/btrfs/locking.c:280 btrfs_search_slot_get_root fs/btrfs/ctree.c:1678 [inline] btrfs_search_slot+0x3ca/0x2c70 fs/btrfs/ctree.c:1998 btrfs_lookup_csum+0x116/0x3f0 fs/btrfs/file-item.c:209 btrfs_csum_file_blocks+0x40e/0x1370 fs/btrfs/file-item.c:1021 log_csums.isra.0+0x244/0x2d0 fs/btrfs/tree-log.c:4258 copy_items.isra.0+0xbfb/0xed0 fs/btrfs/tree-log.c:4403 copy_inode_items_to_log+0x13d6/0x1d90 fs/btrfs/tree-log.c:5873 btrfs_log_inode+0xb19/0x4680 fs/btrfs/tree-log.c:6495 btrfs_log_inode_parent+0x890/0x2a20 fs/btrfs/tree-log.c:6982 btrfs_log_dentry_safe+0x59/0x80 fs/btrfs/tree-log.c:7083 btrfs_sync_file+0xa41/0x13c0 fs/btrfs/file.c:1921 vfs_fsync_range+0x13e/0x230 fs/sync.c:188 generic_write_sync include/linux/fs.h:2856 [inline] iomap_dio_complete+0x73a/0x920 fs/iomap/direct-io.c:128 btrfs_direct_write fs/btrfs/file.c:1536 [inline] btrfs_do_write_iter+0xba2/0x1470 fs/btrfs/file.c:1668 call_write_iter include/linux/fs.h:2160 [inline] do_iter_readv_writev+0x20b/0x3b0 fs/read_write.c:735 do_iter_write+0x182/0x700 fs/read_write.c:861 vfs_iter_write+0x74/0xa0 fs/read_write.c:902 iter_file_splice_write+0x745/0xc90 fs/splice.c:686 do_splice_from fs/splice.c:764 [inline] direct_splice_actor+0x114/0x180 fs/splice.c:931 splice_direct_to_actor+0x335/0x8a0 fs/splice.c:886 do_splice_direct+0x1ab/0x280 fs/splice.c:974 do_sendfile+0xb19/0x1270 fs/read_write.c:1255 __do_sys_sendfile64 fs/read_write.c:1323 [inline] __se_sys_sendfile64 fs/read_write.c:1309 [inline] __x64_sys_sendfile64+0x259/0x2c0 fs/read_write.c:1309 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd -> #1 (btrfs-tree-00){++++}-{3:3}: __lock_release kernel/locking/lockdep.c:5382 [inline] lock_release+0x371/0x810 kernel/locking/lockdep.c:5688 up_write+0x2a/0x520 kernel/locking/rwsem.c:1614 btrfs_tree_unlock_rw fs/btrfs/locking.h:189 [inline] btrfs_unlock_up_safe+0x1e3/0x290 fs/btrfs/locking.c:238 search_leaf fs/btrfs/ctree.c:1832 [inline] btrfs_search_slot+0x265e/0x2c70 fs/btrfs/ctree.c:2074 btrfs_insert_empty_items+0xbd/0x1c0 fs/btrfs/ctree.c:4133 btrfs_insert_delayed_item+0x826/0xfa0 fs/btrfs/delayed-inode.c:746 btrfs_insert_delayed_items fs/btrfs/delayed-inode.c:824 [inline] __btrfs_commit_inode_delayed_items fs/btrfs/delayed-inode.c:1111 [inline] __btrfs_run_delayed_items+0x280/0x590 fs/btrfs/delayed-inode.c:1153 flush_space+0x147/0xe90 fs/btrfs/space-info.c:728 btrfs_async_reclaim_metadata_space+0x541/0xc10 fs/btrfs/space-info.c:1086 process_one_work+0x9bf/0x1710 kernel/workqueue.c:2289 worker_thread+0x669/0x1090 kernel/workqueue.c:2436 kthread+0x2e8/0x3a0 kernel/kthread.c:376 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308 -> #0 (&delayed_node->mutex){+.+.}-{3:3}: check_prev_add kernel/locking/lockdep.c:3097 [inline] check_prevs_add kernel/locking/lockdep.c:3216 [inline] validate_chain kernel/locking/lockdep.c:3831 [inline] __lock_acquire+0x2a43/0x56d0 kernel/locking/lockdep.c:5055 lock_acquire kernel/locking/lockdep.c:5668 [inline] lock_acquire+0x1e3/0x630 kernel/locking/lockdep.c:5633 __mutex_lock_common kernel/locking/mutex.c:603 [inline] __mutex_lock+0x12f/0x1360 kernel/locking/mutex.c:747 __btrfs_release_delayed_node.part.0+0xa1/0xf30 fs/btrfs/delayed-inode.c:256 __btrfs_release_delayed_node fs/btrfs/delayed-inode.c:251 [inline] btrfs_release_delayed_node fs/btrfs/delayed-inode.c:281 [inline] btrfs_remove_delayed_node+0x52/0x60 fs/btrfs/delayed-inode.c:1285 btrfs_evict_inode+0x511/0xf30 fs/btrfs/inode.c:5554 evict+0x2ed/0x6b0 fs/inode.c:664 dispose_list+0x117/0x1e0 fs/inode.c:697 prune_icache_sb+0xeb/0x150 fs/inode.c:896 super_cache_scan+0x391/0x590 fs/super.c:106 do_shrink_slab+0x464/0xce0 mm/vmscan.c:843 shrink_slab_memcg mm/vmscan.c:912 [inline] shrink_slab+0x388/0x660 mm/vmscan.c:991 shrink_node_memcgs mm/vmscan.c:6088 [inline] shrink_node+0x93d/0x1f30 mm/vmscan.c:6117 shrink_zones mm/vmscan.c:6355 [inline] do_try_to_free_pages+0x3b4/0x17a0 mm/vmscan.c:6417 try_to_free_mem_cgroup_pages+0x3a4/0xa70 mm/vmscan.c:6732 reclaim_high.constprop.0+0x182/0x230 mm/memcontrol.c:2393 mem_cgroup_handle_over_high+0x190/0x520 mm/memcontrol.c:2578 try_charge_memcg+0xe0c/0x12f0 mm/memcontrol.c:2816 try_charge mm/memcontrol.c:2827 [inline] charge_memcg+0x90/0x3b0 mm/memcontrol.c:6889 __mem_cgroup_charge+0x2b/0x90 mm/memcontrol.c:6910 mem_cgroup_charge include/linux/memcontrol.h:667 [inline] __filemap_add_folio+0x615/0xf80 mm/filemap.c:852 filemap_add_folio+0xaf/0x1e0 mm/filemap.c:934 __filemap_get_folio+0x389/0xd80 mm/filemap.c:1976 pagecache_get_page+0x2e/0x280 mm/folio-compat.c:104 find_or_create_page include/linux/pagemap.h:612 [inline] alloc_extent_buffer+0x2b9/0x1580 fs/btrfs/extent_io.c:4588 btrfs_init_new_buffer fs/btrfs/extent-tree.c:4869 [inline] btrfs_alloc_tree_block+0x2e1/0x1320 fs/btrfs/extent-tree.c:4988 __btrfs_cow_block+0x3b2/0x1420 fs/btrfs/ctree.c:440 btrfs_cow_block+0x2fa/0x950 fs/btrfs/ctree.c:595 btrfs_search_slot+0x11b0/0x2c70 fs/btrfs/ctree.c:2038 btrfs_update_root+0xdb/0x630 fs/btrfs/root-tree.c:137 update_log_root fs/btrfs/tree-log.c:2841 [inline] btrfs_sync_log+0xbfb/0x2870 fs/btrfs/tree-log.c:3064 btrfs_sync_file+0xdb9/0x13c0 fs/btrfs/file.c:1947 vfs_fsync_range+0x13e/0x230 fs/sync.c:188 generic_write_sync include/linux/fs.h:2856 [inline] iomap_dio_complete+0x73a/0x920 fs/iomap/direct-io.c:128 btrfs_direct_write fs/btrfs/file.c:1536 [inline] btrfs_do_write_iter+0xba2/0x1470 fs/btrfs/file.c:1668 call_write_iter include/linux/fs.h:2160 [inline] do_iter_readv_writev+0x20b/0x3b0 fs/read_write.c:735 do_iter_write+0x182/0x700 fs/read_write.c:861 vfs_iter_write+0x74/0xa0 fs/read_write.c:902 iter_file_splice_write+0x745/0xc90 fs/splice.c:686 do_splice_from fs/splice.c:764 [inline] direct_splice_actor+0x114/0x180 fs/splice.c:931 splice_direct_to_actor+0x335/0x8a0 fs/splice.c:886 do_splice_direct+0x1ab/0x280 fs/splice.c:974 do_sendfile+0xb19/0x1270 fs/read_write.c:1255 __do_sys_sendfile64 fs/read_write.c:1323 [inline] __se_sys_sendfile64 fs/read_write.c:1309 [inline] __x64_sys_sendfile64+0x259/0x2c0 fs/read_write.c:1309 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd other info that might help us debug this: Chain exists of: &delayed_node->mutex --> btrfs-tree-00 --> btrfs-log-00 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(btrfs-log-00); lock(btrfs-tree-00); lock(btrfs-log-00); lock(&delayed_node->mutex); Holding a read lock on a leaf from a fs/subvolume tree creates a nasty lock dependency when we are COWing extent buffers for the log tree and we have two tasks modifying the log tree, with each one in one of the following 2 scenarios: 1) Modifying the log tree triggers an extent buffer allocation while holding a write lock on a parent extent buffer from the log tree. Allocating the pages for an extent buffer, or the extent buffer struct, can trigger inode eviction and finally the inode eviction will trigger a release/remove of a delayed node, which requires taking the delayed node's mutex; 2) Allocating a metadata extent for a log tree can trigger the async reclaim thread and make us wait for it to release enough space and unblock our reservation ticket. The reclaim thread can start flushing delayed items, and that in turn results in the need to lock delayed node mutexes and in the need to write lock extent buffers of a subvolume tree - all this while holding a write lock on the parent extent buffer in the log tree. So one task in scenario 1) running in parallel with another task in scenario 2) could lead to a deadlock, one wanting to lock a delayed node mutex while having a read lock on a leaf from the subvolume, while the other is holding the delayed node's mutex and wants to write lock the same subvolume leaf for flushing delayed items. Fix this by cloning the leaf of the fs/subvolume tree, release/unlock the fs/subvolume leaf and use the clone leaf instead. Reported-by: syzbot+9b7c21f486f5e7f8d029@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/000000000000ccc93c05edc4d8cf@google.com/ CC: stable@vger.kernel.org # 6.0+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Josef Bacik
|
26ce911446 |
btrfs: make can_nocow_extent nowait compatible
If we have NOWAIT specified on our IOCB and we're writing into a PREALLOC or NOCOW extent then we need to be able to tell can_nocow_extent that we don't want to wait on any locks or metadata IO. Fix can_nocow_extent to allow for NOWAIT. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Josef Bacik
|
570eb97bac |
btrfs: unify the lock/unlock extent variants
We have two variants of lock/unlock extent, one set that takes a cached state, another that does not. This is slightly annoying, and generally speaking there are only a few places where we don't have a cached state. Simplify this by making lock_extent/unlock_extent the only variant and make it take a cached state, then convert all the callers appropriately. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
7059c65831 |
btrfs: simplify adding and replacing references during log replay
During log replay, when adding/replacing inode references, there are two special cases that have special code for them: 1) When we have an inode with two or more hardlinks in the same directory, therefore two or more names encoded in the same inode reference item, and one of the hard links gets renamed to the old name of another hard link - that is, the index number for a name changes. This was added in commit |
||
Filipe Manana
|
30b80f3ce0 |
btrfs: use delayed items when logging a directory
When logging a directory we start by flushing all its delayed items. That results in adding dir index items to the subvolume btree, for new dentries, and removing dir index items from the subvolume btree for any dentries that were deleted. This makes it straightforward to log a directory simply by iterating over all the modified subvolume btree leaves, especially when we used to log both dir index keys and dir item keys (before commit |
||
Filipe Manana
|
5557a069f3 |
btrfs: skip logging parent dir when conflicting inode is not a dir
When we find a conflicting inode (an inode that had the same name and parent directory as the inode we are logging now) that was deleted in the current transaction, we always end up logging its parent directory. This is to deal with the case where the conflicting inode corresponds to a deleted subvolume/snapshot or a directory that had subvolumes/snapshots (or some subdirectory inside it had subvolumes/snapshots, etc), because we can't deal with dropping subvolumes/snapshots during log replay. So if we log the parent directory, and if we are dealing with these special cases, then we fallback to a transaction commit when logging the parent, because its last_unlink_trans will match the current transaction (which gets set and propagated when a subvolume/snapshot is deleted). This change skips the logging of the parent directory when the conflicting inode is not a directory (or a subvolume/snapshot). This is ok because in this case logging the current inode is enough to trigger an unlink of the conflicting inode during log replay. So for a case like this: $ mkdir /mnt/dir $ echo -n "first foo data" > /mnt/dir/foo $ sync $ rm -f /mnt/dir/foo $ echo -n "second foo data" > /mnt/dir/foo $ xfs_io -c "fsync" /mnt/dir/foo We avoid logging parent directory "dir" when logging the new file "foo". In other cases it avoids falling back to a transaction commit, when the parent directory has a last_unlink_trans value that matches the current transaction, due to moving a file from it to some other directory. This is a case that happens frequently with dbench for example, where a new file that has the name/parent of another file that was deleted in the current transaction, is fsynced. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
e09d94c9e4 |
btrfs: log conflicting inodes without holding log mutex of the initial inode
When logging an inode, if we detect the inode has a reference that conflicts with some other inode that got renamed, we log that other inode while holding the log mutex of the current inode. We then find out if there are other inodes that conflict with the first conflicting inode, and log them while under the log mutex of the original inode. This is fine because the recursion can only happen once. For the upcoming work where we directly log delayed items without flushing them first to the subvolume tree, this recursion adds a lot of complexity and it's hard to keep lockdep happy about it. So collect a list of conflicting inodes and then log the inodes after unlocking the log mutex of the inode we started with. Also limit the maximum number of conflict inodes we log to 10, to avoid spending too much time logging (and maybe allocating too many list elements too), as typically we don't have more than 1 or 2 conflicting inodes - if we go over the limit, simply fallback to a transaction commit. It is possible to have a very long list of conflicting inodes to be intentionally created by a user if he/she creates a very long succession of renames like this: (...) rename E to F rename D to E rename C to D rename B to C rename A to B touch A (create a new file named A) fsync A If that happened for a sequence of hundreds or thousands of renames, it could massively slow down the logging and cause other secondary effects like for example blocking other fsync operations and transaction commits for a very long time (assuming it wouldn't run into -ENOSPC or -ENOMEM first). However such cases are very uncommon to happen in practice, nevertheless it's better to be prepared for them and avoid chaos. Such long sequence of conflicting inodes could be created before this change. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
f6d86dbeba |
btrfs: move log_new_dir_dentries() above btrfs_log_inode()
The static function log_new_dir_dentries() is currently defined below btrfs_log_inode(), but in an upcoming patch a new function is introduced that is called by btrfs_log_inode() and this new function needs to call log_new_dir_dentries(). So move log_new_dir_dentries() to a location between btrfs_log_inode() and need_log_inode() (the later is called by log_new_dir_dentries()). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
a375102426 |
btrfs: move need_log_inode() to above log_conflicting_inodes()
The static function need_log_inode() is defined below btrfs_log_inode() and log_conflicting_inodes(), but in the next patches in the series we will need to call need_log_inode() in a couple new functions that will be used by btrfs_log_inode(). So move its definition to a location above log_conflicting_inodes(). Also make its arguments 'const', since they are not supposed to be modified. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
193df62457 |
btrfs: search for last logged dir index if it's not cached in the inode
The key offset of the last dir index item that was logged is stored in the inode's last_dir_index_offset field. However that field is not persisted in the inode item or elsewhere, so if the inode gets evicted and reloaded, it gets a value of (u64)-1, so that when we are logging dir index items we check if they were logged before, to avoid attempts to insert duplicated keys and fallback to a transaction commit. Improve on this by searching for the last dir index that was logged when we start logging a directory if the inode's last_dir_index_offset is not set (has a value of (u64)-1) and it was logged before. This avoids checking if each dir index item we find was already logged before, and simplifies the logging of dir index items (process_dir_items_leaf()). This will also be needed for an incoming change where we start logging delayed items directly, without flushing them first. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
009d9bea49 |
btrfs: avoid memory allocation at log_new_dir_dentries() for common case
At log_new_dir_dentries() we always start by allocating a list element for the starting inode and then do a while loop with the condition being a list emptiness check. This however is not needed, we can avoid allocating this initial list element and then just check for the list emptiness at the end of the loop's body. So just do that to save one memory allocation from the kmalloc-32 slab. This allows for not doing any memory allocation when we don't have any subdirectory to log, which is a very common case. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
4008481343 |
btrfs: free list element sooner at log_new_dir_dentries()
At log_new_dir_dentries(), there's no need to keep the current list element allocated while processing the leaves with directory items for the current directory, and while logging other inodes. Plus in case we find a subdirectory, we also end up allocating a new list element while the current one is still allocated, temporarily using more memory than necessary. So free the current list element early on, before processing leaves. Also make the removal and release of all list elements in case of an error more simple by eliminating the label and goto, adding an explicit loop to release all list elements in case an error happens. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
b96c552b99 |
btrfs: update stale comment for log_new_dir_dentries()
The comment refers to the function log_dir_items() in order to check why
the inodes of new directory entries need to be logged, but the relevant
comments are no longer at log_dir_items(), they were moved to the function
process_dir_items_leaf() in commit
|
||
Filipe Manana
|
8786a6d740 |
btrfs: remove the root argument from log_new_dir_dentries()
There's no point in passing a root argument to log_new_dir_dentries() because it always corresponds to the root of the given inode. So remove it and extract the root from the given inode. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
04fc7d5123 |
btrfs: don't drop dir index range items when logging a directory
When logging a directory that was previously logged in the current
transaction, we drop all the range items (BTRFS_DIR_LOG_INDEX_KEY key
type). This is because we will process all leaves in the subvolume's tree
that were changed in the current transaction and then add range items for
covering new dir index items and deleted dir index items, which could
cover now a larger range than before.
We used to fail if we tried to insert a range item key that already
exists, so we dropped all range items to avoid failing. However nowadays,
since commit
|
||
Omar Sandoval
|
d1f68ba069 |
btrfs: rename btrfs_insert_file_extent() to btrfs_insert_hole_extent()
btrfs_insert_file_extent() is only ever used to insert holes, so rename it and remove the redundant parameters. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
769030e118 |
btrfs: fix warning during log replay when bumping inode link count
During log replay, at add_link(), we may increment the link count of another inode that has a reference that conflicts with a new reference for the inode currently being processed. During log replay, at add_link(), we may drop (unlink) a reference from some inode in the subvolume tree if that reference conflicts with a new reference found in the log for the inode we are currently processing. After the unlink, If the link count has decreased from 1 to 0, then we increment the link count to prevent the inode from being deleted if it's evicted by an iput() call, because we may have references to add to that inode later on (and we will fixup its link count later during log replay). However incrementing the link count from 0 to 1 triggers a warning: $ cat fs/inode.c (...) void inc_nlink(struct inode *inode) { if (unlikely(inode->i_nlink == 0)) { WARN_ON(!(inode->i_state & I_LINKABLE)); atomic_long_dec(&inode->i_sb->s_remove_count); } (...) The I_LINKABLE flag is only set when creating an O_TMPFILE file, so it's never set during log replay. Most of the time, the warning isn't triggered even if we dropped the last reference of the conflicting inode, and this is because: 1) The conflicting inode was previously marked for fixup, through a call to link_to_fixup_dir(), which increments the inode's link count; 2) And the last iput() on the inode has not triggered eviction of the inode, nor was eviction triggered after the iput(). So at add_link(), even if we unlink the last reference of the inode, its link count ends up being 1 and not 0. So this means that if eviction is triggered after link_to_fixup_dir() is called, at add_link() we will read the inode back from the subvolume tree and have it with a correct link count, matching the number of references it has on the subvolume tree. So if when we are at add_link() the inode has exactly one reference only, its link count is 1, and after the unlink its link count becomes 0. So fix this by using set_nlink() instead of inc_nlink(), as the former accepts a transition from 0 to 1 and it's what we use in other similar contexts (like at link_to_fixup_dir(). Also make add_inode_ref() use set_nlink() instead of inc_nlink() to bump the link count from 0 to 1. The warning is actually harmless, but it may scare users. Josef also ran into it recently. CC: stable@vger.kernel.org # 5.1+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
7a6b75b799 |
btrfs: fix lost error handling when looking up extended ref on log replay
During log replay, when processing inode references, if we get an error
when looking up for an extended reference at __add_inode_ref(), we ignore
it and proceed, returning success (0) if no other error happens after the
lookup. This is obviously wrong because in case an extended reference
exists and it encodes some name not in the log, we need to unlink it,
otherwise the filesystem state will not match the state it had after the
last fsync.
So just make __add_inode_ref() return an error it gets from the extended
reference lookup.
Fixes:
|
||
Filipe Manana
|
723df2bcc9 |
btrfs: join running log transaction when logging new name
When logging a new name, in case of a rename, we pin the log before changing it. We then either delete a directory entry from the log or insert a key range item to mark the old name for deletion on log replay. However when doing one of those log changes we may have another task that started writing out the log (at btrfs_sync_log()) and it started before we pinned the log root. So we may end up changing a log tree while its writeback is being started by another task syncing the log. This can lead to inconsistencies in a log tree and other unexpected results during log replay, because we can get some committed node pointing to a node/leaf that ends up not getting written to disk before the next log commit. The problem, conceptually, started to happen in commit |
||
Josef Bacik
|
f31f09f6be |
btrfs: tree-log: make the return value for log syncing consistent
Currently we will return 1 or -EAGAIN if we decide we need to commit the transaction rather than sync the log. In practice this doesn't really matter, we interpret any !0 and !BTRFS_NO_LOG_SYNC as needing to commit the transaction. However this makes it hard to figure out what the correct thing to do is. Fix this up by defining BTRFS_LOG_FORCE_COMMIT and using this in all the places where we want to force the transaction to be committed. CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
David Sterba
|
143823cf4d |
btrfs: fix typos in comments
Codespell has found a few typos. Signed-off-by: David Sterba <dsterba@suse.com> |
||
Lv Ruyi
|
8aa1e49ea1 |
btrfs: remove unnecessary check of iput argument
iput() already handles NULL and non-NULL parameter, so it is not needed to check that. This unifies all iput calls. Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: Lv Ruyi <lv.ruyi@zte.com.cn> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
6a2e9dc46f |
btrfs: remove trivial wrapper btrfs_read_buffer()
The function btrfs_read_buffer() is useless, it just calls btree_read_extent_buffer_pages() with exactly the same arguments. So remove it and rename btree_read_extent_buffer_pages() to btrfs_read_extent_buffer(), which is a shorter name, has the "btrfs_" prefix (since it's used outside disk-io.c) and the name is clear enough about what it does. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
750ee45490 |
btrfs: fix assertion failure when logging directory key range item
When inserting a key range item (BTRFS_DIR_LOG_INDEX_KEY) while logging
a directory, we don't expect the insertion to fail with -EEXIST, because
we are holding the directory's log_mutex and we have dropped all existing
BTRFS_DIR_LOG_INDEX_KEY keys from the log tree before we started to log
the directory. However it's possible that during the logging we attempt
to insert the same BTRFS_DIR_LOG_INDEX_KEY key twice, but for this to
happen we need to race with insertions of items from other inodes in the
subvolume's tree while we are logging a directory. Here's how this can
happen:
1) We are logging a directory with inode number 1000 that has its items
spread across 3 leaves in the subvolume's tree:
leaf A - has index keys from the range 2 to 20 for example. The last
item in the leaf corresponds to a dir item for index number 20. All
these dir items were created in a past transaction.
leaf B - has index keys from the range 22 to 100 for example. It has
no keys from other inodes, all its keys are dir index keys for our
directory inode number 1000. Its first key is for the dir item with
a sequence number of 22. All these dir items were also created in a
past transaction.
leaf C - has index keys for our directory for the range 101 to 120 for
example. This leaf also has items from other inodes, and its first
item corresponds to the dir item for index number 101 for our directory
with inode number 1000;
2) When we finish processing the items from leaf A at log_dir_items(),
we log a BTRFS_DIR_LOG_INDEX_KEY key with an offset of 21 and a last
offset of 21, meaning the log is authoritative for the index range
from 21 to 21 (a single sequence number). At this point leaf B was
not yet modified in the current transaction;
3) When we return from log_dir_items() we have released our read lock on
leaf B, and have set *last_offset_ret to 21 (index number of the first
item on leaf B minus 1);
4) Some other task inserts an item for other inode (inode number 1001 for
example) into leaf C. That resulted in pushing some items from leaf C
into leaf B, in order to make room for the new item, so now leaf B
has dir index keys for the sequence number range from 22 to 102 and
leaf C has the dir items for the sequence number range 103 to 120;
5) At log_directory_changes() we call log_dir_items() again, passing it
a 'min_offset' / 'min_key' value of 22 (*last_offset_ret from step 3
plus 1, so 21 + 1). Then btrfs_search_forward() leaves us at slot 0
of leaf B, since leaf B was modified in the current transaction.
We have also initialized 'last_old_dentry_offset' to 20 after calling
btrfs_previous_item() at log_dir_items(), as it left us at the last
item of leaf A, which refers to the dir item with sequence number 20;
6) We then call process_dir_items_leaf() to process the dir items of
leaf B, and when we process the first item, corresponding to slot 0,
sequence number 22, we notice the dir item was created in a past
transaction and its sequence number is greater than the value of
*last_old_dentry_offset + 1 (20 + 1), so we decide to log again a
BTRFS_DIR_LOG_INDEX_KEY key with an offset of 21 and an end range
of 21 (key.offset - 1 == 22 - 1 == 21), which results in an -EEXIST
error from insert_dir_log_key(), as we have already inserted that
key at step 2, triggering the assertion at process_dir_items_leaf().
The trace produced in dmesg is like the following:
assertion failed: ret != -EEXIST, in fs/btrfs/tree-log.c:3857
[198255.980839][ T7460] ------------[ cut here ]------------
[198255.981666][ T7460] kernel BUG at fs/btrfs/ctree.h:3617!
[198255.983141][ T7460] invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
[198255.984080][ T7460] CPU: 0 PID: 7460 Comm: repro-ghost-dir Not tainted 5.18.0-5314c78ac373-misc-next+
[198255.986027][ T7460] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
[198255.988600][ T7460] RIP: 0010:assertfail.constprop.0+0x1c/0x1e
[198255.989465][ T7460] Code: 8b 4c 89 (...)
[198255.992599][ T7460] RSP: 0018:ffffc90007387188 EFLAGS: 00010282
[198255.993414][ T7460] RAX: 000000000000003d RBX: 0000000000000065 RCX: 0000000000000000
[198255.996056][ T7460] RDX: 0000000000000001 RSI: ffffffff8b62b180 RDI: fffff52000e70e24
[198255.997668][ T7460] RBP: ffffc90007387188 R08: 000000000000003d R09: ffff8881f0e16507
[198255.999199][ T7460] R10: ffffed103e1c2ca0 R11: 0000000000000001 R12: 00000000ffffffef
[198256.000683][ T7460] R13: ffff88813befc630 R14: ffff888116c16e70 R15: ffffc90007387358
[198256.007082][ T7460] FS: 00007fc7f7c24640(0000) GS:ffff8881f0c00000(0000) knlGS:0000000000000000
[198256.009939][ T7460] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[198256.014133][ T7460] CR2: 0000560bb16d0b78 CR3: 0000000140b34005 CR4: 0000000000170ef0
[198256.015239][ T7460] Call Trace:
[198256.015674][ T7460] <TASK>
[198256.016313][ T7460] log_dir_items.cold+0x16/0x2c
[198256.018858][ T7460] ? replay_one_extent+0xbf0/0xbf0
[198256.025932][ T7460] ? release_extent_buffer+0x1d2/0x270
[198256.029658][ T7460] ? rcu_read_lock_sched_held+0x16/0x80
[198256.031114][ T7460] ? lock_acquired+0xbe/0x660
[198256.032633][ T7460] ? rcu_read_lock_sched_held+0x16/0x80
[198256.034386][ T7460] ? lock_release+0xcf/0x8a0
[198256.036152][ T7460] log_directory_changes+0xf9/0x170
[198256.036993][ T7460] ? log_dir_items+0xba0/0xba0
[198256.037661][ T7460] ? do_raw_write_unlock+0x7d/0xe0
[198256.038680][ T7460] btrfs_log_inode+0x233b/0x26d0
[198256.041294][ T7460] ? log_directory_changes+0x170/0x170
[198256.042864][ T7460] ? btrfs_attach_transaction_barrier+0x60/0x60
[198256.045130][ T7460] ? rcu_read_lock_sched_held+0x16/0x80
[198256.046568][ T7460] ? lock_release+0xcf/0x8a0
[198256.047504][ T7460] ? lock_downgrade+0x420/0x420
[198256.048712][ T7460] ? ilookup5_nowait+0x81/0xa0
[198256.049747][ T7460] ? lock_downgrade+0x420/0x420
[198256.050652][ T7460] ? do_raw_spin_unlock+0xa9/0x100
[198256.051618][ T7460] ? __might_resched+0x128/0x1c0
[198256.052511][ T7460] ? __might_sleep+0x66/0xc0
[198256.053442][ T7460] ? __kasan_check_read+0x11/0x20
[198256.054251][ T7460] ? iget5_locked+0xbd/0x150
[198256.054986][ T7460] ? run_delayed_iput_locked+0x110/0x110
[198256.055929][ T7460] ? btrfs_iget+0xc7/0x150
[198256.056630][ T7460] ? btrfs_orphan_cleanup+0x4a0/0x4a0
[198256.057502][ T7460] ? free_extent_buffer+0x13/0x20
[198256.058322][ T7460] btrfs_log_inode+0x2654/0x26d0
[198256.059137][ T7460] ? log_directory_changes+0x170/0x170
[198256.060020][ T7460] ? rcu_read_lock_sched_held+0x16/0x80
[198256.060930][ T7460] ? rcu_read_lock_sched_held+0x16/0x80
[198256.061905][ T7460] ? lock_contended+0x770/0x770
[198256.062682][ T7460] ? btrfs_log_inode_parent+0xd04/0x1750
[198256.063582][ T7460] ? lock_downgrade+0x420/0x420
[198256.064432][ T7460] ? preempt_count_sub+0x18/0xc0
[198256.065550][ T7460] ? __mutex_lock+0x580/0xdc0
[198256.066654][ T7460] ? stack_trace_save+0x94/0xc0
[198256.068008][ T7460] ? __kasan_check_write+0x14/0x20
[198256.072149][ T7460] ? __mutex_unlock_slowpath+0x12a/0x430
[198256.073145][ T7460] ? mutex_lock_io_nested+0xcd0/0xcd0
[198256.074341][ T7460] ? wait_for_completion_io_timeout+0x20/0x20
[198256.075345][ T7460] ? lock_downgrade+0x420/0x420
[198256.076142][ T7460] ? lock_contended+0x770/0x770
[198256.076939][ T7460] ? do_raw_spin_lock+0x1c0/0x1c0
[198256.078401][ T7460] ? btrfs_sync_file+0x5e6/0xa40
[198256.080598][ T7460] btrfs_log_inode_parent+0x523/0x1750
[198256.081991][ T7460] ? wait_current_trans+0xc8/0x240
[198256.083320][ T7460] ? lock_downgrade+0x420/0x420
[198256.085450][ T7460] ? btrfs_end_log_trans+0x70/0x70
[198256.086362][ T7460] ? rcu_read_lock_sched_held+0x16/0x80
[198256.087544][ T7460] ? lock_release+0xcf/0x8a0
[198256.088305][ T7460] ? lock_downgrade+0x420/0x420
[198256.090375][ T7460] ? dget_parent+0x8e/0x300
[198256.093538][ T7460] ? do_raw_spin_lock+0x1c0/0x1c0
[198256.094918][ T7460] ? lock_downgrade+0x420/0x420
[198256.097815][ T7460] ? do_raw_spin_unlock+0xa9/0x100
[198256.101822][ T7460] ? dget_parent+0xb7/0x300
[198256.103345][ T7460] btrfs_log_dentry_safe+0x48/0x60
[198256.105052][ T7460] btrfs_sync_file+0x629/0xa40
[198256.106829][ T7460] ? start_ordered_ops.constprop.0+0x120/0x120
[198256.109655][ T7460] ? __fget_files+0x161/0x230
[198256.110760][ T7460] vfs_fsync_range+0x6d/0x110
[198256.111923][ T7460] ? start_ordered_ops.constprop.0+0x120/0x120
[198256.113556][ T7460] __x64_sys_fsync+0x45/0x70
[198256.114323][ T7460] do_syscall_64+0x5c/0xc0
[198256.115084][ T7460] ? syscall_exit_to_user_mode+0x3b/0x50
[198256.116030][ T7460] ? do_syscall_64+0x69/0xc0
[198256.116768][ T7460] ? do_syscall_64+0x69/0xc0
[198256.117555][ T7460] ? do_syscall_64+0x69/0xc0
[198256.118324][ T7460] ? sysvec_call_function_single+0x57/0xc0
[198256.119308][ T7460] ? asm_sysvec_call_function_single+0xa/0x20
[198256.120363][ T7460] entry_SYSCALL_64_after_hwframe+0x44/0xae
[198256.121334][ T7460] RIP: 0033:0x7fc7fe97b6ab
[198256.122067][ T7460] Code: 0f 05 48 (...)
[198256.125198][ T7460] RSP: 002b:00007fc7f7c23950 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[198256.126568][ T7460] RAX: ffffffffffffffda RBX: 00007fc7f7c239f0 RCX: 00007fc7fe97b6ab
[198256.127942][ T7460] RDX: 0000000000000002 RSI: 000056167536bcf0 RDI: 0000000000000004
[198256.129302][ T7460] RBP: 0000000000000004 R08: 0000000000000000 R09: 000000007ffffeb8
[198256.130670][ T7460] R10: 00000000000001ff R11: 0000000000000293 R12: 0000000000000001
[198256.132046][ T7460] R13: 0000561674ca8140 R14: 00007fc7f7c239d0 R15: 000056167536dab8
[198256.133403][ T7460] </TASK>
Fix this by treating -EEXIST as expected at insert_dir_log_key() and have
it update the item with an end offset corresponding to the maximum between
the previously logged end offset and the new requested end offset. The end
offsets may be different due to dir index key deletions that happened as
part of unlink operations while we are logging a directory (triggered when
fsyncing some other inode parented by the directory) or during renames
which always attempt to log a single dir index deletion.
Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Link: https://lore.kernel.org/linux-btrfs/YmyefE9mc2xl5ZMz@hungrycats.org/
Fixes:
|
||
Filipe Manana
|
d0e64a981f |
btrfs: always log symlinks in full mode
On Linux, empty symlinks are invalid, and attempting to create one with the system call symlink(2) results in an -ENOENT error and this is explicitly documented in the man page. If we rename a symlink that was created in the current transaction and its parent directory was logged before, we actually end up logging the symlink without logging its content, which is stored in an inline extent. That means that after a power failure we can end up with an empty symlink, having no content and an i_size of 0 bytes. It can be easily reproduced like this: $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt $ mkdir /mnt/testdir $ sync # Create a file inside the directory and fsync the directory. $ touch /mnt/testdir/foo $ xfs_io -c "fsync" /mnt/testdir # Create a symlink inside the directory and then rename the symlink. $ ln -s /mnt/testdir/foo /mnt/testdir/bar $ mv /mnt/testdir/bar /mnt/testdir/baz # Now fsync again the directory, this persist the log tree. $ xfs_io -c "fsync" /mnt/testdir <power failure> $ mount /dev/sdc /mnt $ stat -c %s /mnt/testdir/baz 0 $ readlink /mnt/testdir/baz $ Fix this by always logging symlinks in full mode (LOG_INODE_ALL), so that their content is also logged. A test case for fstests will follow. CC: stable@vger.kernel.org # 4.9+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
50ff57888d |
btrfs: fix leaked plug after failure syncing log on zoned filesystems
On a zoned filesystem, if we fail to allocate the root node for the log
root tree while syncing the log, we end up returning without finishing
the IO plug we started before, resulting in leaking resources as we
have started writeback for extent buffers of a log tree before. That
allocation failure, which typically is either -ENOMEM or -ENOSPC, is not
fatal and the fsync can safely fallback to a full transaction commit.
So release the IO plug if we fail to allocate the extent buffer for the
root of the log root tree when syncing the log on a zoned filesystem.
Fixes:
|
||
Filipe Manana
|
313ab75399 |
btrfs: add and use helper for unlinking inode during log replay
During log replay there is this pattern of running delayed items after every inode unlink. To avoid repeating this several times, move the logic into an helper function and use it instead of calling btrfs_unlink_inode() followed by btrfs_run_delayed_items(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
23e3337faf |
btrfs: reset last_reflink_trans after fsyncing inode
When an inode has a last_reflink_trans matching the current transaction, we have to take special care when logging its checksums in order to avoid getting checksum items with overlapping ranges in a log tree, which could result in missing checksums after log replay (more on that in the changelogs of commit |
||
Filipe Manana
|
96acb3753e |
btrfs: voluntarily relinquish cpu when doing a full fsync
Doing a full fsync may require processing many leaves of metadata, which can take some time and result in a task monopolizing a cpu for too long. So add a cond_resched() after processing a leaf when doing a full fsync, while not holding any locks on any tree (a subvolume or a log tree). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
5b7ce5e287 |
btrfs: hold on to less memory when logging checksums during full fsync
When doing a full fsync, at copy_items(), we iterate over all extents and then collect their checksums into a list. After copying all the extents to the log tree, we then log all the previously collected checksums. Before the previous patch in the series (subject "btrfs: stop copying old file extents when doing a full fsync"), we had to do it this way, because while we were iterating over the items in the leaf of the subvolume tree, we were holding a write lock on a leaf of the log tree, so logging the checksums for an extent right after we collected them could result in a deadlock, in case the checksum items ended up in the same leaf. However after the previous patch in the series we now do a first iteration over all the items in the leaf of the subvolume tree before locking a path in the log tree, so we can now log the checksums right after we have obtained them. This avoids holding in memory all checksums for all extents in the leaf while copying items from the source leaf to the log tree. The amount of memory used to hold all checksums of the extents in a leaf can be significant. For example if a leaf has 200 file extent items referring to 1M extents, using the default crc32c checksums, would result in using over 200K of memory (not accounting for the extra overhead of struct btrfs_ordered_sum), with smaller or less extents it would be less, but it could be much more with more extents per leaf and/or much larger extents. So change copy_items() to log the checksums for an extent after looking them up, and then free their memory, as they are no longer necessary. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
7f30c07288 |
btrfs: stop copying old file extents when doing a full fsync
When logging an inode in full sync mode, we go over every leaf that was modified in the current transaction and has items associated to our inode, and then copy all those items into the log tree. This includes copying file extent items that were created and added to the inode in past transactions, which is useless and only makes use more leaf space in the log tree. It's common to have a file with many file extent items spanning many leaves where only a few file extent items are new and need to be logged, and in such case we log all the file extent items we find in the modified leaves. So change the full sync behaviour to skip over file extent items that are not needed. Those are the ones that match the following criteria: 1) Have a generation older than the current transaction and the inode was not a target of a reflink operation, as that can copy file extent items from a past generation from some other inode into our inode, so we have to log them; 2) Start at an offset within i_size - we must log anything at or beyond i_size, otherwise we would lose prealloc extents after log replay. The following script exercises a scenario where this happens, and it's somehow close enough to what happened often on a SQL Server workload which I had to debug sometime ago to fix an issue where a pattern of writes to prealloc extents and fsync resulted in fsync failing with -EIO (that was commit |
||
Filipe Manana
|
e1f53ed874 |
btrfs: prepare extents to be logged before locking a log tree path
When we want to log an extent, in the fast fsync path, we obtain a path to the leaf that will hold the file extent item either through a deletion search, via btrfs_drop_extents(), or through an insertion search using btrfs_insert_empty_item(). After that we fill the file extent item's fields one by one directly on the leaf. Instead of doing that, we could prepare the file extent item before obtaining a btree path, and then copy the prepared extent item with a single operation once we get the path. This helps avoid some contention on the log tree, since we are holding write locks for longer than necessary, especially in the case where the path is obtained via btrfs_drop_extents() through a deletion search, which always keeps a write lock on the nodes at levels 1 and 2 (besides the leaf). This change does that, we prepare the file extent item that is going to be inserted before acquiring a path, and then copy it into a leaf using a single copy operation once we get a path. This change if part of a patchset that is comprised of the following patches: 1/6 btrfs: remove unnecessary leaf free space checks when pushing items 2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf 3/6 btrfs: avoid unnecessary computation when deleting items from a leaf 4/6 btrfs: remove constraint on number of visited leaves when replacing extents 5/6 btrfs: remove useless path release in the fast fsync path 6/6 btrfs: prepare extents to be logged before locking a log tree path The following test was run to measure the impact of the whole patchset: $ cat test.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="-R free-space-tree -O no-holes" NUM_JOBS=8 FILE_SIZE=128M RUN_TIME=200 cat <<EOF > /tmp/fio-job.ini [writers] rw=randwrite fsync=1 fallocate=none group_reporting=1 direct=0 bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5 ioengine=sync filesize=$FILE_SIZE runtime=$RUN_TIME time_based directory=$MNT numjobs=$NUM_JOBS thread EOF echo "performance" | \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor echo echo "Using config:" echo cat /tmp/fio-job.ini echo umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT The test ran inside a VM (8 cores, 32G of RAM) with the target disk mapping to a raw NVMe device, and using a non-debug kernel config (Debian's default config). Before the patchset: WRITE: bw=116MiB/s (122MB/s), 116MiB/s-116MiB/s (122MB/s-122MB/s), io=22.7GiB (24.4GB), run=200013-200013msec After the patchset: WRITE: bw=125MiB/s (131MB/s), 125MiB/s-125MiB/s (131MB/s-131MB/s), io=24.3GiB (26.1GB), run=200007-200007msec A 7.8% gain on throughput and +7.0% more IO done in the same period of time (200 seconds). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
d845753170 |
btrfs: remove useless path release in the fast fsync path
There's no point in calling btrfs_release_path() after finishing the loop that logs the modified extents, since log_one_extent() returns with the path released. In case the list of extents is empty, the path is already released, so there's no need for that case as well. So just remove that unnecessary btrfs_release_path() call. This change if part of a patchset that is comprised of the following patches: 1/6 btrfs: remove unnecessary leaf free space checks when pushing items 2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf 3/6 btrfs: avoid unnecessary computation when deleting items from a leaf 4/6 btrfs: remove constraint on number of visited leaves when replacing extents 5/6 btrfs: remove useless path release in the fast fsync path 6/6 btrfs: prepare extents to be logged before locking a log tree path The last patch in the series has some performance test result in its changelog. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
65faced5b9 |
btrfs: use single variable to track return value at btrfs_log_inode()
At btrfs_log_inode(), we have two variables to track errors and the return value of the function, named 'ret' and 'err'. In some places we use 'ret' and if gets a non-zero value we assign its value to 'err' and then jump to the 'out' label, while in other places we use 'err' directly without 'ret' as an intermediary. This is inconsistent, error prone and not necessary. So change that to use only the 'ret' variable, making this consistent with most functions in btrfs. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
0f8ce49821 |
btrfs: avoid inode logging during rename and link when possible
During a rename or link operation, we need to determine if an inode was previously logged or not, and if it was, do some update to the logged inode. We used to rely exclusively on the logged_trans field of struct btrfs_inode to determine that, but that was not reliable because the value of that field is not persisted in the inode item, so it's lost when an inode is evicted and loaded back again. That led to several issues in the past, such as not persisting deletions (such as the case fixed by commit |
||
Filipe Manana
|
259c4b96d7 |
btrfs: stop doing unnecessary log updates during a rename
During a rename, we call __btrfs_unlink_inode(), which will call btrfs_del_inode_ref_in_log() and btrfs_del_dir_entries_in_log(), in order to remove an inode reference and a directory entry from the log. These are necessary when __btrfs_unlink_inode() is called from the unlink path, but not necessary when it's called from a rename context, because: 1) For the btrfs_del_inode_ref_in_log() call, it's pointless to delete the inode reference related to the old name, because later in the rename path we call btrfs_log_new_name(), which will drop all inode references from the log and copy all inode references from the subvolume tree to the log tree. So we are doing one unnecessary btree operation which adds additional latency and lock contention in case there are other tasks accessing the log tree; 2) For the btrfs_del_dir_entries_in_log() call, we are now doing the equivalent at btrfs_log_new_name() since the previous patch in the series, that has the subject "btrfs: avoid logging all directory changes during renames". In fact, having __btrfs_unlink_inode() call this function not only adds additional latency and lock contention due to the extra btree operation, but also can make btrfs_log_new_name() unnecessarily log a range item to track the deletion of the old name, since it has no way to known that the directory entry related to the old name was previously logged and already deleted by __btrfs_unlink_inode() through its call to btrfs_del_dir_entries_in_log(). So skip those calls at __btrfs_unlink_inode() when we are doing a rename. Skipping them also allows us now to reduce the duration of time we are pinning a log transaction during renames, which is always beneficial as it's not delaying so much other tasks trying to sync the log tree, in particular we end up not holding the log transaction pinned while adding the new name (adding inode ref, directory entry, etc). This change is part of a patchset comprised of the following patches: 1/5 btrfs: add helper to delete a dir entry from a log tree 2/5 btrfs: pass the dentry to btrfs_log_new_name() instead of the inode 3/5 btrfs: avoid logging all directory changes during renames 4/5 btrfs: stop doing unnecessary log updates during a rename 5/5 btrfs: avoid inode logging during rename and link when possible Just like the previous patch in the series, "btrfs: avoid logging all directory changes during renames", the following script mimics part of what a package installation/upgrade with zypper does, which is basically renaming a lot of files, in some directory under /usr, to a name with a suffix of "-RPMDELETE": $ cat test.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 NUM_FILES=10000 mkfs.btrfs -f $DEV mount $DEV $MNT mkdir $MNT/testdir for ((i = 1; i <= $NUM_FILES; i++)); do echo -n > $MNT/testdir/file_$i done sync # Do some change to testdir and fsync it. echo -n > $MNT/testdir/file_$((NUM_FILES + 1)) xfs_io -c "fsync" $MNT/testdir echo "Renaming $NUM_FILES files..." start=$(date +%s%N) for ((i = 1; i <= $NUM_FILES; i++)); do mv $MNT/testdir/file_$i $MNT/testdir/file_$i-RPMDELETE done end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "Renames took $dur milliseconds" umount $MNT Testing this change on box a using a non-debug kernel (Debian's default kernel config) gave the following results: NUM_FILES=10000, before patchset: 27399 ms NUM_FILES=10000, after patches 1/5 to 3/5 applied: 9093 ms (-66.8%) NUM_FILES=10000, after patches 1/5 to 4/5 applied: 9016 ms (-67.1%) NUM_FILES=5000, before patchset: 9241 ms NUM_FILES=5000, after patches 1/5 to 3/5 applied: 4642 ms (-49.8%) NUM_FILES=5000, after patches 1/5 to 4/5 applied: 4553 ms (-50.7%) NUM_FILES=2000, before patchset: 2550 ms NUM_FILES=2000, after patches 1/5 to 3/5 applied: 1788 ms (-29.9%) NUM_FILES=2000, after patches 1/5 to 4/5 applied: 1767 ms (-30.7%) NUM_FILES=1000, before patchset: 1088 ms NUM_FILES=1000, after patches 1/5 to 3/5 applied: 905 ms (-16.9%) NUM_FILES=1000, after patches 1/5 to 4/5 applied: 883 ms (-18.8%) The next patch in the series (5/5), also contains dbench results after applying to whole patchset. Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1193549 Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
88d2beec7e |
btrfs: avoid logging all directory changes during renames
When doing a rename of a file, if the file or its old parent directory were logged before, we log the new name of the file and then make sure we log the old parent directory, to ensure that after a log replay the old name of the file is deleted and the new name added. The logging of the old parent directory can take some time, because it will scan all leaves modified in the current transaction, check which directory entries were already logged, copy the ones that were not logged before, etc. In this rename context all we need to do is make sure that the old name of the file is deleted on log replay, so instead of triggering a directory log operation, we can just delete the old directory entry from the log if it's there, or in case it isn't there, just log a range item to signal log replay that the old name must be deleted. So change btrfs_log_new_name() to do that. This scenario is actually not uncommon to trigger, and recently on a 5.15 kernel, an openSUSE Tumbleweed user reported package installations and upgrades, with the zypper tool, were often taking a long time to complete, much more than usual. With strace it could be observed that zypper was spending over 99% of its time on rename operations, and then with further analysis we checked that directory logging was happening too frequently and causing high latencies for the rename operations. Taking into account that installation/upgrade of some of these packages needed about a few thousand file renames, the slowdown was very noticeable for the user. The issue was caused indirectly due to an excessive number of inode evictions on a 5.15 kernel, about 100x more compared to a 5.13, 5.14 or a 5.16-rc8 kernel. After an inode eviction we can't tell for sure, in an efficient way, if an inode was previously logged in the current transaction, so we are pessimistic and assume it was, because in case it was we need to update the logged inode. More details on that in one of the patches in the same series (subject "btrfs: avoid inode logging during rename and link when possible"). Either way, in case the parent directory was logged before, we currently do more work then necessary during a rename, and this change minimizes that amount of work. The following script mimics part of what a package installation/upgrade with zypper does, which is basically renaming a lot of files, in some directory under /usr, to a name with a suffix of "-RPMDELETE": $ cat test.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 NUM_FILES=10000 mkfs.btrfs -f $DEV mount $DEV $MNT mkdir $MNT/testdir for ((i = 1; i <= $NUM_FILES; i++)); do echo -n > $MNT/testdir/file_$i done sync # Do some change to testdir and fsync it. echo -n > $MNT/testdir/file_$((NUM_FILES + 1)) xfs_io -c "fsync" $MNT/testdir echo "Renaming $NUM_FILES files..." start=$(date +%s%N) for ((i = 1; i <= $NUM_FILES; i++)); do mv $MNT/testdir/file_$i $MNT/testdir/file_$i-RPMDELETE done end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "Renames took $dur milliseconds" umount $MNT Testing this change on box using a non-debug kernel (Debian's default kernel config) gave the following results: NUM_FILES=10000, before this patch: 27399 ms NUM_FILES=10000, after this patch: 9093 ms (-66.8%) NUM_FILES=5000, before this patch: 9241 ms NUM_FILES=5000, after this patch: 4642 ms (-49.8%) NUM_FILES=2000, before this patch: 2550 ms NUM_FILES=2000, after this patch: 1788 ms (-29.9%) NUM_FILES=1000, before this patch: 1088 ms NUM_FILES=1000, after this patch: 905 ms (-16.9%) Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1193549 Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
d5f5bd5465 |
btrfs: pass the dentry to btrfs_log_new_name() instead of the inode
In the next patch in the series, there will be the need to access the old name, and its length, of an inode when logging the inode during a rename. So instead of passing the inode to btrfs_log_new_name() pass the dentry, because from the dentry we can get the inode, the name and its length. This will avoid passing 3 new parameters to btrfs_log_new_name() in the next patch - the name, its length and an index number. This way we end up passing only 1 new parameter, the index number. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
839061fe88 |
btrfs: add helper to delete a dir entry from a log tree
Move the code that finds and deletes a logged dir entry out of btrfs_del_dir_entries_in_log() into a helper function. This new helper function will be used by another patch in the same series, and serves to avoid having duplicated logic. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
de6bc7f598 |
btrfs: stop trying to log subdirectories created in past transactions
When logging a directory we are trying to log subdirectories that were
changed in the current transaction and created in a past transaction.
This type of behaviour was introduced by commit
|
||
Filipe Manana
|
732d591a5d |
btrfs: stop copying old dir items when logging a directory
When logging a directory, we go over every leaf of the subvolume tree that was changed in the current transaction and copy all its dir index keys to the log tree. That includes copying dir index keys created in past transactions. This is done mostly for simplicity, as after logging the keys we log an item that specifies the start and end ranges of the keys we logged. That item is then used during log replay to figure out which keys need to be deleted - every key in that range that we find in the subvolume tree and is not in the log tree, needs to be deleted. Now that we log only dir index keys, and not dir item keys anymore, when we remove dentries from a directory (due to unlink and rename operations), we can get entire leaves that we changed only for deleting old dir index keys, or that have few dir index keys that are new - this is due to the fact that the offset for new index keys comes from a monotonically increasing counter. We can avoid logging dir index keys from past transactions, and in order to track the deletions, only log range items (BTRFS_DIR_LOG_INDEX_KEY key type) when we find gaps between consecutive index keys. This massively reduces the amount of logged metadata when we have deleted directory entries, even if it's a small percentage of the total number of entries. The reduction comes from both less items that are logged and instead of logging many dir index items (struct btrfs_dir_item), which have a size of 30 bytes plus a file name, we typically log just a few range items (struct btrfs_dir_log_item), which take only 8 bytes each. Even if no entries were deleted from a directory and only new entries were added, we typically still get a reduction on the amount of logged metadata, because it's very likely the first leaf that got the new dir index entries also has several old dir index entries. So change the logging logic to not log dir index keys created in past transactions and log a range item for every gap it finds between each pair of consecutive index keys, to ensure deletions are tracked and replayed on log replay. This patch is part of a patchset comprised of the following patches: 1/4 btrfs: don't log unnecessary boundary keys when logging directory 2/4 btrfs: put initial index value of a directory in a constant 3/4 btrfs: stop copying old dir items when logging a directory 4/4 btrfs: stop trying to log subdirectories created in past transactions The following test was run on a branch without this patchset and on a branch with the first three patches applied: $ cat test.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 NUM_FILES=1000000 NUM_FILE_DELETES=10000 MKFS_OPTIONS="-O no-holes -R free-space-tree" MOUNT_OPTIONS="-o ssd" mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT mkdir $MNT/testdir for ((i = 1; i <= $NUM_FILES; i++)); do echo -n > $MNT/testdir/file_$i done sync del_inc=$(( $NUM_FILES / $NUM_FILE_DELETES )) for ((i = 1; i <= $NUM_FILES; i += $del_inc)); do rm -f $MNT/testdir/file_$i done start=$(date +%s%N) xfs_io -c "fsync" $MNT/testdir end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "dir fsync took $dur ms after deleting $NUM_FILE_DELETES files" echo umount $MNT The test was run on a non-debug kernel (Debian's default kernel config), and the results were the following for various values of NUM_FILES and NUM_FILE_DELETES: ** before, NUM_FILES = 1 000 000, NUM_FILE_DELETES = 10 000 ** dir fsync took 585 ms after deleting 10000 files ** after, NUM_FILES = 1 000 000, NUM_FILE_DELETES = 10 000 ** dir fsync took 34 ms after deleting 10000 files (-94.2%) ** before, NUM_FILES = 100 000, NUM_FILE_DELETES = 1 000 ** dir fsync took 50 ms after deleting 1000 files ** after, NUM_FILES = 100 000, NUM_FILE_DELETES = 1 000 ** dir fsync took 7 ms after deleting 1000 files (-86.0%) ** before, NUM_FILES = 10 000, NUM_FILE_DELETES = 100 ** dir fsync took 9 ms after deleting 100 files ** after, NUM_FILES = 10 000, NUM_FILE_DELETES = 100 ** dir fsync took 5 ms after deleting 100 files (-44.4%) Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
a450a4af74 |
btrfs: don't log unnecessary boundary keys when logging directory
Before we start to log dir index keys from a leaf, we check if there is a previous index key, which normally is at the end of a leaf that was not changed in the current transaction. Then we log that key and set the start of logged range (item of type BTRFS_DIR_LOG_INDEX_KEY) to the offset of that key. This is to ensure that if there were deleted index keys between that key and the first key we are going to log, those deletions are replayed in case we need to replay to the log after a power failure. However we really don't need to log that previous key, we can just set the start of the logged range to that key's offset plus 1. This achieves the same and avoids logging one dir index key. The same logic is performed when we finish logging the index keys of a leaf and we find that the next leaf has index keys and was not changed in the current transaction. We are logging the first key of that next leaf and use its offset as the end of range we log. This is just to ensure that if there were deleted index keys between the last index key we logged and the first key of that next leaf, those index keys are deleted if we end up replaying the log. However that is not necessary, we can avoid logging that first index key of the next leaf and instead set the end of the logged range to match the offset of that index key minus 1. So avoid logging those index keys at the boundaries and adjust the start and end offsets of the logged ranges as described above. This patch is part of a patchset comprised of the following patches: 1/4 btrfs: don't log unnecessary boundary keys when logging directory 2/4 btrfs: put initial index value of a directory in a constant 3/4 btrfs: stop copying old dir items when logging a directory 4/4 btrfs: stop trying to log subdirectories created in past transactions Performance test results are listed in the changelog of patch 3/4. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
c816d705b9 |
btrfs: remove write and wait of struct walk_control
The ->write and ->wait fields of struct walk_control, used for log trees, are not used since 2008, more specifically since commit |
||
Filipe Manana
|
4751dc9962 |
btrfs: add missing run of delayed items after unlink during log replay
During log replay, whenever we need to check if a name (dentry) exists in a directory we do searches on the subvolume tree for inode references or or directory entries (BTRFS_DIR_INDEX_KEY keys, and BTRFS_DIR_ITEM_KEY keys as well, before kernel 5.17). However when during log replay we unlink a name, through btrfs_unlink_inode(), we may not delete inode references and dir index keys from a subvolume tree and instead just add the deletions to the delayed inode's delayed items, which will only be run when we commit the transaction used for log replay. This means that after an unlink operation during log replay, if we attempt to search for the same name during log replay, we will not see that the name was already deleted, since the deletion is recorded only on the delayed items. We run delayed items after every unlink operation during log replay, except at unlink_old_inode_refs() and at add_inode_ref(). This was due to an overlook, as delayed items should be run after evert unlink, for the reasons stated above. So fix those two cases. Fixes: |
||
Filipe Manana
|
d994788743 |
btrfs: fix lost prealloc extents beyond eof after full fsync
When doing a full fsync, if we have prealloc extents beyond (or at) eof, and the leaves that contain them were not modified in the current transaction, we end up not logging them. This results in losing those extents when we replay the log after a power failure, since the inode is truncated to the current value of the logged i_size. Just like for the fast fsync path, we need to always log all prealloc extents starting at or beyond i_size. The fast fsync case was fixed in commit |
||
Filipe Manana
|
40cdc50987 |
btrfs: skip reserved bytes warning on unmount after log cleanup failure
After the recent changes made by commit |
||
Josef Bacik
|
8697b8f88e |
btrfs: do not check -EAGAIN when truncating inodes in the log root
We only throttle the btrfs_truncate_inode_items if the root is SHAREABLE, which isn't set on the log root, which means this loop is unnecessary. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Josef Bacik
|
71d18b5354 |
btrfs: add inode to truncate control
In the future we're going to want to use btrfs_truncate_inode_items without looking up the associated inode. In order to accommodate this add the inode to btrfs_truncate_control and handle the case where control->inode is NULL appropriately. This is fairly straightforward, we simply need to add a helper for the trace points, as the file extent map update is controlled by a flag on btrfs_truncate_control. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Josef Bacik
|
487e81d2a4 |
btrfs: pass the ino via truncate control
In the future we are going to want to truncate inode items without needing to have an btrfs_inode to pass in, so add ino to the btrfs_truncate_control and use that to look up the inode items to truncate. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Josef Bacik
|
5caa490ed8 |
btrfs: control extent reference updates with a control flag for truncate
We've had weird bugs in the past where we forgot to adjust the truncate path to deal with the fact that we can be called by the tree log path. Instead of checking if our root is a LOG_ROOT use a flag on the btrfs_truncate_control to indicate that we don't want to do extent reference updates during this truncate. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Josef Bacik
|
d9ac19c380 |
btrfs: add truncate control struct
I'm going to be adding more arguments and counters to btrfs_truncate_inode_items, so add a control struct to handle all of the extra arguments to make it easier to follow. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Josef Bacik
|
26c2c4540d |
btrfs: add an inode-item.h
We have a few helpers in inode-item.c, and I'm going to make a few changes to how we do truncate in the future, so break out these definitions into their own header file to trim down ctree.h some and make it easier to do the work on truncate in the future. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Josef Bacik
|
fc28b25e1f |
btrfs: stop accessing ->csum_root directly
We are going to have multiple csum roots in the future, so convert all users of ->csum_root to btrfs_csum_root() and rename ->csum_root to ->_csum_root so we can easily find remaining users in the future. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Josef Bacik
|
3212fa14e7 |
btrfs: drop the _nr from the item helpers
Now that all call sites are using the slot number to modify item values, rename the SETGET helpers to raw_item_*(), and then rework the _nr() helpers to be the btrfs_item_*() btrfs_set_item_*() helpers, and then rename all of the callers to the new helpers. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
ccae4a19c9 |
btrfs: remove no longer needed logic for replaying directory deletes
Now that we log only dir index keys when logging a directory, we no longer need to deal with dir item keys in the log replay code for replaying directory deletes. This is also true for the case when we replay a log tree created by a kernel that still logs dir items. So remove the remaining code of the replay of directory deletes algorithm that deals with dir item keys. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
339d035424 |
btrfs: only copy dir index keys when logging a directory
Currently, when logging a directory, we copy both dir items and dir index items from the fs/subvolume tree to the log tree. Both items have exactly the same data (same struct btrfs_dir_item), the difference lies in the key values, where a dir index key contains the index number of a directory entry while the dir item key does not, as it's used for doing fast lookups of an entry by name, while the former is used for sorting entries when listing a directory. We can exploit that and log only the dir index items, since they contain all the information needed to correctly add, replace and delete directory entries when replaying a log tree. Logging only the dir index items is also backward and forward compatible: an unpatched kernel (without this change) can correctly replay a log tree generated by a patched kernel (with this patch), and a patched kernel can correctly replay a log tree generated by an unpatched kernel. The backward compatibility is ensured because: 1) For inserting a new dentry: a dentry is only inserted when we find a new dir index key - we can only insert if we know the dir index offset, which is encoded in the dir index key's offset; 2) For deleting dentries: during log replay, before adding or replacing dentries, we first replay dentry deletions. Whenever we find a dir item key or a dir index key in the subvolume/fs tree that is not logged in a range for which the log tree is authoritative, we do the unlink of the dentry, which removes both the existing dir item key and the dir index key. Therefore logging just dir index keys is enough to ensure dentry deletions are correctly replayed; 3) For dentry replacements: they work when we log only dir index keys and this is mostly due to a combination of 1) and 2). If we replace a dentry with name "foobar" to point from inode A to inode B, then we know the dir index key for the new dentry is different from the old one, as it has an index number (key offset) larger than the old one. This results in replaying a deletion, through replay_dir_deletes(), that causes the old dentry to be removed, both the dir item key and the dir index key, as mentioned at 2). Then when processing the new dir index key, we add the new dentry, adding both a new dir item key and a new index key pointing to inode B, as stated in 1). The forward compatibility, the ability for a patched kernel to replay a log created by an older, unpatched kernel, comes from the changes required for making sure we are able to replay a log that only contains dir index keys - we simply ignore every dir item key we find. So modify directory logging to log only dir index items, and modify the log replay process to ignore dir item keys, from log trees created by an unpatched kernel, and process only with dir index keys. This reduces the amount of logged metadata by about half, and therefore the time spent logging or fsyncing large directories (less CPU time and less IO). The following test script was used to measure this change: #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 NUM_NEW_FILES=1000000 NUM_FILE_DELETES=10000 mkfs.btrfs -f $DEV mount -o ssd $DEV $MNT mkdir $MNT/testdir for ((i = 1; i <= $NUM_NEW_FILES; i++)); do echo -n > $MNT/testdir/file_$i done start=$(date +%s%N) xfs_io -c "fsync" $MNT/testdir end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "dir fsync took $dur ms after adding $NUM_NEW_FILES files" # sync to force transaction commit and wipeout the log. sync del_inc=$(( $NUM_NEW_FILES / $NUM_FILE_DELETES )) for ((i = 1; i <= $NUM_NEW_FILES; i += $del_inc)); do rm -f $MNT/testdir/file_$i done start=$(date +%s%N) xfs_io -c "fsync" $MNT/testdir end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "dir fsync took $dur ms after deleting $NUM_FILE_DELETES files" echo umount $MNT The tests were run on a physical machine, with a non-debug kernel (Debian's default kernel config), for different values of $NUM_NEW_FILES and $NUM_FILE_DELETES, and the results were the following: ** Before patch, NUM_NEW_FILES = 1 000 000, NUM_DELETE_FILES = 10 000 ** dir fsync took 8412 ms after adding 1000000 files dir fsync took 500 ms after deleting 10000 files ** After patch, NUM_NEW_FILES = 1 000 000, NUM_DELETE_FILES = 10 000 ** dir fsync took 4252 ms after adding 1000000 files (-49.5%) dir fsync took 269 ms after deleting 10000 files (-46.2%) ** Before patch, NUM_NEW_FILES = 100 000, NUM_DELETE_FILES = 1 000 ** dir fsync took 745 ms after adding 100000 files dir fsync took 59 ms after deleting 1000 files ** After patch, NUM_NEW_FILES = 100 000, NUM_DELETE_FILES = 1 000 ** dir fsync took 404 ms after adding 100000 files (-45.8%) dir fsync took 31 ms after deleting 1000 files (-47.5%) ** Before patch, NUM_NEW_FILES = 10 000, NUM_DELETE_FILES = 1 000 ** dir fsync took 67 ms after adding 10000 files dir fsync took 9 ms after deleting 1000 files ** After patch, NUM_NEW_FILES = 10 000, NUM_DELETE_FILES = 1 000 ** dir fsync took 36 ms after adding 10000 files (-46.3%) dir fsync took 5 ms after deleting 1000 files (-44.4%) ** Before patch, NUM_NEW_FILES = 1 000, NUM_DELETE_FILES = 100 ** dir fsync took 9 ms after adding 1000 files dir fsync took 4 ms after deleting 100 files ** After patch, NUM_NEW_FILES = 1 000, NUM_DELETE_FILES = 100 ** dir fsync took 7 ms after adding 1000 files (-22.2%) dir fsync took 3 ms after deleting 100 files (-25.0%) Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
1b2e5e5c7f |
btrfs: fix missing last dir item offset update when logging directory
When logging a directory, once we finish processing a leaf that is full
of dir items, if we find the next leaf was not modified in the current
transaction, we grab the first key of that next leaf and log it as to
mark the end of a key range boundary.
However we did not update the value of ctx->last_dir_item_offset, which
tracks the offset of the last logged key. This can result in subsequent
logging of the same directory in the current transaction to not realize
that key was already logged, and then add it to the middle of a batch
that starts with a lower key, resulting later in a leaf with one key
that is duplicated and at non-consecutive slots. When that happens we get
an error later when writing out the leaf, reporting that there is a pair
of keys in wrong order. The report is something like the following:
Dec 13 21:44:50 kernel: BTRFS critical (device dm-0): corrupt leaf:
root=18446744073709551610 block=118444032 slot=21, bad key order, prev
(704687 84 4146773349) current (704687 84 1063561078)
Dec 13 21:44:50 kernel: BTRFS info (device dm-0): leaf 118444032 gen
91449 total ptrs 39 free space 546 owner 18446744073709551610
Dec 13 21:44:50 kernel: item 0 key (704687 1 0) itemoff 3835
itemsize 160
Dec 13 21:44:50 kernel: inode generation 35532 size
1026 mode 40755
Dec 13 21:44:50 kernel: item 1 key (704687 12 704685) itemoff
3822 itemsize 13
Dec 13 21:44:50 kernel: item 2 key (704687 24 3817753667)
itemoff 3736 itemsize 86
Dec 13 21:44:50 kernel: item 3 key (704687 60 0) itemoff 3728 itemsize 8
Dec 13 21:44:50 kernel: item 4 key (704687 72 0) itemoff 3720 itemsize 8
Dec 13 21:44:50 kernel: item 5 key (704687 84 140445108)
itemoff 3666 itemsize 54
Dec 13 21:44:50 kernel: dir oid 704793 type 1
Dec 13 21:44:50 kernel: item 6 key (704687 84 298800632)
itemoff 3599 itemsize 67
Dec 13 21:44:50 kernel: dir oid 707849 type 2
Dec 13 21:44:50 kernel: item 7 key (704687 84 476147658)
itemoff 3532 itemsize 67
Dec 13 21:44:50 kernel: dir oid 707901 type 2
Dec 13 21:44:50 kernel: item 8 key (704687 84 633818382)
itemoff 3471 itemsize 61
Dec 13 21:44:50 kernel: dir oid 704694 type 2
Dec 13 21:44:50 kernel: item 9 key (704687 84 654256665)
itemoff 3403 itemsize 68
Dec 13 21:44:50 kernel: dir oid 707841 type 1
Dec 13 21:44:50 kernel: item 10 key (704687 84 995843418)
itemoff 3331 itemsize 72
Dec 13 21:44:50 kernel: dir oid 2167736 type 1
Dec 13 21:44:50 kernel: item 11 key (704687 84 1063561078)
itemoff 3278 itemsize 53
Dec 13 21:44:50 kernel: dir oid 704799 type 2
Dec 13 21:44:50 kernel: item 12 key (704687 84 1101156010)
itemoff 3225 itemsize 53
Dec 13 21:44:50 kernel: dir oid 704696 type 1
Dec 13 21:44:50 kernel: item 13 key (704687 84 2521936574)
itemoff 3173 itemsize 52
Dec 13 21:44:50 kernel: dir oid 704704 type 2
Dec 13 21:44:50 kernel: item 14 key (704687 84 2618368432)
itemoff 3112 itemsize 61
Dec 13 21:44:50 kernel: dir oid 704738 type 1
Dec 13 21:44:50 kernel: item 15 key (704687 84 2676316190)
itemoff 3046 itemsize 66
Dec 13 21:44:50 kernel: dir oid 2167729 type 1
Dec 13 21:44:50 kernel: item 16 key (704687 84 3319104192)
itemoff 2986 itemsize 60
Dec 13 21:44:50 kernel: dir oid 704745 type 2
Dec 13 21:44:50 kernel: item 17 key (704687 84 3908046265)
itemoff 2929 itemsize 57
Dec 13 21:44:50 kernel: dir oid 2167734 type 1
Dec 13 21:44:50 kernel: item 18 key (704687 84 3945713089)
itemoff 2857 itemsize 72
Dec 13 21:44:50 kernel: dir oid 2167730 type 1
Dec 13 21:44:50 kernel: item 19 key (704687 84 4077169308)
itemoff 2795 itemsize 62
Dec 13 21:44:50 kernel: dir oid 704688 type 1
Dec 13 21:44:50 kernel: item 20 key (704687 84 4146773349)
itemoff 2727 itemsize 68
Dec 13 21:44:50 kernel: dir oid 707892 type 1
Dec 13 21:44:50 kernel: item 21 key (704687 84 1063561078)
itemoff 2674 itemsize 53
Dec 13 21:44:50 kernel: dir oid 704799 type 2
Dec 13 21:44:50 kernel: item 22 key (704687 96 2) itemoff 2612
itemsize 62
Dec 13 21:44:50 kernel: item 23 key (704687 96 6) itemoff 2551
itemsize 61
Dec 13 21:44:50 kernel: item 24 key (704687 96 7) itemoff 2498
itemsize 53
Dec 13 21:44:50 kernel: item 25 key (704687 96 12) itemoff
2446 itemsize 52
Dec 13 21:44:50 kernel: item 26 key (704687 96 14) itemoff
2385 itemsize 61
Dec 13 21:44:50 kernel: item 27 key (704687 96 18) itemoff
2325 itemsize 60
Dec 13 21:44:50 kernel: item 28 key (704687 96 24) itemoff
2271 itemsize 54
Dec 13 21:44:50 kernel: item 29 key (704687 96 28) itemoff
2218 itemsize 53
Dec 13 21:44:50 kernel: item 30 key (704687 96 62) itemoff
2150 itemsize 68
Dec 13 21:44:50 kernel: item 31 key (704687 96 66) itemoff
2083 itemsize 67
Dec 13 21:44:50 kernel: item 32 key (704687 96 75) itemoff
2015 itemsize 68
Dec 13 21:44:50 kernel: item 33 key (704687 96 79) itemoff
1948 itemsize 67
Dec 13 21:44:50 kernel: item 34 key (704687 96 82) itemoff
1882 itemsize 66
Dec 13 21:44:50 kernel: item 35 key (704687 96 83) itemoff
1810 itemsize 72
Dec 13 21:44:50 kernel: item 36 key (704687 96 85) itemoff
1753 itemsize 57
Dec 13 21:44:50 kernel: item 37 key (704687 96 87) itemoff
1681 itemsize 72
Dec 13 21:44:50 kernel: item 38 key (704694 1 0) itemoff 1521
itemsize 160
Dec 13 21:44:50 kernel: inode generation 35534 size 30
mode 40755
Dec 13 21:44:50 kernel: BTRFS error (device dm-0): block=118444032
write time tree block corruption detected
So fix that by adding the missing update of ctx->last_dir_item_offset with
the offset of the boundary key.
Reported-by: Chris Murphy <lists@colorremedies.com>
Link: https://lore.kernel.org/linux-btrfs/CAJCQCtT+RSzpUjbMq+UfzNUMe1X5+1G+DnAGbHC=OZ=iRS24jg@mail.gmail.com/
Fixes:
|
||
Jianglei Nie
|
f35838a693 |
btrfs: fix memory leak in __add_inode_ref()
Line 1169 (#3) allocates a memory chunk for victim_name by kmalloc(),
but when the function returns in line 1184 (#4) victim_name allocated
by line 1169 (#3) is not freed, which will lead to a memory leak.
There is a similar snippet of code in this function as allocating a memory
chunk for victim_name in line 1104 (#1) as well as releasing the memory
in line 1116 (#2).
We should kfree() victim_name when the return value of backref_in_log()
is less than zero and before the function returns in line 1184 (#4).
1057 static inline int __add_inode_ref(struct btrfs_trans_handle *trans,
1058 struct btrfs_root *root,
1059 struct btrfs_path *path,
1060 struct btrfs_root *log_root,
1061 struct btrfs_inode *dir,
1062 struct btrfs_inode *inode,
1063 u64 inode_objectid, u64 parent_objectid,
1064 u64 ref_index, char *name, int namelen,
1065 int *search_done)
1066 {
1104 victim_name = kmalloc(victim_name_len, GFP_NOFS);
// #1: kmalloc (victim_name-1)
1105 if (!victim_name)
1106 return -ENOMEM;
1112 ret = backref_in_log(log_root, &search_key,
1113 parent_objectid, victim_name,
1114 victim_name_len);
1115 if (ret < 0) {
1116 kfree(victim_name); // #2: kfree (victim_name-1)
1117 return ret;
1118 } else if (!ret) {
1169 victim_name = kmalloc(victim_name_len, GFP_NOFS);
// #3: kmalloc (victim_name-2)
1170 if (!victim_name)
1171 return -ENOMEM;
1180 ret = backref_in_log(log_root, &search_key,
1181 parent_objectid, victim_name,
1182 victim_name_len);
1183 if (ret < 0) {
1184 return ret; // #4: missing kfree (victim_name-2)
1185 } else if (!ret) {
1241 return 0;
1242 }
Fixes:
|
||
Naohiro Aota
|
84c2544892 |
btrfs: fix re-dirty process of tree-log nodes
There is a report of a transaction abort of -EAGAIN with the following
script.
#!/bin/sh
for d in sda sdb; do
mkfs.btrfs -d single -m single -f /dev/\${d}
done
mount /dev/sda /mnt/test
mount /dev/sdb /mnt/scratch
for dir in test scratch; do
echo 3 >/proc/sys/vm/drop_caches
fio --directory=/mnt/\${dir} --name=fio.\${dir} --rw=read --size=50G --bs=64m \
--numjobs=$(nproc) --time_based --ramp_time=5 --runtime=480 \
--group_reporting |& tee /dev/shm/fio.\${dir}
echo 3 >/proc/sys/vm/drop_caches
done
for d in sda sdb; do
umount /dev/\${d}
done
The stack trace is shown in below.
[3310.967991] BTRFS: error (device sda) in btrfs_commit_transaction:2341: errno=-11 unknown (Error while writing out transaction)
[3310.968060] BTRFS info (device sda): forced readonly
[3310.968064] BTRFS warning (device sda): Skipping commit of aborted transaction.
[3310.968065] ------------[ cut here ]------------
[3310.968066] BTRFS: Transaction aborted (error -11)
[3310.968074] WARNING: CPU: 14 PID: 1684 at fs/btrfs/transaction.c:1946 btrfs_commit_transaction.cold+0x209/0x2c8
[3310.968131] CPU: 14 PID: 1684 Comm: fio Not tainted 5.14.10-300.fc35.x86_64 #1
[3310.968135] Hardware name: DIAWAY Tartu/Tartu, BIOS V2.01.B10 04/08/2021
[3310.968137] RIP: 0010:btrfs_commit_transaction.cold+0x209/0x2c8
[3310.968144] RSP: 0018:ffffb284ce393e10 EFLAGS: 00010282
[3310.968147] RAX: 0000000000000026 RBX: ffff973f147b0f60 RCX: 0000000000000027
[3310.968149] RDX: ffff974ecf098a08 RSI: 0000000000000001 RDI: ffff974ecf098a00
[3310.968150] RBP: ffff973f147b0f08 R08: 0000000000000000 R09: ffffb284ce393c48
[3310.968151] R10: ffffb284ce393c40 R11: ffffffff84f47468 R12: ffff973f101bfc00
[3310.968153] R13: ffff971f20cf2000 R14: 00000000fffffff5 R15: ffff973f147b0e58
[3310.968154] FS: 00007efe65468740(0000) GS:ffff974ecf080000(0000) knlGS:0000000000000000
[3310.968157] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[3310.968158] CR2: 000055691bcbe260 CR3: 000000105cfa4001 CR4: 0000000000770ee0
[3310.968160] PKRU: 55555554
[3310.968161] Call Trace:
[3310.968167] ? dput+0xd4/0x300
[3310.968174] btrfs_sync_file+0x3f1/0x490
[3310.968180] __x64_sys_fsync+0x33/0x60
[3310.968185] do_syscall_64+0x3b/0x90
[3310.968190] entry_SYSCALL_64_after_hwframe+0x44/0xae
[3310.968194] RIP: 0033:0x7efe6557329b
[3310.968200] RSP: 002b:00007ffe0236ebc0 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[3310.968203] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007efe6557329b
[3310.968204] RDX: 0000000000000000 RSI: 00007efe58d77010 RDI: 0000000000000006
[3310.968205] RBP: 0000000004000000 R08: 0000000000000000 R09: 00007efe58d77010
[3310.968207] R10: 0000000016cacc0c R11: 0000000000000293 R12: 00007efe5ce95980
[3310.968208] R13: 0000000000000000 R14: 00007efe6447c790 R15: 0000000c80000000
[3310.968212] ---[ end trace 1a346f4d3c0d96ba ]---
[3310.968214] BTRFS: error (device sda) in cleanup_transaction:1946: errno=-11 unknown
The abort occurs because of a write hole while writing out freeing tree
nodes of a tree-log tree. For zoned btrfs, we re-dirty a freed tree
node to ensure btrfs can write the region and does not leave a hole on
write on a zoned device. The current code fails to re-dirty a node
when the tree-log tree's depth is greater or equal to 2. That leads to
a transaction abort with -EAGAIN.
Fix the issue by properly re-dirtying a node on walking up the tree.
Fixes:
|
||
Filipe Manana
|
d1ed82f355 |
btrfs: remove root argument from check_item_in_log()
The root argument passed to check_item_in_log() always matches the root of the given directory, so it can be eliminated. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
6d9cc07215 |
btrfs: remove root argument from add_link()
The root argument for tree-log.c:add_link() always matches the root of the given directory and the given inode, so it can eliminated. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
4467af8809 |
btrfs: remove root argument from btrfs_unlink_inode()
The root argument passed to btrfs_unlink_inode() and its callee, __btrfs_unlink_inode(), always matches the root of the given directory and the given inode. So remove the argument and make __btrfs_unlink_inode() use the root of the directory. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
9798ba24cb |
btrfs: remove root argument from drop_one_dir_item()
The root argument for drop_one_dir_item() always matches the root of the given directory inode, since each log tree is associated to one and only one subvolume/root, so remove the argument. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
10adb1152d |
btrfs: fix lost error handling when replaying directory deletes
At replay_dir_deletes(), if find_dir_range() returns an error we break out of the main while loop and then assign a value of 0 (success) to the 'ret' variable, resulting in completely ignoring that an error happened. Fix that by jumping to the 'out' label when find_dir_range() returns an error (negative value). CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Nikolay Borisov
|
f42c5da6c1 |
btrfs: add additional parameters to btrfs_init_tree_ref/btrfs_init_data_ref
In order to make 'real_root' used only in ref-verify it's required to have the necessary context to perform the same checks that this member is used for. So add 'mod_root' which will contain the root on behalf of which a delayed ref was created and a 'skip_group' parameter which will contain callsite-specific override of skip_qgroup. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Josef Bacik
|
8496153945 |
btrfs: add a BTRFS_FS_ERROR helper
We have a few flags that are inconsistently used to describe the fs in different states of failure. As of |
||
Josef Bacik
|
9a35fc9542 |
btrfs: change error handling for btrfs_delete_*_in_log
Currently we will abort the transaction if we get a random error (like -EIO) while trying to remove the directory entries from the root log during rename. However since these are simply log tree related errors, we can mark the trans as needing a full commit. Then if the error was truly catastrophic we'll hit it during the normal commit and abort as appropriate. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Josef Bacik
|
ba51e2a11e |
btrfs: change handle_fs_error in recover_log_trees to aborts
During inspection of the return path for replay I noticed that we don't actually abort the transaction if we get a failure during replay. This isn't a problem necessarily, as we properly return the error and will fail to mount. However we still leave this dangling transaction that could conceivably be committed without thinking there was an error. We were using btrfs_handle_fs_error() here, but that pre-dates the transaction abort code. Simply replace the btrfs_handle_fs_error() calls with transaction aborts, so we still know where exactly things went wrong, and add a few in some other un-handled error cases. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
da1b811fcd |
btrfs: use single bulk copy operations when logging directories
When logging a directory and inserting a batch of directory items, we are copying the data of each item from a leaf in the fs/subvolume tree to a leaf in a log tree, separately. This is not really needed, since we are copying from a contiguous memory area into another one, so we can use a single copy operation to copy all items at once. This patch is part of a small patchset that is comprised of the following patches: btrfs: loop only once over data sizes array when inserting an item batch btrfs: unexport setup_items_for_insert() btrfs: use single bulk copy operations when logging directories This is patch 3/3. The following test was used to compare performance of a branch without the patchset versus one branch that has the whole patchset applied: $ cat dir-fsync-test.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 NUM_NEW_FILES=1000000 NUM_FILE_DELETES=1000 LEAF_SIZE=16K mkfs.btrfs -f -n $LEAF_SIZE $DEV mount -o ssd $DEV $MNT mkdir $MNT/testdir for ((i = 1; i <= $NUM_NEW_FILES; i++)); do echo -n > $MNT/testdir/file_$i done # Fsync the directory, this will log the new dir items and the inodes # they point to, because these are new inodes. start=$(date +%s%N) xfs_io -c "fsync" $MNT/testdir end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "dir fsync took $dur ms after adding $NUM_NEW_FILES files" # sync to force transaction commit and wipeout the log. sync del_inc=$(( $NUM_NEW_FILES / $NUM_FILE_DELETES )) for ((i = 1; i <= $NUM_NEW_FILES; i += $del_inc)); do rm -f $MNT/testdir/file_$i done # Fsync the directory, this will only log dir items, there are no # dentries pointing to new inodes. start=$(date +%s%N) xfs_io -c "fsync" $MNT/testdir end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "dir fsync took $dur ms after deleting $NUM_FILE_DELETES files" umount $MNT The tests were run on a non-debug kernel (Debian's default kernel config) and were the following: *** with a leaf size of 16K, before patchset *** dir fsync took 8482 ms after adding 1000000 files dir fsync took 166 ms after deleting 1000 files *** with a leaf size of 16K, after patchset *** dir fsync took 8196 ms after adding 1000000 files (-3.4%) dir fsync took 143 ms after deleting 1000 files (-14.9%) *** with a leaf size of 64K, before patchset *** dir fsync took 12851 ms after adding 1000000 files dir fsync took 466 ms after deleting 1000 files *** with a leaf size of 64K, after patchset *** dir fsync took 12287 ms after adding 1000000 files (-4.5%) dir fsync took 414 ms after deleting 1000 files (-11.8%) Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |