For non-compressed non-hole file extent items, the ram_bytes should
match disk_num_bytes.
But due to kernel bugs, we have several cases where ram_bytes is not
correctly updated.
Thankfully this is really a very minor mismatch and can never cause data
corruption since the kernel does not utilize ram_bytes for
non-compressed extents at all.
So here we just detect and repair it for original mode.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For non-compressed non-hole file extent items, the ram_bytes should
match disk_num_bytes.
But due to kernel bugs, we have several cases where ram_bytes is not
correctly updated.
Thankfully this is really a very minor mismatch and can never cause data
corruption since the kernel does not utilize ram_bytes for
non-compressed extents at all.
So here we just detect and repair it for lowmem mode.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The inode_cache functionality is long gone and the 'rescue' group
provides the clearning functionality, no point keeping it in check.
Move the --clear-space-cache option to the deprecaeted section so it can
be removed soon.
Signed-off-by: David Sterba <dsterba@suse.com>
Use the objectid, type, offset natural order as it's more readable and
we're used to read keys like that.
Signed-off-by: David Sterba <dsterba@suse.com>
Some steps don't seem to have a message printed when they start, like
the tree-log clearing or checksum fill phase.
Signed-off-by: David Sterba <dsterba@suse.com>
There's an early check of some critical roots right after opening the
filesystem but there's only one message. Check the same roots but print
message for each so it's more specific.
Signed-off-by: David Sterba <dsterba@suse.com>
Use the right helper for unrecognized options so only the unknown one is
printed instead of the whole help text. Also move the case to the end as
is common elsewhere.
Signed-off-by: David Sterba <dsterba@suse.com>
Use the templated error message for transaction start failures, use the
same pattern assigning the ret and errno.
Signed-off-by: David Sterba <dsterba@suse.com>
Turn all BUG_ONs to error handling and push it to the caller. The error
conditions are almost certainly corruptions so we can't do anything
about that.
Signed-off-by: David Sterba <dsterba@suse.com>
The error values of enter_shared_node() are mixing int and bool, unify
that to be 1 == true, 0 == false, <0 errors. Update callers to handle
errors.
Inline the add_shared_node() helper as it's trivial and makes handling
errors easier. As all errors can be now returned, do proper error
handling instead of all remaining BUG_ONs.
Signed-off-by: David Sterba <dsterba@suse.com>
Handle the BUG_ONs inside splice_shared_node() and move them to the
callers. As there's a big loop and external tree cache updated there's
not error cleanup done.
Signed-off-by: David Sterba <dsterba@suse.com>
Free the newly allocated structures when 'mod' is requests and insertion
fails. All exit paths from the function now don't leave anything to
clean up.
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
There is a report about failed btrfs-convert, which shows the following
error:
Create btrfs metadata
corrupt leaf: root=5 block=5001931145216 slot=1 ino=89911763, invalid previous key objectid, have 89911762 expect 89911763
leaf 5001931145216 items 336 free space 7 generation 90 owner FS_TREE
leaf 5001931145216 flags 0x1(WRITTEN) backref revision 1
fs uuid 8b69f018-37c3-4b30-b859-42ccfcbe2449
chunk uuid 448ce78c-ea41-49f6-99dc-46ad80b93da9
item 0 key (89911762 INODE_REF 3858733) itemoff 16222 itemsize 61
index 171 namelen 51 name: [FILENAME1]
item 1 key (89911763 INODE_REF 3858733) itemoff 16161 itemsize 61
index 103 namelen 51 name: [FILENAME2]
[CAUSE]
When iterating a directory, btrfs-convert would insert the DIR_ITEMs,
along with the INODE_REF of that inode.
This leads to above stray INODE_REFs, and trigger the tree-checker.
This can only happen for large fs, as for most cases we have all these
modified tree blocks cached, thus tree-checker won't be triggered.
But when the tree block cache is not hit, and we have to read from disk,
then such behavior can lead to above tree-checker error.
[FIX]
Insert a dummy INODE_ITEM for the INODE_REF first, the inode items would
be updated when iterating the child inode of the directory.
Issue: #731
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Recently we had a scrub use-after-free caused by unaligned chunk
length, although the fix was submitted, we may want to do extra checks
for a chunk's alignment.
This patch adds such check for the starting bytenr and length of a
chunk, to make sure they are properly aligned to 64K stripe boundary.
By default, the check only leads to a warning but is not treated as an
error, as we expect kernel to handle such unalignment without any
problem.
But if the new debug environmental variable,
BTRFS_PROGS_DEBUG_STRICT_CHUNK_ALIGNMENT, is specified, then we will
treat it as an error. So that we can detect unexpected chunks from
btrfs-progs, and fix them before reaching the end users.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since we're already directing the end user to use "btrfs rescue
clear-ino-cache" command, there is not much need to support it in
btrfs-check.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Lowmem mode has improved quite a lot since its introduction, for
read-only check it's definitely fine.
For repair mode, both lowmem and original mode are considered dangerous
especially for complex corruptions with unknown cause.
For now lowmem mode is only bad at fixing fundamentally corrupted cases,
like bad shift offsets or transid, which in real world it's not an easy
repair for the original mode either.
This patch would move the --mode option out of the dangerous section and
update the notes for the lowmem mode on its limitation.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The option "--clear-space-cache" is not really that suitable for "btrfs
check" group, as there are some concerns:
- Allowing transid mismatch
- No leaf item checks
Thoe behaviour are inherited from the default open ctree flags for
"btrfs check", which can be unsafe if the end user just wants to clear
the cache.
- Unclear if the cache clearing would happen along with repair
Thankfully the clearing of space cache is done without any repair
Thus there is a proposal to move space cache removal to rescue group,
and this patch would do that exactly.
However this would lead to some behavior changes:
- Transid mismatch would be treated as error
- Leaf items size/offset would still be checked
If we hit any above error, we should just abort without doing any
write.
These change would increase the safety of the space cache removal, thus
I believe it's worthy to introduce such behavior change.
Since we're here, also add a small explanation on why we need this
dedicated tool to clear space cache (especially for v1 cache).
Issue: #698
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Recently the functionality has been added to the 'rescue' group and
check prints a warning when the option is used but this should be also
visible in the help text.
Signed-off-by: David Sterba <dsterba@suse.com>
Bit shifts should be done on unsigned types as we're approaching 32,
also update some missing descriptions.
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 6cf11f3e38 ("btrfs-progs: check: check order of inline extent
refs") added the ability to detect out-of-order inline extent backref
items.
Meanwhile there is no such ability in lowmem mode, this patch would
introduce such ability to lowmem mode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The kernel seems to order inline extent items in a particular way:
forward by sub-type, then reverse by hash. Having these out of order can
cause a volume to go readonly when deleting an inode.
See https://github.com/maharmstone/ntfs2btrfs/issues/51
With additional comments from the pull request:
- lookup_inline_extent_backref() is skipping the remaining backref item
if data/metadata item is smaller (either through the data hash, or
metadata parent/ref_root) than the target range
- the fix could be still missing in lowmem mode
- image could be created according this comment
https://github.com/maharmstone/ntfs2btrfs/issues/51#issuecomment-1500781204
- due to late merge, the squota newly added key
BTRFS_EXTENT_OWNER_REF_KEY was not part of the patch and the value of
'hash' needs to be verified
Pull-request: #622
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Unlike kernel, in btrfs-progs btrfs_start_transaction() never checks if
there is enough metadata space.
This can lead to very dangerous situation where there is no metadata
space left at all, deadlocking future tree operations.
This patch introduces a very basic version of metadata/system free space
check by:
- Check if there is enough metadata/system space left
If there is enough, go as usual.
- If there is not enough space left, try allocating a new chunk
- Recheck if the new space can meet our demand
If not, return ERR_PTR(-ENOSPC).
Otherwise, allocate a new trans handle to the caller.
This is possible thanks to the simplified transaction model in
btrfs-progs:
- We don't allow joining a transaction
This means we don't need to handle complex cases like data ordered
extents, which need to reserve space first, then join the current
transaction and use the reserved blocks.
- We don't allow multiple transaction handles for one transaction
Since btrfs-progs is single threaded, we always start a transaction
and then commit it.
However there is a feature that must be an exception for the new
metadata/system free space check:
- btrfs check --init-extent-tree
As all the meta/system free space check is based on the space info,
which is loaded from block group items.
Thus when rebuilding extent tree, we can no longer have an accurate
view, thus we have to disable the feature for the whole execution if
we're rebuilding the extent tree.
For now, there is no regression exposed during the self tests, but I
really hope this can be an extra safety net to prevent causing ENOSPC
deadlock in btrfs-progs.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The option "--clear-ino-cache" is not really that suitable for "btrfs
check" group.
Let's move it to "btrfs rescue" group to fix those small hiccups, just
like the existing "btrfs rescue fix-device-size" command.
For now, "btrfs check --clear-ino-cache" would still work, with one
extra warning referring to "btrfs rescue clear-ino-cache".
This is mostly to reduce the surprise, and keep script users (I doubt if
there is any though) happy for now.
In the next or two releases, we would fully remove the support in "btrfs
check" group.
Another small change is, in the documents, we refer to the feature as
"inode map", which doesn't match with the mount option documents.
Since we're here, unify them to "inode cache" feature.
Issue: #669
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The clear-cache functionality is shared by several commands:
- btrfs check
For --clear-cache and --clear-ino-cache.
- btrfstune
Mostly for block-group-tree feature conversion.
- btrfs-convert
To enable the now default v2 space cache.
Thus it's no longer proper to keep clear-cache.[ch] under check/
directory, move them to common/ directory.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are quite some variable shadowing in btrfs-progs, most of them are
just reusing some common names like tmp.
And those are quite safe and the shadowed one are even different type.
But there are some exceptions:
- @end in traverse_tree_blocks()
There is already an @end with the same type, but a different meaning
(the end of the current extent buffer passed in).
Just rename it to @child_end.
- @start in generate_new_data_csums_range()
Just rename it to @csum_start.
- @size of fixup_chunk_tree_block()
This one is particularly bad, we declare a local @size and initialize
it to -1, then before we really utilize the variable @size, we
immediately reset it to 0, then pass it to logical_to_physical().
Then there is a location to check if @size is -1, which will always be
true.
According to the code in logical_to_physical(), @size would be clamped
down by its original value, thus our local @size will always be 0.
This patch would rename the local @size to @found_size, and only set
it to -1.
The call site is only to pass something as logical_to_physical()
requires a non-NULL pointer.
We don't really need to bother the returned value.
- duplicated @ref declaration in run_delayed_tree_ref()
- duplicated @super_flags in change_meta_csums()
Just delete the duplicated one.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are several places that we call btrfs_abort_transaction() in a
failure case, but never call btrfs_commit_transaction(). This leaks the
trans handle and the associated extent buffers and such. Fix all these
sites by making sure we call btrfs_commit_transaction() after we call
btrfs_abort_transaction() to make sure all the appropriate cleanup is
done. This gets rid of the leaked extent buffer errors.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the kernel we've added a control struct to handle the different
checks we want to do on extent buffers when we read them. Update our
copy of read_tree_block to take this as an argument, then update all of
the callers to use the new structure.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The kernel version of btrfs_del_ptr takes a trans handle as an argument
and returns an error in the case of tree-mod-log, update our version to
match to make syncing ctree.c more straightforward.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the kernel we pass in the parent to btrfs_alloc_tree_block instead of
the blocksize and simply derive the blocksize from the fs_info. Update
the function to match the kernel's convention and update all of the
callers so we can sync ctree.c easily.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This always returns 0, and in the kernel is a void. Update the
definition to match the kernel and then update all of the callers.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This helper exists for check and for btrfs-corrupt-block. Move the
helper and the btrfs_fixup_low_keys helper into check/repair.[ch] so we
can keep the kernel-shared sources close to the upstream kernel.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This simply zero's out the path, and this is used everywhere we use a
stack path. Drop this usage and simply init the path's to empty instead
of using a function to do the memset.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This was updated to include a first_slot argument, update it to match
the kernel definition to make it easier to sync ctree.c.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is used to make sure the root is updated in the tree_root when we
re-init the root, however this function is static in the kernel and
doesn't need to be exported for any reason. Simply update the root item
and then update it in the tree_root instead of adding it to the dirty
list.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add the ability to enable simple quotas from mkfs with '-O squota'
There is some complication around handling enable_gen while still
counting the root node of an fs. To handle this, employ a hack of doing
a no-op write on the root node to bump its generation up above that of
the qgroup enable generation, which results in counting it properly.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Add simple quotas checks to btrfs check.
Like the kernel feature, these checks bypass most of the backref walking
in the qgroups check. Instead, they enforce the invariant behind the
design of simple quotas by scanning the extent tree and determining the
owner of each extent:
Data: reading the owner ref inline item
Metadata: reading the tree block and reading its btrfs_header's owner
This gives us the expected count from squotas which we check against the
on-disk state of the qgroup items.
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
For a skinny metadata item in the extent tree, the key offset represents
the level of the tree it points to. This adds a check that these values
match, as otherwise it can cause a volume to go readonly when deleting a
large number of inodes.
See https://github.com/maharmstone/ntfs2btrfs/issues/51
Pull-request: #623
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The struct open_ctree_flags currently holds arguments for
open_ctree_fs_info(), it can be confusing when mixed with a local variable
named open_ctree_flags as below in the function cmd_inspect_dump_tree().
cmd_inspect_dump_tree()
::
struct open_ctree_flags ocf = { 0 };
::
unsigned open_ctree_flags;
So rename struct open_ctree_flags to struct open_ctree_args.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch would modify btrfs_csum_file_block() to handle csum type
other than the one used in the current fs.
The new data checksum would use a different objectid (-13) to
distinguish with the existing one (-10).
This needs to change tree-checker to skip the item size checks,
since new csum can be larger than the original csum.
After this stage, the resulted csum tree would look like this:
item 0 key (CSUM_CHANGE EXTENT_CSUM 13631488) itemoff 8091 itemsize 8192
range start 13631488 end 22020096 length 8388608
item 1 key (EXTENT_CSUM EXTENT_CSUM 13631488) itemoff 7067 itemsize 1024
range start 13631488 end 14680064 length 1048576
Note the itemsize is 8 times the original one, as the original csum is
CRC32, while target csum is SHA256, which is 8 times the size.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This syncs tree-checker.c from the kernel. The main modification was to
add a open ctree flag to skip the deeper leaf checks, and plumbing this
through tree-checker.c. We need this for things like fsck or
btrfs-image that need to work with slightly corrupted file systems, and
these checks simply make us unable to look at the corrupted blocks.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs-progs we check the actual leaf pointers as well as the chunk
itself in btrfs_check_chunk_valid. However in the kernel the leaf stuff
is handled separately as part of the read, and then we have the chunk
checker itself. Change the btrfs-progs version to match the in-kernel
version temporarily so it makes syncing the in-kernel code easier.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These helpers are called __btrfs_check_* in the kernel as they return
the special enum to indicate what part of the leaf/node failed. Rename
the uses in btrfs-progs to match the kernel naming convention to make it
easier to sync that code.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>