When searching the link table for the matching inode, we were searching
for a specific - incorrect - snapshot ID as well, causing us to fail to
find the inode.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This patch adds a superblock error counter for every distinct fsck
error; this means that when analyzing filesystems out in the wild we'll
be able to see what sorts of inconsistencies are being found and repair,
and hence what bugs to look for.
Errors validating bkeys are not yet considered distinct fsck errors, but
this patch adds a new helper, bkey_fsck_err(), in order to add distinct
error types for them as well.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Since we can run with unknown btree IDs, we can't directly index btree
IDs into fixed size arrays.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Be a bit more careful about when bch2_delete_dead_snapshots needs to
run: it only needs to run synchronously if we're running fsck, and it
only needs to run at all if we have snapshot nodes to delete or if fsck
has noticed that it needs to run.
Also:
Rename BCH_FS_HAVE_DELETED_SNAPSHOTS -> BCH_FS_NEED_DELETE_DEAD_SNAPSHOTS
Kill bch2_delete_dead_snapshots_hook(), move functionality to
bch2_mark_snapshot()
Factor out bch2_check_snapshot_needs_deletion(), to explicitly check
if we need to be running snapshot deletion.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
These errors aren't actual errors, and should never be printed - do this
in the common helpers.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're using more stack than we'd like in a number of functions, and
btree_trans is the biggest object that we stack allocate.
But we have to do a heap allocatation to initialize it anyways, so
there's no real downside to heap allocating the entire thing.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When we handle a transaction restart in a nested context, we need to
return -BCH_ERR_transaction_restart_nested because we invalidated the
outer context's iterators and locks.
bch2_propagate_key_to_snapshot_leaves() wasn't doing this, this patch
fixes it to use trans_was_restarted().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If fsck finds a key that needs work done, the primary example being an
unlinked inode that needs to be deleted, and the key is in an internal
snapshot node, we have a bit of a conundrum.
The conundrum is that internal snapshot nodes are shared, and we in
general do updates in internal snapshot nodes because there may be
overwrites in some snapshots and not others, and this may affect other
keys referenced by this key (i.e. extents).
For example, we might be seeing an unlinked inode in an internal
snapshot node, but then in one child snapshot the inode might have been
reattached and might not be unlinked. Deleting the inode in the internal
snapshot node would be wrong, because then we'll delete all the extents
that the child snapshot references.
But if an unlinked inode does not have any overwrites in child
snapshots, we're fine: the inode is overwrritten in all child snapshots,
so we can do the deletion at the point of comonality in the snapshot
tree, i.e. the node where we found it.
This patch adds a new helper, bch2_propagate_key_to_snapshot_leaves(),
to handle the case where we need a to update a key that does have
overwrites in child snapshots: we copy the key to leaf snapshot nodes,
and then rewind fsck and process the needed updates there.
With this, fsck can now always correctly handle unlinked inodes found in
internal snapshot nodes.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
subvolume.c has gotten a bit large, this splits out a separate file just
for managing snapshot trees - BTREE_ID_snapshots.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
A number of smallish fixes for overlapping extent repair, and (part of)
a new unit test. This fixes all the issues turned up by bhzhu203, in his
filesystem image from running mongodb + snapshots.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This introduces bch2_run_explicit_recovery_pass() and uses it for when
fsck detects that we need to re-run dead snaphots cleanup, and makes
dead snapshot cleanup more like a normal recovery pass.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We currently don't track whether snapshot cleanup still needs to finish
(aside from running a full fsck), so it shouldn't be a fsck error yet -
fsck -n after fsck has succesfully completed shouldn't error.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Delete a redundant bch2_snapshot_is_ancestor() check, and convert some
assertions to debug assertions.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Make the overlapping extent check/repair code more self contained.
This is prep work for hopefully reducing key_visible_in_snapshot() usage
here as well, and also includes a nice performance optimization to not
check ref_visible2() unless the extents potentially overlap.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This changes the main part of check_extents(), that checks the extent
against the corresponding inode, to not use key_visible_in_snapshot().
key_visible_in_snapshot() has to iterate over the list of ancestor
overwrites repeatedly calling bch2_snapshot_is_ancestor(), so this is a
significant performance improvement.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
More prep work for reducing key_visible_in_snapshot() usage - this
rearranges how KEY_TYPE_whitout keys are handled, so that they can be
marked off in inode_warker->inode->seen_this_pos.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We only want to synthesize an inode for the current snapshot ID for non
whiteouts - this refactoring lets us call walk_inode() earlier and clean
up some control flow.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Minor refactoring/dead code deletion, prep work for reworking
check_extent() to avoid key_visible_in_snapshot().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This improves the repair path for overlapping extents - we now verify
that we find in the btree the overlapping extents that the algorithm
detected, and fail the fsck run with a more useful error if it doesn't
match.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Prep work for changing check_extent() to avoid
key_visible_in_snapshot() - this adds the state to track whether an
inode has seen an extent at this pos.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This switches the generic radix tree for the in-memory table of snapshot
nodes to a simple rcu array. This means we have to add new locking to
deal with reallocations, but is faster than traversing the radix tree.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Recovery and fsck have many different passes/jobs to do, which always
run in the same order - but not all of them run all the time. Some are
for fsck, some for unclean shutdown, some for version upgrades.
This adds some new structure: a defined list of recovery passes that we
can run in a loop, as well as consolidating the log messages.
The main benefit is consolidating the "should run this recovery pass"
logic, as well as cleaning up the "this recovery pass has finished"
state; instead of having a bunch of ad-hoc state bits in c->flags, we've
now got c->curr_recovery_pass.
By consolidating the "should run this recovery pass" logic, in the
future on disk format upgrades will be able to say "upgrading to this
version requires x passes to run", instead of forcing all of fsck to
run.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- endianness fixes
- mark some things static
- fix a few __percpu annotations
- fix silent enum conversions
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Some refactoring, prep work for algorithm improvements related to
snapshots.
we need to add a bitmap to the list of inodes for "seen this snapshot";
for this bitmap to correctly be available, we'll need to gather the list
of inodes first, and later look up the inode for a given snapshot.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We now print out the full previous extent we overlapping with, to aid in
debugging and searching through the journal.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add two new helpers for printing error messages with __func__ and
bch2_err_str():
- bch_err_fn
- bch_err_msg
Also kill the old error strings in the recovery path, which were causing
us to incorrectly report memory allocation failures - they're not needed
anymore.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
As with previous conversions, replace -ENOENT uses with more informative
private error codes.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds a new btree which gets us a persistent per-snapshot-tree
identifier.
- BTREE_ID_snapshot_trees
- KEY_TYPE_snapshot_tree
- bch_snapshot now has a field that points to a snapshot_tree
This is going to be used to designate one snapshot ID/subvolume out of a
given tree of snapshots as the "main" subvolume, so that we can do quota
accounting in that subvolume and not the rest.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
It's safe to call bch2_trans_update with a k/v pair where the value
hasn't been filled out, as long as the key part has been and the value
is filled out by transaction commit time.
This patch folds the bch2_trans_update() call into bch2_bkey_make_mut(),
eliminating a bit of boilerplate.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Introduce new helpers for a common pattern:
bch2_trans_iter_init();
bch2_btree_iter_peek_slot();
- bch2_bkey_get_iter_type() returns -ENOENT if it doesn't find a key of
the correct type
- bch2_bkey_get_val_typed() copies the val out of the btree to a
(typically stack allocated) variable; it handles the case where the
value in the btree is smaller than the current version of the type,
zeroing out the remainder.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We had some bugs with setting/using first_this_inode in the inode
walker in the dirents/xattr code.
This patch changes to not clear first_this_inode until after
initializing the new hash info.
Also, we fix an error message to not print on transaction restart, and
add a comment to related fsck error code.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
It turns out, in rare situations we need to be passing in a disk
reservation, which will be used internally by the transaction commit
path when needed. Pass one in...
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds private error codes for most (but not all) of our ENOMEM uses,
which makes it easier to track down assorted allocation failures.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>