Journal replay now first attempts to replay keys in sorted order,
similar to how the btree write buffer flush path works.
Any keys that cannot be replayed due to journal deadlock are then left
for later and replayed in journal order, unpinning journal entries as we
go.
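A sketch of the two-pass structure (helper and error names here are
illustrative, not the actual ones in the tree):

  /* Pass 1: nonblocking, in sorted (btree) order: */
  sort(keys->data, keys->nr, sizeof(keys->data[0]), btree_pos_cmp, NULL);

  darray_for_each(*keys, k)
          if (replay_key(trans, k, NONBLOCKING) == -EWOULDBLOCK)
                  darray_push(&deferred, k);      /* would deadlock */

  /* Pass 2: deferred keys in journal order, blocking, unpinning
   * journal entries as we go: */
  sort(deferred.data, deferred.nr, sizeof(deferred.data[0]),
       journal_seq_cmp, NULL);

  darray_for_each(deferred, k) {
          ret = replay_key(trans, *k, BLOCKING);
          if (ret)
                  return ret;
          bch2_journal_pin_put(j, (*k)->journal_seq);
  }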
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This gets us slightly nicer log messages.
Also, this clarifies synchronization of c->journal_keys: after we go RW
it's in use by multiple threads (so that the btree iterator code can
overlay keys from the journal), which means it has to be prepped before
that point.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
With the previous patch that reworks BTREE_INSERT_JOURNAL_REPLAY, we can
now switch the btree write buffer to use it for flushing.
This has the advantage that transaction commits don't need to take a
journal reservation at all.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This slightly changes how trans->journal_res works, in preparation for
changing the btree write buffer flush path to use it.
Now, BTREE_INSERT_JOURNAL_REPLAY means "don't take a journal
reservation; trans->journal_res.seq already refers to the journal
sequence number to pin".
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The upcoming btree write buffer rework is going to use the journal
itself as the first stage of the write buffer; this is a cleanup to make
sure k->needs_whiteout is initialized before keys hit the journal.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This introduces a new helper for connecting time_stats to state
changes, e.g. tracking the time during which taking journal
reservations is blocked for some reason.
We use this to track separately the different reasons the journal might
be blocked - e.g. the journal being out of space, or the journal pin
fifo being full.
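The helper is essentially edge detection on a boolean state, feeding
the time spent in the "blocked" state into time_stats; a sketch, close
to the shape of the new helper:

  /* returns true on a false -> true transition: */
  static inline bool track_event_change(struct bch2_time_stats *stats,
                                        u64 *start, bool v)
  {
          if (v != !!*start) {
                  if (!v) {
                          bch2_time_stats_update(stats, *start);
                          *start = 0;
                  } else {
                          *start = local_clock() ?: 1;
                          return true;
                  }
          }
          return false;
  }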
Also do some cleanup and improvements on the time stats code.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, bch2_journal_pin_set() would silently ignore a request to
pin a journal sequence number that was no longer dirty, because it was
used internally by bch2_journal_pin_copy() which could race with the src
pin being flushed.
Split these apart so that we can properly assert that @seq is a
currently dirty journal sequence number - pinning one that isn't is
almost always a bug.
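A sketch of the resulting split (journal_pin_set_locked() is an
illustrative internal helper):

  void bch2_journal_pin_set(struct journal *j, u64 seq,
                            struct journal_entry_pin *pin,
                            journal_pin_flush_fn flush_fn)
  {
          spin_lock(&j->lock);
          /* @seq must still be dirty - pinning anything else is a bug: */
          BUG_ON(seq < journal_last_seq(j));
          journal_pin_set_locked(j, seq, pin, flush_fn);
          spin_unlock(&j->lock);
  }

  void bch2_journal_pin_copy(struct journal *j,
                             struct journal_entry_pin *dst,
                             struct journal_entry_pin *src,
                             journal_pin_flush_fn flush_fn)
  {
          spin_lock(&j->lock);

          u64 seq = src->seq;
          if (seq < journal_last_seq(j)) {
                  /* raced with src being flushed - nothing to copy: */
                  spin_unlock(&j->lock);
                  return;
          }

          journal_pin_set_locked(j, seq, dst, flush_fn);
          spin_unlock(&j->lock);
  }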
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In an ideal world, we'd have a common helper that could be used for
sorting a list of inodes into the correct lock order, and then the same
lock ordering could be used for any type of inode lock, not just
i_rwsem.
But the lock ordering rules for i_rwsem are a bit complicated, so -
abandon that dream for now and do it the more standard way.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Also log time waiting for c->writes references to be dropped; this will
help in debugging why unmounts are taking longer than they should.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
It's confusing if we run fsck a second time (in debug mode, to verify
the second run is clean), but errors are still ratelimited from the
first run.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add checks to all the VFS paths for "are we in a RO snapshot?".
Note - we don't check this when setting inode options via our xattr
interface, since those generally only affect data placement, not
contents of data.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Reported-by: "Carl E. Thompson" <list-bcachefs@carlthompson.net>
Add a new superblock section that contains a list of
{ minor version, recovery passes, errors_to_fix }
that is, a list of recovery passes that must be run when downgrading
past a given version, and a list of errors to silently fix.
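Roughly (field widths illustrative):

  struct bch_sb_field_downgrade_entry {
          __le16          version;
          __le64          recovery_passes[2];     /* passes to run */
          __le16          nr_errors;
          __le16          errors[];               /* errors to silently fix */
  } __packed;

  struct bch_sb_field_downgrade {
          struct bch_sb_field                     field;
          struct bch_sb_field_downgrade_entry    entries[];
  };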
The upcoming disk accounting rewrite is not going to be fully
compatible: we're going to have to regenerate accounting both when
upgrading to the new version and when downgrading from it, since the
new method of doing disk space accounting is a completely different
architecture based on deltas, and synchronizing them on every journal
entry write to maintain compatibility would be too expensive and
impractical.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add two new superblock fields. Since the main section of the superblock
is now full, we have to add a new variable length section for them -
bch_sb_field_ext.
- recovery_passes_required: recovery passes that must be run on the
next mount
- errors_silent: errors that will be silently fixed
These are to improve upgrading and downgrading: these fields won't be
cleared until after recovery successfully completes, so there won't be
any issues with crashing partway through an upgrade or a downgrade.
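The new section is essentially just a pair of bitmaps; roughly:

  struct bch_sb_field_ext {
          struct bch_sb_field     field;
          __le64                  recovery_passes_required[2];
          __le64                  errors_silent[8];
  };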
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The next patch will start to refer to recovery passes from the
superblock; naturally, we now need identifiers that don't change, since
the existing enum is in the order in which they are run and is not
fixed.
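One way to do this, following how the pass list is already generated:
give each pass an explicit id that is never reused, and emit a second,
stable enum from it (the entries shown are illustrative):

  #define BCH_RECOVERY_PASSES()                           \
          x(alloc_read,                 0, PASS_ALWAYS)   \
          x(stripes_read,               1, PASS_ALWAYS)   \
          x(initialize_subvolumes,      2, PASS_FSCK)

  /* stable identifiers, safe to store in the superblock: */
  enum bch_recovery_pass_stable {
  #define x(n, id, when)  BCH_RECOVERY_PASS_STABLE_##n = id,
          BCH_RECOVERY_PASSES()
  #undef x
  };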
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
BCH_REPLICAS_MAX isn't the actual maximum number of pointers in an
extent; it's the maximum number of dirty pointers.
We don't have a real restriction on the number of cached pointers, and
we don't want a fixed size array here anyways - so switch to
DARRAY_PREALLOCATED().
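Roughly:

  /* a darray with inline storage for the first _nr elements: */
  #define DARRAY_PREALLOCATED(_type, _nr)         \
  struct {                                        \
          size_t nr, size;                        \
          _type *data;                            \
          _type preallocated[_nr];                \
  }

so the common case never allocates, and extents with pathological
numbers of cached pointers still work.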
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Reported-and-tested-by: Daniel J Blueman <daniel@quora.org>
We sometimes use darrays for quite large buffers - the btree write
buffer in particular must be sized to hold all the write buffer keys
outstanding in the journal.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Move the slowpath (actually growing the darray) to an out-of-line
function; also, add some helpers for the upcoming btree write buffer
rewrite.
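A sketch of the out-of-line slowpath (details approximate):

  int __bch2_darray_resize(darray_char *d, size_t element_size,
                           size_t new_size, gfp_t gfp)
  {
          if (new_size > d->size) {
                  new_size = roundup_pow_of_two(new_size);

                  void *data = kvmalloc_array(new_size, element_size, gfp);
                  if (!data)
                          return -ENOMEM;

                  if (d->size)
                          memcpy(data, d->data, d->size * element_size);
                  if (d->data != d->preallocated)
                          kvfree(d->data);
                  d->data = data;
                  d->size = new_size;
          }

          return 0;
  }

The inline fast path then only calls this when d->nr + more > d->size.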
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If a superblock write hasn't happened (i.e. we never had to go rw), then
c->sb.version will be out of date w.r.t. c->disk_sb.sb->version.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Turns out iterate_iovec() mutates __iov; we need to save our own copy.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Reported-by: Marcin Mirosław <marcin@mejor.pl>
peek_upto() checks against the end position and bails out before the
FILTER_SNAPSHOTS checks; this is because if we end up at a different
inode number than the original search key, it may be that none of the
keys we see are visible in the current snapshot - we might be looking
at an inode in a completely different subvolume.
But this is broken, because when we're iterating over extents we're
checking against the extent start position to decide when to bail out,
and the extent start position isn't monotonically increasing until after
we've run FILTER_SNAPSHOTS.
Fix this by adding a simple inode number check where the old bailout
check was, and moving the main check to the correct position.
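Conceptually (simplified):

  /*
   * Before FILTER_SNAPSHOTS, extent start positions aren't monotonic
   * within an inode - only bail if we've left the inode entirely:
   */
  if (iter->pos.inode > end.inode)
          goto end;

  /* ... FILTER_SNAPSHOTS runs here ... */

  /* now positions are monotonic again, so the full check is safe: */
  if (bkey_gt(bkey_start_pos(k.k), end))
          goto end;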
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Reported-by: "Carl E. Thompson" <list-bcachefs@carlthompson.net>
The recent work to fix data moves w.r.t. durability broke promotes,
because it caused us to bail out when the extent, minus the pointers
being dropped, still has enough pointers to satisfy the current number
of replicas.
Disable this check when we're adding cached replicas.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When overwriting and splitting existing extents, we weren't correctly
accounting for a 3 way split of a compressed extent.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When we fail to allocate because of insufficient open buckets, we don't
want to retry from the full set of devices - we just want to retry in
blocking mode.
But if the retry in blocking mode fails with a different error code, we
end up squashing the -BCH_ERR_open_buckets_empty error with an error
that makes us think we won't be able to allocate (insufficient_devices)
- which is incorrect when we didn't try to allocate from the full set of
devices, and causes the write to fail.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We need to help modprobe load architecture-specific modules so we don't
fall back to generic software implementations; this should help
performance when building as a module.
Signed-off-by: Daniel Hill <daniel@gluo.nz>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When bch2_fs_alloc() gets an error before calling
bch2_fs_btree_iter_init(), bch2_fs_btree_iter_exit() makes an invalid
memory access because btree_trans_list is uninitialized.
Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com>
Fixes: 6bd68ec266 ("bcachefs: Heap allocate btree_trans")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The ->encode_fh method is responsible for setting the amount of space
required for storing the file handle if not enough space was provided.
bch2_encode_fh() was not setting the required length in that case, which
breaks e.g. fanotify. Fix it.
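The missing piece, roughly (min_len being the handle size, in u32
units, for this inode):

  if (*len < min_len) {
          *len = min_len;         /* report how much space is needed */
          return FILEID_INVALID;  /* from <linux/exportfs.h> */
  }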
Reported-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
On trylock failure we were waiting for outstanding reads to complete -
but nocow locks need to be held until the whole move is finished.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Since outstanding journal buffers hold a journal pin, when flushing all
pins we need to close the current journal entry if necessary so its pin
can be released.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We could delete directories transactionally on rmdir()/unlink(), but we
don't; instead, like with regular files we wait for the VFS to call
evict().
That means that our check for directories in the deleted inodes btree is
wrong - the check should be for non-empty directories.
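That is, roughly (dir_is_empty() is an illustrative stand-in for the
actual helper):

  /* before: any directory here was flagged as an error */
  if (S_ISDIR(inode.bi_mode))
          goto fsck_err;

  /* after: only a non-empty directory is an inconsistency */
  if (S_ISDIR(inode.bi_mode) && !dir_is_empty(trans, inum))
          goto fsck_err;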
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes a bug where rebalance would loop repeatedly on the same
extents.
Signed-off-by: Daniel Hill <daniel@gluo.nz>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The internal freeze mechanism in bcachefs mostly reuses the generic
rw<->ro transition code. If the fs happens to shutdown during or
after freeze, a transition back to rw can fail. This is expected,
but returning an error from the unfreeze callout prevents the
filesystem from being unfrozen.
Skip the read write transition if the fs is shutdown. This allows
the fs to unfreeze at the vfs level so writes will no longer block,
but will still fail due to the emergency read-only state of the fs.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When creating a snapshot without specifying the source subvolume, we use
the subvolume containing the new snapshot.
Previously, this worked if the directory containing the new snapshot was
the subvolume root - but we were using the incorrect helper, and got a
subvolume ID of 0 when the parent directory wasn't the root of the
subvolume, causing an emergency read-only.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes a transaction path overflow reported in the snapshot deletion
path, when moving extents to the correct snapshot.
The root of the issue is that creating/deleting a reflink pointer can
generate an unbounded number of updates, if it is allowed to reference
an unbounded number of indirect extents; to prevent this, merging of
reflink pointers has been disabled.
But there's a hole, which is that copygc/rebalance may fragment existing
extents in the course of moving them around, and if an indirect extent
becomes too fragmented we'll then become unable to delete the reflink
pointer.
The eventual solution is going to be to tweak trigger handling so that
we can process large reflink pointers incrementally when necessary, and
notice that trigger updates don't need to be run for the part of the
reflink pointer not changing. That is going to be a bigger project
though, for another patch.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
for_each_btree_key2() runs each loop iteration in a btree transaction,
and thus does not cause SRCU lock hold time problems.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>