This adds a small (64 bit) per-device bitmap that tracks ranges that
have btree nodes, for accelerating btree node scan if it is ever needed.
- New helpers, bch2_dev_btree_bitmap_marked() and
bch2_dev_bitmap_mark(), for checking and updating the bitmap
- Interior btree update path updates the bitmaps when required
- The check_allocations pass has a new fsck_err check,
btree_bitmap_not_marked
- New on-disk format version, mi_btree_bitmap, which indicates the new
bitmap is present
- Upgrade table lists the required recovery pass and expected fsck error
- Btree node scan uses the bitmap to skip ranges if we're on the new
version
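A hedged sketch of the idea follows - names and rounding here are illustrative, not the actual bcachefs helpers. Each of the 64 bits covers one 1/64th slice of the device, so marking or checking a range is just a loop over the slices it touches:

  #include <stdbool.h>
  #include <stdint.h>

  struct dev_btree_bitmap {
      uint64_t bits;           /* one bit per 1/64th of the device */
      uint64_t dev_sectors;    /* device size, in sectors */
  };

  static inline unsigned range_bit(const struct dev_btree_bitmap *bm, uint64_t sector)
  {
      return (unsigned) (sector * 64 / bm->dev_sectors);  /* slice index, 0..63 */
  }

  /* Mark [start, end) as possibly containing btree nodes (assumes start < end): */
  static inline void dev_bitmap_mark(struct dev_btree_bitmap *bm,
                                     uint64_t start, uint64_t end)
  {
      for (unsigned i = range_bit(bm, start); i <= range_bit(bm, end - 1); i++)
          bm->bits |= 1ULL << i;
  }

  /* Check whether every slice overlapping [start, end) is marked: */
  static inline bool dev_bitmap_marked(const struct dev_btree_bitmap *bm,
                                       uint64_t start, uint64_t end)
  {
      for (unsigned i = range_bit(bm, start); i <= range_bit(bm, end - 1); i++)
          if (!(bm->bits & (1ULL << i)))
              return false;
      return true;
  }

Btree node scan can then skip any slice whose bit is clear, instead of reading the whole device.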
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
There's been a bug in the btree write buffer where it wasn't triggering
btree node merges - and leaving behind a bunch of nearly empty btree
nodes.
Then during journal replay, when updates to the backpointers btree
aren't using the btree write buffer (because we require synchronization
with journal replay), we end up doing those merges all at once.
Then if it's the interior update path running them, we deadlock because
those run with the highest watermark.
There's no real need for the interior update path to be doing btree node
merges; other code paths can handle that at lower watermarks.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes a deadlock where the interior update path during journal
replay ends up doing a ton of merges on the backpointers btree, and
deadlocking.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
It turns out - btree splits happen with the rest of the transaction
still locked, to avoid unnecessary restarts, which means using nofail
doesn't work here - we can deadlock.
Fortunately, we now have the ability to return errors here.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
One btree update might have terminated in a node update, and then while
it is in flight another btree update might free that original node.
This race has to be handled in btree_update_nodes_written() - we were
missing a READ_ONCE().
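As a minimal sketch (hypothetical structure and helper names, not the actual btree_update_nodes_written() code), the fix is of this shape - the racy field gets read once, through READ_ONCE(), so we work with one stable snapshot rather than a value the compiler may reload:

  #include <linux/compiler.h>

  struct btree;                                /* opaque here */
  void handle_written_node(struct btree *);    /* hypothetical helper */

  struct pending_update {                      /* hypothetical */
      struct btree *b;                         /* may be cleared/freed concurrently */
  };

  static void nodes_written_sketch(struct pending_update *as)
  {
      struct btree *b = READ_ONCE(as->b);      /* single, stable load */

      if (b)
          handle_written_node(b);
  }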
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
"bcachefs; Fix deadlock in bch2_btree_update_start()" was a significant
performance regression (nearly 50%) on multithreaded random writes with
fio.
The reason is that the journal watermark checks multiple things,
including the state of the btree write buffer, and on multithreaded
update-heavy workloads we're bottlenecked on write buffer flushing - we
don't want kicking off btree updates to depend on the state of the write
buffer.
This isn't strictly correct; the interior btree update path does do
write buffer updates, but it's a tiny fraction of total accounting
updates and we're more concerned with space in the journal itself.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds a new watermark, higher priority than BCH_WATERMARK_reclaim,
for interior btree updates. We've seen a deadlock where journal replay
triggers a ton of btree node merges, and these use up all available open
buckets and then interior updates get stuck.
One cause of this is that we're currently lacking btree node merging on
write buffer btrees - that needs to be fixed as well.
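Purely as an illustration of the priority ordering (these are not the real BCH_WATERMARK_* definitions), the idea is that an allocation made at a higher watermark may dip into reserves that lower watermarks cannot, so the path everything else depends on can always make progress:

  enum example_watermark {
      EXAMPLE_WATERMARK_normal,
      EXAMPLE_WATERMARK_copygc,
      EXAMPLE_WATERMARK_btree,
      EXAMPLE_WATERMARK_reclaim,
      EXAMPLE_WATERMARK_interior_updates,      /* new: highest priority */
  };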
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We've grown a fair amount of code for managing recovery passes: tracking
which ones we're running, which ones need to be run, and flagging in the
superblock which ones need to be run on the next recovery.
So it's worth splitting it out into its own file; this code is pretty
different from the code in recovery.c.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When we haven't yet allocated any btree nodes for a given btree, we
first need to call the regular split path to allocate one.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When dropping keys now outside a node because we're changing the node
min/max, we need to redo the node's accounting as well.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Consolidate bch2_gc_check_topology() and btree_node_interior_verify(),
and replace them with an improved version,
bch2_btree_node_check_topology().
This checks that children of an interior node correctly span the full
range of the parent node with no overlaps.
Also, ensure that topology repairs at runtime are always a fatal error;
in particular, this adds a check in btree_iter_down() - if we don't find
a key while walking down the btree, that's indicative of a topology error
and should be flagged as such, not a null ptr deref.
Some checks in btree_update_interior.c remain BUG_ON()s, because we
already checked the node for topology errors when starting the update,
and the assertions indicate that we _just_ corrupted the btree node -
i.e. the problem can't be pre-existing on-disk corruption; they
indicate an actual algorithmic bug.
In the future, we'll be annotating the fsck errors list with which
recovery pass corrects them; the open coded "run explicit recovery pass
or fatal error" in bch2_btree_node_check_topology() will in the future
be done for every fsck_err() call.
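A sketch of the invariant being checked (helper names here are hypothetical, not the actual bch2_btree_node_check_topology() code): the children must tile the parent's [min_key, max_key] range exactly, with no gaps and no overlaps:

  static int check_topology_sketch(struct btree *b)
  {
      struct bpos expect = b->data->min_key;

      for (unsigned i = 0; i < nr_children(b); i++) {     /* hypothetical iterator */
          struct bkey_s_c k = child_key(b, i);            /* hypothetical accessor */

          if (!bpos_eq(child_min_key(k), expect))         /* gap or overlap */
              return -1;

          expect = bpos_successor(child_max_key(k));
      }

      /* the last child must end exactly at the node's max key: */
      return bpos_eq(expect, bpos_successor(b->data->max_key)) ? 0 : -1;
  }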
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
BCH_TRANS_COMMIT_journal_reclaim with watermark != BCH_WATERMARK_reclaim
means nonblocking, and we need the journal_res_get() in
btree_update_start() to respect that.
In a future refactoring we'll be deleting
BCH_TRANS_COMMIT_journal_reclaim and replacing it with an explicit
BCH_TRANS_COMMIT_nonblocking.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes a deadlock due to using btree_interior_update_worker for
non-interior updates - async btree node rewrites were blocking, and then
blocking other interior updates.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This converts -EIOs related to btree node errors to private error codes,
which will help with some ongoing debugging by giving us better error
messages.
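Roughly, the pattern looks like this (a toy illustration with made-up names, not the actual bcachefs error-code machinery): internally, distinct named codes identify exactly which failure occurred, and they're flattened back to -EIO at the boundary to generic kernel code:

  #include <linux/errno.h>

  enum {
      ERR_PRIVATE_START = 2048,        /* above any standard errno value */
      ERR_btree_node_read_err,         /* hypothetical names */
      ERR_btree_node_bad_seq,
  };

  static inline int err_to_errno(int err)
  {
      /* private codes stay inside the filesystem; callers see -EIO */
      return err <= -ERR_PRIVATE_START ? -EIO : err;
  }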
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
WQ_UNBOUND with max_active 1 means ordered workqueue, but we don't
actually need or want ordered semantics - and probably want a higher
concurrency limit anyways.
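For illustration (the workqueue name and concurrency limit here are made up), the change is of this shape:

  #include <linux/workqueue.h>

  static struct workqueue_struct *wq;

  static int init_wq_sketch(void)
  {
      /* before: WQ_UNBOUND + max_active == 1 is effectively ordered */
      /* wq = alloc_workqueue("example_wq", WQ_UNBOUND, 1); */

      /* after: unbound, with a higher concurrency limit */
      wq = alloc_workqueue("example_wq", WQ_UNBOUND, 8);
      return wq ? 0 : -ENOMEM;
  }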
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When a btree root is unreadable, we still might be able to get some data
back by replaying what's in the journal. Previously though, we got
confused when journal replay would attempt to replay a key for a level
that didn't exist.
This adds bch2_btree_increase_depth(), so that journal replay can handle
this.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This prevents going emergency read only when the user has specified
replicas_required > replicas.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bcachefs btree nodes are big - typically 256k - and btree roots are
pinned in memory. As we're now up to 18 btrees, we have significant
memory overhead in mostly empty btree roots.
And in the future we're going to start enforcing that certain btree node
boundaries exist, to solve lock contention issues - analogous to XFS's
AGIs.
Thus, we need to start allocating smaller btree node buffers when we
can. This patch changes code that refers to the filesystem constant
c->opts.btree_node_size to refer to the btree node buffer size -
btree_buf_bytes() - where appropriate.
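A hedged sketch of what such a helper looks like (the field name is an assumption, not necessarily the real definition) - the point is that the buffer size becomes a per-node property instead of a filesystem-wide constant:

  static inline size_t btree_buf_bytes(const struct btree *b)
  {
      return 1UL << b->byte_order;     /* log2 of this node's buffer size */
  }

Callers then use btree_buf_bytes(b) where they previously used c->opts.btree_node_size.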
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Prep work for disk space accounting rewrite: we're going to want to use
a single callback for both of our current triggers, so we need to change
them to have the same type signature first.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Since the btree_paths array is now about to become growable, we have to
be careful not to refer to paths by pointer across contexts where they
may be reallocated.
This fixes the remaining btree_interior_update() paths - split and
merge.
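The pattern (helper names here are hypothetical) is to hold on to the index, which stays valid, and re-derive the pointer after anything that might grow trans->paths:

  static int example(struct btree_trans *trans, btree_path_idx_t path_idx)
  {
      int ret = do_something_that_may_realloc_paths(trans);   /* hypothetical */
      if (ret)
          return ret;

      struct btree_path *path = trans->paths + path_idx;      /* re-derive pointer */
      return use_path(trans, path);                           /* hypothetical */
  }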
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
path->idx is now a code smell: we should be using path_idx_t, since it's
stable across btree path reallocation.
This is also a bit faster, using the same loop counter vs. fetching
path->idx from each path we iterate over.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The previous patch fixed a bug in allocation path error handling, and it
would've been noticed sooner had it been logged properly.
Generally speaking, errors that shouldn't happen in normal operation and
are being returned up the stack should be logged: the write path was
already logging IO errors, but non IO errors were missed.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In the future we'll be making trans->paths resizable and potentially
having _many_ more paths (for fsck); we need to start fixing algorithms
that walk each path in a transaction where possible.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Instead of using a darray, we now allocate journal entries for the
transaction commit path with our normal bump allocator - with an inlined
fastpath, and using btree_transaction_stats to remember how much to
initially allocate so as to avoid transaction restarts.
This is prep work for converting write buffer updates to use this
mechanism.
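A toy bump allocator with an inlined fastpath, for illustration only (this is not the bcachefs implementation): the common case is a pointer bump into a preallocated buffer, and a stats-driven initial size keeps the slowpath - and any transaction restart it implies - rare:

  struct bump_alloc {
      char   *buf;
      size_t used;
      size_t size;
  };

  void *bump_alloc_slowpath(struct bump_alloc *a, size_t bytes);   /* grows buf */

  static inline void *bump_alloc(struct bump_alloc *a, size_t bytes)
  {
      if (a->used + bytes <= a->size) {        /* fastpath: just bump the offset */
          void *p = a->buf + a->used;
          a->used += bytes;
          return p;
      }
      return bump_alloc_slowpath(a, bytes);
  }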
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Now we can print out filesystem flags in sysfs, useful for debugging
various "what's my filesystem doing" issues.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>