Commit Graph

112 Commits

Author SHA1 Message Date
Kent Overstreet
275c8426fb bcachefs: Add rw to .key_invalid()
This adds a new parameter to .key_invalid() methods for whether the key
is being read or written; the idea being that methods can do more
aggressive checks when a key is newly created and being written, when we
wouldn't want to delete the key because of those checks.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:30 -04:00
Kent Overstreet
f0ac7df23d bcachefs: Convert .key_invalid methods to printbufs
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:30 -04:00
Kent Overstreet
c6b2826cd1 bcachefs: Freespace, need_discard btrees
This adds two new btrees for the upcoming allocator rewrite: an extents
btree of free buckets, and a btree for buckets awaiting discards.

We also add a new trigger for alloc keys to keep the new btrees up to
date, and a compatibility path to initialize them on existing
filesystems.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:29 -04:00
Kent Overstreet
3756111d13 bcachefs: Add printf format attribute to bch2_pr_buf()
This tells the compiler to check printf format strings, and catches a
few bugs.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:28 -04:00
Kent Overstreet
74b33393db bcachefs: x-macro metadata version enum
Now we've got strings for metadata versions - this changes
bch2_sb_to_text() and our mount log message to use it.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:28 -04:00
Kent Overstreet
cc23255e9a bcachefs: Add a missing wakeup
This fixes a rare bug with bch2_btree_flush_all_writes() getting stuck.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:27 -04:00
Kent Overstreet
3098553776 bcachefs: Fix usage of six lock's percpu mode
Six locks have a percpu mode, which we use for interior btree nodes, as
well as btree key cache keys for the subvolumes btree. We've been
switching locks back and forth between percpu and non percpu mode as
needed, but it turns out this is racy - when we're reusing an existing
node, other threads could be attempting to lock it while we're switching
it between modes.

This patch fixes this by never switching 'struct btree' between the two
modes, and instead segragating them between two different freed lists.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:26 -04:00
Kent Overstreet
bf3efff5e4 bcachefs: Fix race leading to btree node write getting stuck
Checking btree_node_may_write() isn't atomic with the other btree flags,
dirty and need_write in particular. There was a rare race where we'd
unblock a node from writing while __btree_node_flush() was setting
need_write, and no thread would notice that the node was now both able
to write and needed to be written.

Fix this by adding btree node flags for will_make_reachable and
write_blocked that can be checked in the cmpxchg loop in
__bch2_btree_node_write.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:26 -04:00
Kent Overstreet
82732ef510 bcachefs: Improve btree_node_write_if_need()
btree_node_write_if_need() kicks off a btree node write only if
need_write is set; this makes the locking easier to reason about by
moving the check into the cmpxchg loop in __bch2_btree_node_write().

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:26 -04:00
Kent Overstreet
39dcace838 bcachefs: Fix locking in btree_node_write_done()
There was a rare recursive locking bug, in __bch2_btree_node_write()
nowrite path -> btree_node_write_done(), in the path that kicks off
another write.

This splits out an inner __btree_node_write_done() that expects to be
run with the btree node lock held.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:26 -04:00
Kent Overstreet
75ef2c59bc bcachefs: Start moving debug info from sysfs to debugfs
In sysfs, files can only output at most PAGE_SIZE. This is a problem for
debug info that needs to list an arbitrary number of times, and because
of this limit some of our debug info has been terser and harder to read
than we'd like.

This patch moves info about journal pins and cached btree nodes to
debugfs, and greatly expands and improves the output we return.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:26 -04:00
Kent Overstreet
55334d7897 bcachefs: Kill BCH_FS_HOLD_BTREE_WRITES
This was just dead code.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:25 -04:00
Kent Overstreet
fa8e94faee bcachefs: Heap allocate printbufs
This patch changes printbufs dynamically allocate and reallocate a
buffer as needed. Stack usage has become a bit of a problem, and a major
cause of that has been static size string buffers on the stack.

The most involved part of this refactoring is that printbufs must now be
exited with printbuf_exit().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:25 -04:00
Kent Overstreet
78a8f36280 bcachefs: Improve some btree node read error messages
On btree node read error, it's helpful to see what we were trying to
read - was it all zeroes?

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:25 -04:00
Kent Overstreet
a9de137bf6 bcachefs: Check for errors from crypto_skcipher_encrypt()
Apparently it actually is possible for crypto_skcipher_encrypt() to
return an error - not sure why that would be - but we need to replace
our assertion with actual error handling.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:24 -04:00
Kent Overstreet
03ea3962ab bcachefs: Log & error message improvements
- Add a shim uuid_unparse_lower() in the kernel, since %pU doesn't work
   in userspace

 - We don't need to print the bcachefs: or the filesystem name prefix in
   userspace

 - Improve a few error messages

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:21 -04:00
Kent Overstreet
8244f3209b bcachefs: Option improvements
This adds flags for options that must be a power of two (block size and
btree node size), and options that are stored in the superblock as a
power of two (encoded extent max).

Also: options are now stored in memory in the same units they're
displayed in (bytes): we now convert when getting and setting from the
superblock.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:19 -04:00
Kent Overstreet
62d5bd955f bcachefs: Kill bch2_sort_repack_merge()
The main function of bch2_sort_repack_merge() was to call .key_normalize
on every key, which drops stale (cached) pointers - it hasn't actually
merged extents in quite some time.

But bch2_gc_gens() now works on individual keys - we used to gc old gens
by rewriting entire btree nodes. With that gone, there's no need for
internal btree code to be calling .key_normalize anymore.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:19 -04:00
Kent Overstreet
2a863c6c80 bcachefs: Fix debug build in userspace
This fixes some compiler warnings that only trigger in userspace - dead
code, a maybe uninitialed variable, a maybe null ptr passed to printk.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:18 -04:00
Kent Overstreet
c79272d1e4 bcachefs: Fix some compiler warnings
gcc couldn't always deduce that written wasn't used uninitialized

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:12 -04:00
Kent Overstreet
f7a966a3e2 bcachefs: Clean up/rename bch2_trans_node_* fns
These utility functions are for managing btree node state within a
btree_trans - rename them for consistency, and drop some unneeded
arguments.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:11 -04:00
Kent Overstreet
9f6bd30703 bcachefs: Reduce iter->trans usage
Disfavoured, and should go away.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:10 -04:00
Kent Overstreet
e719fc34f0 bcachefs: BSET_OFFSET()
Add a field to struct bset for the sector offset within the btree node
where it was written.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:09 -04:00
Kent Overstreet
9f1833cadd bcachefs: Update btree ptrs after every write
This closes a significant hole (and last known hole) in our ability to
verify metadata. Previously, since btree nodes are log structured, we
couldn't detect lost btree writes that weren't the first write to a
given node. Additionally, this seems to have lead to some significant
metadata corruption on multi device filesystems with metadata
replication: since a write may have made it to one device and not
another, if we read that btree node back from the replica that did have
that write and started appending after that point, the other replica
would have a gap in the bset entries and reading from that replica
wouldn't find the rest of the bsets.

But, since updates to interior btree nodes are now journalled, we can
close this hole by updating pointers to btree nodes after every write
with the currently written number of sectors, without negatively
affecting performance. This means we will always detect lost or corrupt
metadata - it also means that our btree is now a curious hybrid of COW
and non COW btrees, with all the benefits of both (excluding
complexity).

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:08 -04:00
Kent Overstreet
0a70089062 bcachefs: Kick off btree node writes from write completions
This is a performance improvement by removing the need to wait for the
in flight btree write to complete before kicking one off, which is going
to be needed to avoid a performance regression with the upcoming patch
to update btree ptrs after every btree write.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:08 -04:00
Kent Overstreet
19d5432445 bcachefs: Really don't hold btree locks while btree IOs are in flight
This is something we've attempted to stick to for quite some time, as it
helps guarantee filesystem latency - but there's a few remaining paths
that this patch fixes.

This is also necessary for an upcoming patch to update btree pointers
after every btree write - since the btree write completion path will now
be doing btree operations.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:08 -04:00
Kent Overstreet
e3a67bdb6e bcachefs: Regularize argument passing of btree_trans
btree_trans should always be passed when we have one - iter->trans is
disfavoured. This mainly updates old code in btree_update_interior.c,
some of which predates btree_trans.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:08 -04:00
Kent Overstreet
50ad5d0979 bcachefs: Fix btree_node_read_all_replicas() error handling
We weren't checking bch2_btree_node_read_done() for errors, oops.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:07 -04:00
Kent Overstreet
ee7570546e bcachefs: Fix a deadlock
Waiting on a btree node write with btree locks held can deadlock, if the
write errors: the write error path has to do do a btree update to drop
the pointer to the replica that errored.

The interior update path has to wait on in flight btree writes before
freeing nodes on disk. Previously, this was done in
bch2_btree_interior_update_will_free_node(), and could deadlock; now, we
just stash a pointer to the node and do it in
btree_update_nodes_written(), just prior to the transactional part of
the update.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:05 -04:00
Kent Overstreet
9f2772c454 bcachefs: Split out btree_error_wq
We can't use btree_update_wq becuase btree updates may be waiting on
btree writes to complete.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:04 -04:00
Kent Overstreet
9dd89a05fd bcachefs: Fix an issue with inconsistent btree writes after unclean shutdown
After unclean shutdown, btree writes may have completed on one device
and not others - and this inconsistency could lead us to writing new
bsets with a gap in our btree node in one of our replicas.

Fortunately, this is only an issue with bsets that are newer than the
most recent journal flush, and we already have a mechanism for detecting
and blacklisting those. We just need to make sure to start new btree
writes after the most recent _non_ blacklisted bset.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:04 -04:00
Kent Overstreet
731bdd2eff bcachefs: Add a workqueue for btree io completions
Also, clean up workqueue usage - we shouldn't be using system
workqueues, pretty much everything we do needs to be on our own
WQ_MEM_RECLAIM workqueues.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:04 -04:00
Kent Overstreet
1ce0cf5fe9 bcachefs: Add a debug mode that always reads from every btree replica
There's a new module parameter, verify_all_btree_replicas, that enables
reading from every btree replica when reading in btree nodes and
comparing them against each other. We've been seeing some strange btree
corruption - this will hopefully aid in tracking it down and catching it
more often.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:04 -04:00
Dan Robertson
5bc38f44fa bcachefs: Fix oob write in __bch2_btree_node_write
Fix a possible out of bounds write in __bch2_btree_node_write when
the data buffer padding is cleared up to the block size. The out of
bounds write is possible if the data buffers size is not a multiple
of the block size.

Signed-off-by: Dan Robertson <dan@dlrobertson.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:03 -04:00
Kent Overstreet
aae15aafcd bcachefs: New and improved topology repair code
This splits out btree topology repair into a separate pass, and makes
some improvements:
 - When we have to pick which of two overlapping nodes to drop keys
   from, we use the btree node header sequence number to preserve the
   newer node

 - the gc code has been changed so that it doesn't bail out if we're
   continuing/ignoring on fsck error - this way the dump tool can skip
   running the repair pass but still walk all reachable metadata

 - add a new superblock flag indicating when a filesystem is known to
   have btree topology issues, and the topology repair pass should be
   run

 - changing the start/end of a node might mean keys in that node have to
   be deleted: this patch handles that better by splitting it out into a
   separate function and running it explicitly in the topology repair
   code, previously those keys were only being dropped when the btree
   node was read in.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:02 -04:00
Kent Overstreet
bcd25dac53 bcachefs: Rewrite btree nodes with errors
This patch adds self healing functionality for btree nodes - if we
notice a problem when reading a btree node, we just rewrite it.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:02 -04:00
Kent Overstreet
51c804ed2a bcachefs: Punt btree writes to workqueue to submit
We don't want to be submitting IO with btree locks held, and btree
writes usually aren't latency sensitive.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:01 -04:00
Kent Overstreet
2177147b39 bcachefs: Improve bset compaction
The previous patch that fixed btree nodes being written too aggressively
now meant that we weren't sorting btree node bsets optimally - this
patch fixes that.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:59 -04:00
Kent Overstreet
ba5f03d362 bcachefs: Add a sysfs var for average btree write size
Useful number for performance tuning.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:58 -04:00
Kent Overstreet
5f65d74d79 bcachefs: Add repair code for out of order keys in a btree node.
This just drops the offending key - in the bug report where this was
seen, it was clearly a single bit memory error, and fsck will fix the
missing key.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:58 -04:00
Kent Overstreet
e751c01a8e bcachefs: Start using bpos.snapshot field
This patch starts treating the bpos.snapshot field like part of the key
in the btree code:

* bpos_successor() and bpos_predecessor() now include the snapshot field
* Keys in btrees that will be using snapshots (extents, inodes, dirents
  and xattrs) now always have their snapshot field set to U32_MAX

The btree iterator code gets a new flag, BTREE_ITER_ALL_SNAPSHOTS, that
determines whether we're iterating over keys in all snapshots or not -
internally, this controlls whether bkey_(successor|predecessor)
increment/decrement the snapshot field, or only the higher bits of the
key.

We add a new member to struct btree_iter, iter->snapshot: when
BTREE_ITER_ALL_SNAPSHOTS is not set, iter->pos.snapshot should always
equal iter->snapshot, which will be 0 for btrees that don't use
snapshots, and alsways U32_MAX for btrees that will use snapshots
(until we enable snapshot creation).

This patch also introduces a new metadata version number, and compat
code for reading from/writing to older versions - this isn't a forced
upgrade (yet).

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:57 -04:00
Kent Overstreet
4cf91b0270 bcachefs: Split out bpos_cmp() and bkey_cmp()
With snapshots, we're going to need to differentiate between comparisons
that should and shouldn't include the snapshot field. bpos_cmp is now
the comparison function that does include the snapshot field, used by
core btree code.

Upper level filesystem code generally does _not_ want to compare against
the snapshot field - that code wants keys to compare as equal even when
one of them is in an ancestor snapshot.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:57 -04:00
Kent Overstreet
0390ea8ad8 bcachefs: Drop bkey noops
Bkey noops were introduced to deal with trimming inline data extents in
place in the btree: if the u64s field of a bkey was 0, that u64 was a
noop and we'd start looking for the next bkey immediately after it.

But extent handling has been lifted above the btree - we no longer
modify existing extents in place in the btree, and the compatibilty code
for old style extent btree nodes is gone, so we can completely drop this
code.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:57 -04:00
Kent Overstreet
84cc758d6b bcachefs: Validate bset version field against sb version fields
The superblock version fields need to be accurate to know whether a
filesystem is supported, thus we should be verifying them.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:56 -04:00
Kent Overstreet
50dc0f692a bcachefs: Require all btree iterators to be freed
We keep running into occasional bugs with btree transaction iterators
overflowing - this will make those bugs more visible.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:56 -04:00
Kent Overstreet
f020bfcdb0 bcachefs: Use bch2_bpos_to_text() more consistently
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:55 -04:00
Kent Overstreet
2436cb9fad bcachefs: Use x-macros for more enums
This patch standardizes all the enums that have associated string tables
(probably more enums should have string tables).

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:55 -04:00
Kent Overstreet
41f8b09edc bcachefs: Rename BTREE_ID enums for consistency with other enums
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:55 -04:00
Kent Overstreet
c052cf82f3 bcachefs: KEY_TYPE_discard is no longer used
KEY_TYPE_discard used to be used for extent whiteouts, but when handling
over overlapping extents was lifted above the core btree code it became
unused. This patch updates various code to reflect that.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:55 -04:00
Kent Overstreet
f2785955bb bcachefs: Kill support for !BTREE_NODE_NEW_EXTENT_OVERWRITE()
bcachefs has been aggressively migrating filesystems and btree nodes to
the new format for quite some time - this shouldn't affect anyone
anymore, and lets us delete a _lot_ of code. Also, it frees up
KEY_TYPE_discard for a new whiteout key type for snapshots.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:55 -04:00