Commit Graph

1216515 Commits

Author SHA1 Message Date
Dan Robertson
faf1a5f417 bcachefs: Fix out of bounds read in fs usage ioctl
Fix a possible read out of bounds if bch2_ioctl_fs_usage is called when
replica_entries_bytes is set to a value that is smaller than the size
of bch_replicas_usage.

Signed-off-by: Dan Robertson <dan@dlrobertson.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:03 -04:00
Dan Robertson
2b25de552f bcachefs: Fix null deref in bch2_ioctl_read_super
Do not attempt to cleanup the returned value of bch2_device_lookup if
the returned value was an error pointer. We currently check to see if
the returned value is null and run the cleanup otherwise. As a result,
we attempt to run the cleanup on a error pointer.

Signed-off-by: Dan Robertson <dan@dlrobertson.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:03 -04:00
Dan Robertson
ec4ab9d2fc bcachefs: Fix possible null deref on mount
Ensure that the block device pointer in a superblock handle is not
null before dereferencing it in bch2_dev_to_fs. The block device pointer
may be null when mounting a new bcachefs filesystem given another mounted
bcachefs filesystem exists that has at least one device that is offline.

Signed-off-by: Dan Robertson <dan@dlrobertson.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:03 -04:00
Dan Robertson
baf056b87d bcachefs: Fix error in parsing of mount options
When parsing the mount options duplicate the given options. This is
required as the options are parsed twice and strsep is used in parsing.
The options will be modified into a possibly invalid options set for the
second round of parsing if the options are not duplicated before
parsing.

Signed-off-by: Dan Robertson <dan@dlrobertson.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:03 -04:00
Stijn Tintel
ffcf9ec78c bcachefs: avoid out-of-bounds in split_devs
Calling mount with an empty source string causes an out-of-bounds error
in split_devs. Check the length of the source string to avoid this.

Signed-off-by: Stijn Tintel <stijn@linux-ipv6.be>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:03 -04:00
Kent Overstreet
909004d2f9 bcachefs: Make sure to use BTREE_ITER_PREFETCH in fsck
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:03 -04:00
Kent Overstreet
360746bf6f bcachefs: Fix bch2_btree_iter_peek_with_updates()
By not re-fetching the next update we were going into an infinite loop.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:03 -04:00
Kent Overstreet
933532b8b2 bcachefs: Fix reflink trigger
The trigger for reflink pointers wasn't always incrementing/decrementing
the refcounts correctly - this patch fixes that logic.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:03 -04:00
Kent Overstreet
3a402c8dab bcachefs: Fix some refcounting bugs
We really need debug mode assertions that ca->ref and ca->io_ref are
used correctly.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:03 -04:00
Dan Robertson
5bc38f44fa bcachefs: Fix oob write in __bch2_btree_node_write
Fix a possible out of bounds write in __bch2_btree_node_write when
the data buffer padding is cleared up to the block size. The out of
bounds write is possible if the data buffers size is not a multiple
of the block size.

Signed-off-by: Dan Robertson <dan@dlrobertson.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:03 -04:00
Kent Overstreet
1784d43a88 bcachefs: Fix usage of last_seq + encryption
jset->last_seq is in the region that's encrypted - on journal write
completion, we were using it and getting garbage. This patch shadows it
to fix.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:02 -04:00
Kent Overstreet
ac1019d32b bcachefs: Clean up bch2_btree_and_journal_walk()
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:02 -04:00
Kent Overstreet
e68031fb46 bcachefs: Mark newly allocated btree nodes as accessed
This was a major oversight - this means under memory pressure we can end
up reading in a btree node, then having it evicted before we get to use
it.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:02 -04:00
Kent Overstreet
595c1e9bab bcachefs: Fix time handling
There were some overflows in the time conversion functions - fix this by
converting tv_sec and tv_nsec separately. Also, set sb->time_min and
sb->time_max.

Fixes xfstest generic/258.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:02 -04:00
Kent Overstreet
4f6dad46cb bcachefs: Add a tracepoint for when we block on journal reclaim
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:02 -04:00
Kent Overstreet
2ce867df31 bcachefs: Make sure to initialize j->last_flushed
If the journal reclaim thread makes it to the timeout without ever
initializing j->last_flushed, we could end up sleeping for a very long
time.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:02 -04:00
Kent Overstreet
050197b1c1 bcachefs: Ensure that fpunch updates inode timestamps
Fixes xfstests generic/059

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:02 -04:00
Kent Overstreet
d4b4422345 bcachefs: Change copygc wait amount to be min of per device waits
We're seeing a filesystem get stuck when all devices but one have no
more reclaimable buckets - because the copygc wait amount is curretly
filesystem wide.

This patch should fix that, possibly at the expensive of running too
much when only one or a few devices is full and the rebalance thread
needs to move data around.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:02 -04:00
Kent Overstreet
baa6502905 bcachefs: Change bch2_btree_key_cache_count() to exclude dirty keys
We're seeing livelocks that appear to be due to
bch2_btree_key_cache_scan repeatedly scanning and blocking other tasks
from using the key cache lock - we probably shouldn't be reporting
objects that can't actually be freed yet.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:02 -04:00
Kent Overstreet
d99af4f194 bcachefs: Call bch2_inconsistent_error() on missing stripe/indirect extent
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:02 -04:00
Kent Overstreet
3dea728ce6 bcachefs: New tracepoint for bch2_trans_get_iter()
Trying to debug an issue where after traverse_all() we shouldn't have to
traverse any iterators... yet we are

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:02 -04:00
Kent Overstreet
d36cdb045a bcachefs: Fix __bch2_trans_get_iter()
We need to also set iter->uptodate to indicate it needs to be traversed.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:02 -04:00
Kent Overstreet
ceda1b9a17 bcachefs: Evict btree nodes we're deleting
There was a bug that led to duplicate btree node pointers being inserted
at the wrong level. The new topology repair code can fix that, except
that the btree cache code gets confused when we read in a btree node
from the pointer that was at the wrong level. This patch evicts nodes
that we're deleting to, which nicely solves the problem.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:02 -04:00
Kent Overstreet
fc51b041b7 bcachefs: New check_nlinks algorithm for snapshots
With snapshots, using a radix tree for the table of link counts won't
work anymore because we also need to distinguish between inodes with
different snapshot IDs. Instead, this patch builds up a sorted array of
inodes that have hardlinks that we can binary search on - taking
advantage of the fact that with inode backpointers, the check_nlinks()
pass _only_ needs to concern itself with inodes that have hardlinks now.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:02 -04:00
Kent Overstreet
e3b4b48c17 bcachefs: Fix a null ptr deref
Fix a few memory safety issues, found by asan in userspace.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:02 -04:00
Kent Overstreet
aae15aafcd bcachefs: New and improved topology repair code
This splits out btree topology repair into a separate pass, and makes
some improvements:
 - When we have to pick which of two overlapping nodes to drop keys
   from, we use the btree node header sequence number to preserve the
   newer node

 - the gc code has been changed so that it doesn't bail out if we're
   continuing/ignoring on fsck error - this way the dump tool can skip
   running the repair pass but still walk all reachable metadata

 - add a new superblock flag indicating when a filesystem is known to
   have btree topology issues, and the topology repair pass should be
   run

 - changing the start/end of a node might mean keys in that node have to
   be deleted: this patch handles that better by splitting it out into a
   separate function and running it explicitly in the topology repair
   code, previously those keys were only being dropped when the btree
   node was read in.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:02 -04:00
Kent Overstreet
4932e07ea0 bcachefs: Fix key cache assertion
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:02 -04:00
Kent Overstreet
0098376f03 bcachefs: New helper __bch2_btree_insert_keys_interior()
Consolidate common parts of bch2_btree_insert_keys_interior() and
btree_split_insert_keys() - prep work for adding some new topology
assertions.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:02 -04:00
Kent Overstreet
bcd25dac53 bcachefs: Rewrite btree nodes with errors
This patch adds self healing functionality for btree nodes - if we
notice a problem when reading a btree node, we just rewrite it.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:02 -04:00
Kent Overstreet
8058b532ac bcachefs: Fix bch2_verify_keylist_sorted
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:02 -04:00
Kent Overstreet
bc2e5d5c66 bcachefs: Fix an out of bounds read
bch2_varint_decode() can read up to 7 bytes past the end of the buffer,
which means we need to allocate slightly larger key cache buffers.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:02 -04:00
Kent Overstreet
65c0601a32 bcachefs: Use mmap() instead of vmalloc_exec() in userspace
Calling mmap() directly is much better than malloc() then mprotect(), we
end up with much less address space fragmentation.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:02 -04:00
Kent Overstreet
537c32f521 bcachefs: Don't BUG_ON() btree topology error
This replaces an assertion in the btree merge path with a
bch2_inconsistent_error() - fsck will fix it.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:01 -04:00
Kent Overstreet
1c8441bea5 bcachefs: Fix repair leading to replicas not marked
bch2_check_fix_ptrs() was being called after checking if the replicas
set was marked - but repair could change which replicas set needed to be
marked. Oops.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:01 -04:00
Kent Overstreet
58686a259e bcachefs: Lookup/create lost+found lazily
This is prep work for subvolumes - each subvolume will have its own
lost+found.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:01 -04:00
Kent Overstreet
eb365fbc33 bcachefs: Don't BUG() in update_replicas
Apparently, we have a bug where in mark and sweep while accounting for a
key, a replicas entry isn't found. Change the code to print out the key
we couldn't mark and halt instead of a BUG_ON().

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:01 -04:00
Kent Overstreet
f09517fc51 bcachefs: Fix a deadlock on journal reclaim
Flushing the btree key cache needs to use allocation reserves - journal
reclaim depends on flushing the btree key cache for making forward
progress, and the allocator and copygc depend on journal reclaim making
forward progress.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:01 -04:00
Kent Overstreet
6adaac0b95 bcachefs: Update bch2_btree_verify()
bch2_btree_verify() verifies that the btree node on disk matches what we
have in memory. This patch changes it to verify every replica, and also
fixes it for interior btree nodes - there's a mem_ptr field which is
used as a scratch space and needs to be zeroed out for comparing with
what's on disk.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:01 -04:00
Kent Overstreet
7b7278bbaf bcachefs: Fix two btree iterator leaks
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:01 -04:00
Kent Overstreet
51c804ed2a bcachefs: Punt btree writes to workqueue to submit
We don't want to be submitting IO with btree locks held, and btree
writes usually aren't latency sensitive.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:01 -04:00
Kent Overstreet
4d47b21c4d bcachefs: Fix a use after free
Turns out, we weren't waiting on in flight btree writes when freeing
existing btree nodes. This lead to stray btree writes overwriting newly
allocated buckets, but only started showing itself with some of the
recent allocator work and another patch to move submitting of btree
writes to worqueues.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:01 -04:00
Kent Overstreet
8ce600d447 bcachefs: Fix for btree_gc repairing interior btree ptrs
Using the normal transaction commit path to insert and journal updates
to interior nodes hadn't been done before this repair code was written,
not surprising that there was a bug.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:01 -04:00
Kent Overstreet
e95d7edfb7 bcachefs: Preallocate trans mem in bch2_migrate_index_update()
This will help avoid transaction restarts.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:01 -04:00
Kent Overstreet
89baec780f bcachefs: Allocator refactoring
This uses the kthread_wait_freezable() macro to simplify a lot of the
allocator thread code, along with cleaning up bch2_invalidate_bucket2().

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:01 -04:00
Kent Overstreet
fa272f33bb bcachefs: Always check for invalid bkeys in trans commit path
We check for this prior to metadata being written, but we're seeing some
strange bugs lately, and this will help catch those closer to where they
occur.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:01 -04:00
Kent Overstreet
27cc532ef2 bcachefs: Check that keys are in the correct btrees
We've started seeing bug reports of pointers to btree nodes being
detected in leaf nodes. This should catch that before it's happened, and
it's something we should've been checking anyways.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:01 -04:00
Kent Overstreet
04903131db bcachefs: Handle errors in bch2_trans_mark_update()
It's not actually the case that iterators are always checked here -
__bch2_trans_commit() checks for that after running triggers.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:01 -04:00
Kent Overstreet
6ad060b0eb bcachefs: Allocator thread doesn't need gc_lock anymore
Even with runtime gc (which currently isn't supported), runtime gc no
longer clears/recalculates the main set of bucket marks - it allocates
and calculates another set, updating the primary at the end.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:01 -04:00
Kent Overstreet
dac1525d9c bcachefs: gc shouldn't care about owned_by_allocator
The owned_by_allocator field is a purely in memory thing, even if/when
we bring back GC at runtime there's no need for it to be recalculating
this field. This is prep work for pulling it out of struct bucket, and
eventually getting rid of the bucket array.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:01 -04:00
Kent Overstreet
694015c2b1 bcachefs: Refactor bchfs_fallocate() to not nest btree_trans on stack
Upcoming patch is going to disallow multiple btree_trans on the stack.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:01 -04:00