Commit Graph

73 Commits

Author SHA1 Message Date
Pei Xiao
93d53f1caf bcachefs: add check NULL return of bio_kmalloc in journal_read_bucket
bio_kmalloc may return NULL, will cause NULL pointer dereference.
Add check NULL return for bio_kmalloc in journal_read_bucket.

Signed-off-by: Pei Xiao <xiaopei01@kylinos.cn>
Fixes: ac10a9611d ("bcachefs: Some fixes for building in userspace")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-11-07 16:48:21 -05:00
Gaosheng Cui
ca959e328b bcachefs: fix possible null-ptr-deref in __bch2_ec_stripe_head_get()
The function ec_new_stripe_head_alloc() returns nullptr if kzalloc()
fails. It is crucial to verify its return value before dereferencing
it to avoid a potential nullptr dereference.

Fixes: 035d72f72c ("bcachefs: bch2_ec_stripe_head_get() now checks for change in rw devices")
Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-29 06:34:10 -04:00
Gianfranco Trad
2045fc4295 bcachefs: Fix invalid shift in validate_sb_layout()
Add check on layout->sb_max_size_bits against BCH_SB_LAYOUT_SIZE_BITS_MAX
to prevent UBSAN shift-out-of-bounds in validate_sb_layout().

Reported-by: syzbot+089fad5a3a5e77825426@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=089fad5a3a5e77825426
Fixes: 03ef80b469 ("bcachefs: Ignore unknown mount options")
Tested-by: syzbot+089fad5a3a5e77825426@syzkaller.appspotmail.com
Signed-off-by: Gianfranco Trad <gianf.trad@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-24 17:41:43 -04:00
Kent Overstreet
19773ec997 bcachefs: Disk accounting device validation fixes
- Fix failure to validate that accounting replicas entries point to
  valid devices: this wasn't a real bug since they'd be cleaned up by
  GC, but is still something we should know about

- Fix failure to validate that dev_data_type entries point to valid
  devices: this does fix a real bug, since bch2_accounting_read() would
  then try to copy the counters to that device and pop an inconsistent
  error when the device didn't exist

- Remove accounting entries that are zeroed or invalid: if we're not
  validating them we need to get rid of them: they might not exist in
  the superblock, so we need the to trigger the superblock mark path
  when they're readded.

  This fixes the replication.ktest rereplicate test, which was failing
  with "superblock not marked for replicas..."

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-09 16:42:53 -04:00
Kent Overstreet
ad8d1f77fc bcachefs: bch2_dev_remove_stripes()
We can now correctly force-remove a device that has stripes on it; this
uses the new BCH_SB_MEMBER_INVALID sentinal value.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 11:39:49 -04:00
Kent Overstreet
54a12984a9 bcachefs: EIO errcode cleanup
We want to be using private errcodes whenever possible, for better error
messages.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 11:39:48 -04:00
Kent Overstreet
a977f3e162 bcachefs: BCH_WRITE_ALLOC_NOWAIT no longer applies to open bucket allocation
rebalance writes must be BCH_WRITE_ALLOC_NOWAIT because they don't
allocate from the full filesystem - but we don't want spurious
allocation failures due to open buckets.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 11:35:20 -04:00
Kent Overstreet
e3e6940940 bcachefs: Revert lockless buffered IO path
We had a report of data corruption on nixos when building installer
images.

https://github.com/NixOS/nixpkgs/pull/321055#issuecomment-2184131334

It seems that writes are being dropped, but only when issued by QEMU,
and possibly only in snapshot mode. It's undetermined if it's write
calls are being dropped or dirty folios.

Further testing, via minimizing the original patch to just the change
that skips the inode lock on non appends/truncates, reveals that it
really is just not taking the inode lock that causes the corruption: it
has nothing to do with the other logic changes for preserving write
atomicity in corner cases.

It's also kernel config dependent: it doesn't reproduce with the minimal
kernel config that ktest uses, but it does reproduce with nixos's distro
config. Bisection the kernel config initially pointer the finger at page
migration or compaction, but it appears that was erroneous; we haven't
yet determined what kernel config option actually triggers it.

Sadly it appears this will have to be reverted since we're getting too
close to release and my plate is full, but we'd _really_ like to fully
debug it.

My suspicion is that this patch is exposing a preexisting bug - the
inode lock actually covers very little in IO paths, and we have a
different lock (the pagecache add lock) that guards against races with
truncate here.

Fixes: 7e64c86cdc ("bcachefs: Buffered write path now can avoid the inode lock")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-31 19:26:08 -04:00
Kent Overstreet
d97de0d017 bcachefs: Make bkey_fsck_err() a wrapper around fsck_err()
bkey_fsck_err() was added as an interface that looks like fsck_err(),
but previously all it did was ensure that the appropriate error counter
was incremented in the superblock.

This is a cleanup and bugfix patch that converts it to a wrapper around
fsck_err(). This is needed to fix an issue with the upgrade path to
disk_accounting_v3, where the "silent fix" error list now includes
bkey_fsck errors; fsck_err() handles this in a unified way, and since we
need to change printing of bkey fsck errors from the caller to the inner
bkey_fsck_err() calls, this ends up being a pretty big change.

Als,, rename .invalid() methods to .validate(), for clarity, while we're
changing the function signature anyways (to drop the printbuf argument).

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-13 23:00:50 -04:00
Thomas Bertschinger
1c12d1caf8 bcachefs: Add error code to defer option parsing
This introduces a new error code, option_needs_open_fs, which is used to
indicate that an attempt was made to parse a mount option prior to
opening a filesystem, when that mount option requires an open filesystem
in order to be validated.

Returning this error results in bch2_parse_one_mount_opt() saving that
option for later parsing, after the filesystem is opened.

Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:12 -04:00
Kent Overstreet
504794067f bcachefs: Replace bare EEXIST with private error codes
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-21 10:17:07 -04:00
Kent Overstreet
db42549d40 bcachefs: Add a better limit for maximum number of buckets
The bucket_gens array is a single array allocation (one byte per
bucket), and kernel allocations are still limited to INT_MAX.

Check this limit to avoid failing the bucket_gens array allocation.

Reported-by: syzbot+b29f436493184ea42e2b@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-06 10:58:17 -04:00
Kent Overstreet
ec9cc18fc2 bcachefs: Add checks for invalid snapshot IDs
Previously, we assumed that keys were consistent with the snapshots
btree - but that's not correct as fsck may not have been run or may not
be complete.

This adds checks and error handling when using the in-memory snapshots
table (that mirrors the snapshots btree).

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:11 -04:00
Hongbo Li
2a68d611a1 bcachefs: intercept mountoption value for bool type
For mount option with bool type, the value must be 0 or 1 (See
bch2_opt_parse). But this seems does not well intercepted cause
for other value(like 2...), it returns the unexpect return code
with error message printed.

Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:26 -04:00
Kent Overstreet
7e64c86cdc bcachefs: Buffered write path now can avoid the inode lock
Non append, non extending buffered writes can now avoid taking the inode
lock.

To ensure atomicity of writes w.r.t. other writes, we lock every folio
that we'll be writing to, and if this fails we fall back to taking the
inode lock.

Extensive comments are provided as to corner cases.

Link: https://lore.kernel.org/linux-fsdevel/Zdkxfspq3urnrM6I@bombadil.infradead.org/
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:26 -04:00
Hongbo Li
79162e829b bcachefs: fix the error code when mounting with incorrect options.
When mount with incorrect options such as:
"mount -t bcachefs -o errors=back /dev/loop1 /mnt/bcachefs/".
It rebacks the error "mount: /mnt/bcachefs: permission denied."
 cause bch2_parse_mount_opts returns -1 and bch2_mount throws
it up. This is unreasonable.

The real error message should be like this:
"mount: /mnt/bcachefs: wrong fs type, bad option, bad
superblock on /dev/loop1, missing codepage or helper program,
or other error."

Adding three private error codes for mounting error. Here are:
  - BCH_ERR_mount_option as the parent class for option error.
  - BCH_ERR_option_name represents the invalid option name.
  - BCH_ERR_option_value represents the invalid option value.

Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:25 -04:00
Kent Overstreet
eb386617be bcachefs: Errcode tracepoint, documentation
Add a tracepoint for downcasting private errors to standard errors, so
they can be recovered even when not logged; also, add some
documentation.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:25 -04:00
Kent Overstreet
835cd3e147 bcachefs: Check for subvolume children when deleting subvolumes
Recursively destroying subvolumes isn't allowed yet.

Fixes: https://github.com/koverstreet/bcachefs/issues/634
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:24 -04:00
Kent Overstreet
52946d828a bcachefs: Kill more -EIO error codes
This converts -EIOs related to btree node errors to private error codes,
which will help with some ongoing debugging by giving us better error
messages.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:23 -04:00
Kent Overstreet
ce3e9283de bcachefs: Kill some -EINVALs
Repurposing standard error codes in bcachefs code is banned in new code,
and we need to get rid of the remaining ones - private error codes give
us much better error messages.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:34:09 -04:00
Kent Overstreet
0d529663f0 bcachefs: Split brain detection
Use the new bch_member->seq, sb->write_time fields to detect split brain
and kick out devices when necessary.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-05 23:24:19 -05:00
Kent Overstreet
53b67d8dcf bcachefs: better error message in btree_node_write_work()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:42 -05:00
Kent Overstreet
cee0a8ea6d bcachefs: Improve the nopromote tracepoint
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:42 -05:00
Kent Overstreet
447c1c0105 bcachefs: check for failure to downgrade
With the upcoming member seq patch, it's now critical that we don't ever
write to a superblock that hasn't been version downgraded - failure to
update member seq fields will cause split brain detection to fire
erroniously.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:42 -05:00
Kent Overstreet
09caeabe1a bcachefs: btree write buffer now slurps keys from journal
Previosuly, the transaction commit path would have to add keys to the
btree write buffer as a separate operation, requiring additional global
synchronization.

This patch introduces a new journal entry type, which indicates that the
keys need to be copied into the btree write buffer prior to being
written out. We switch the journal entry type back to
JSET_ENTRY_btree_keys prior to write, so this is not an on disk format
change.

Flushing the btree write buffer may require pulling keys out of journal
entries yet to be written, and quiescing outstanding journal
reservations; we previously added journal->buf_lock for synchronization
with the journal write path.

We also can't put strict bounds on the number of keys in the journal
destined for the write buffer, which means we might overflow the size of
the preallocated buffer and have to reallocate - this introduces a
potentially fatal memory allocation failure. This is something we'll
have to watch for, if it becomes an issue in practice we can do
additional mitigation.

The transaction commit path no longer has to explicitly check if the
write buffer is full and wait on flushing; this is another performance
optimization. Instead, when the btree write buffer is close to full we
change the journal watermark, so that only reservations for journal
reclaim are allowed.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:41 -05:00
Kent Overstreet
56ec287d30 bcachefs: BCH_ERR_opt_parse_error
Continuing the project of replacing generic error codes with more
specific ones.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:40 -05:00
Kent Overstreet
573224301c bcachefs: Make journal replay more efficient
Journal replay now first attempts to replay keys in sorted order,
similar to how the btree write buffer flush path works.

Any keys that can not be replayed due to journal deadlock are then left
for later and replayed in journal order, unpinning journal entries as we
go.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:37 -05:00
Kent Overstreet
84f1638795 bcachefs: bch_sb_field_downgrade
Add a new superblock section that contains a list of
  { minor version, recovery passes, errors_to_fix }

that is - a list of recovery passes that must be run when downgrading
past a given version, and a list of errors to silently fix.

The upcoming disk accounting rewrite is not going to be fully
compatible: we're going to have to regenerate accounting both when
upgrading to the new version, and also from downgrading from the new
version, since the new method of doing disk space accounting is a
completely different architecture based on deltas, and synchronizing
them for every jounal entry write to maintain compatibility is going to
be too expensive and impractical.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:07 -05:00
Kent Overstreet
8b16413cda bcachefs: bch_sb.recovery_passes_required
Add two new superblock fields. Since the main section of the superblock
is now fully, we have to add a new variable length section for them -
bch_sb_field_ext.

 - recovery_passes_requried: recovery passes that must be run on the
   next mount
 - errors_silent: errors that will be silently fixed

These are to improve upgrading and dwongrading: these fields won't be
cleared until after recovery successfully completes, so there won't be
any issues with crashing partway through an upgrade or a downgrade.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:07 -05:00
Kent Overstreet
d5bd37872a bcachefs: Add missing validation for jset_entry_data_usage
Validation was completely missing for replicas entries in the journal
(not the superblock replicas section) - we can't have replicas entries
pointing to invalid devices.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-28 17:18:24 -05:00
Kent Overstreet
7d9f8468ff bcachefs: Data update path won't accidentaly grow replicas
Previously, there was a bug where if an extent had greater durability
than required (because we needed to move a durability=1 pointer and
ended up putting it on a durability 2 device), we would submit a write
for replicas=2 - the durability of the pointer being rewritten - instead
of the number of replicas required to bring it back up to the
data_replicas option.

This, plus the allocation path sometimes allocating on a greater
durability device than requested, meant that extents could continue
having more and more replicas added as they were being rewritten.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-25 21:48:42 -05:00
Kent Overstreet
a973de85e3 bcachefs: Replace ERANGE with private error codes
We avoid using standard error codes: private, per-callsite error codes
make debugging easier.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-05 13:12:18 -05:00
Kent Overstreet
f5d26fa31e bcachefs: bch_sb_field_errors
Add a new superblock section to keep counts of errors seen since
filesystem creation: we'll be addingcounters for every distinct fsck
error.

The new superblock section has entries of the for [ id, count,
time_of_last_error ]; this is intended to let us see what errors are
occuring - and getting fixed - via show-super output.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-01 21:11:08 -04:00
Kent Overstreet
6ddedca218 bcachefs: Guard against unknown compression options
Since compression options now include compression level, proper
validation is a bit more involved.

This adds bch2_compression_opt_valid(), and plumbs it around
appropriately.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-31 12:18:37 -04:00
Hunter Shaffer
3f7b9713da bcachefs: New superblock section members_v2
members_v2 has dynamically resizable entries so that we can extend
bch_member. The members can no longer be accessed with simple array
indexing Instead members_v2_get is used to find a member's exact
location within the array and returns a copy of that member.
Alternatively member_v2_get_mut retrieves a mutable point to a member.

Signed-off-by: Hunter Shaffer <huntershaffer182456@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:15 -04:00
Kent Overstreet
40a53b9215 bcachefs: More minor smatch fixes
- fix a few uninitialized return values
 - return a proper error code in lookup_lostfound()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:14 -04:00
Kent Overstreet
feb5cc3981 bcachefs: trace_read_nopromote()
Add a tracepoint to print the reason a read wasn't promoted.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:12 -04:00
Kent Overstreet
1809b8cba7 bcachefs: Break up io.c
More reorganization, this splits up io.c into
 - io_read.c
 - io_misc.c - fallocate, fpunch, truncate
 - io_write.c

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:12 -04:00
Kent Overstreet
56046e3ecc bcachefs: Convert btree_err_type to normal error codes
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:09 -04:00
Kent Overstreet
922bc5a037 bcachefs: Make topology repair a normal recovery pass
This adds bch2_run_explicit_recovery_pass(), for rewinding recovery and
explicitly running a specific recovery pass - this is a more general
replacement for how we were running topology repair before.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:08 -04:00
Kent Overstreet
ae2e13d780 bcachefs: bch2_run_explicit_recovery_pass()
This introduces bch2_run_explicit_recovery_pass() and uses it for when
fsck detects that we need to re-run dead snaphots cleanup, and makes
dead snapshot cleanup more like a normal recovery pass.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:08 -04:00
Kent Overstreet
43b81a4eac bcachefs: overlapping_extents_found()
This improves the repair path for overlapping extents - we now verify
that we find in the btree the overlapping extents that the algorithm
detected, and fail the fsck run with a more useful error if it doesn't
match.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:07 -04:00
Kent Overstreet
7c50140fce bcachefs: Convert more -EROFS to private error codes
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:06 -04:00
Kent Overstreet
e9d017234f bcachefs: BCH_ERR_fsck -> EINVAL
When we return errors outside of bcachefs, we need to return a standard
error code - fix this for BCH_ERR_fsck.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:04 -04:00
Kent Overstreet
9473cff989 bcachefs: Fix more lockdep splats in debug.c
Similar to previous fixes, we can't incur page faults while holding
btree locks.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:04 -04:00
Kent Overstreet
3ebfc8fe95 bcachefs: Use unlikely() in bch2_err_matches()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:03 -04:00
Kent Overstreet
e47a390aa5 bcachefs: Convert -ENOENT to private error codes
As with previous conversions, replace -ENOENT uses with more informative
private error codes.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:03 -04:00
Kent Overstreet
1c59b483a3 bcachefs: BTREE_ID_snapshot_tree
This adds a new btree which gets us a persistent per-snapshot-tree
identifier.

 - BTREE_ID_snapshot_trees
 - KEY_TYPE_snapshot_tree
 - bch_snapshot now has a field that points to a snapshot_tree

This is going to be used to designate one snapshot ID/subvolume out of a
given tree of snapshots as the "main" subvolume, so that we can do quota
accounting in that subvolume and not the rest.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:01 -04:00
Kent Overstreet
51e84d3bbf bcachefs: bch2_bkey_get_empty_slot()
Add a new helper for allocating a new slot in a btree.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:01 -04:00
Kent Overstreet
65d48e3525 bcachefs: Private error codes: ENOMEM
This adds private error codes for most (but not all) of our ENOMEM uses,
which makes it easier to track down assorted allocation failures.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:57 -04:00