Commit Graph

1217716 Commits

Author SHA1 Message Date
Kent Overstreet
d95dd378c2 bcachefs: allocate_dropping_locks()
Add two new helpers for allocating memory with btree locks held. The
idea is to first try the allocation with GFP_NOWAIT|__GFP_NOWARN, then
if that fails - unlock, retry with GFP_KERNEL, and then call
trans_relock().
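
The fallback sequence described above can be sketched in userspace C. Everything prefixed sketch_ is a hypothetical stand-in (mock transaction, mock allocator, mock gfp flags), not the bcachefs implementation; it only illustrates the order of operations: non-blocking attempt first, then unlock, blocking retry, relock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for kernel gfp flags; not the real values. */
enum sketch_gfp { SKETCH_GFP_NOWAIT, SKETCH_GFP_KERNEL };

/* Hypothetical stand-in for struct btree_trans: only tracks lock state. */
struct sketch_trans {
	bool locked;
	int  relock_should_fail;	/* simulates a failed relock */
};

static void sketch_trans_unlock(struct sketch_trans *trans)
{
	trans->locked = false;
}

static int sketch_trans_relock(struct sketch_trans *trans)
{
	if (trans->relock_should_fail)
		return -1;
	trans->locked = true;
	return 0;
}

/* Test knob: when set, the non-blocking attempt fails. */
static bool nowait_fails;

static void *sketch_alloc(size_t size, enum sketch_gfp gfp,
			  const struct sketch_trans *trans)
{
	/* A blocking allocation must never run with locks held. */
	assert(gfp == SKETCH_GFP_NOWAIT || !trans->locked);

	if (gfp == SKETCH_GFP_NOWAIT && nowait_fails)
		return NULL;
	return malloc(size);
}

/*
 * Try without blocking first; on failure drop locks, retry with a
 * blocking allocation, then relock. The caller must check both the
 * returned pointer and *ret (the relock result).
 */
static void *sketch_allocate_dropping_locks(struct sketch_trans *trans,
					    size_t size, int *ret)
{
	void *p;

	*ret = 0;
	p = sketch_alloc(size, SKETCH_GFP_NOWAIT, trans);
	if (p)
		return p;

	sketch_trans_unlock(trans);
	p = sketch_alloc(size, SKETCH_GFP_KERNEL, trans);
	*ret = sketch_trans_relock(trans);
	return p;
}
```

In this sketch a failed relock still hands back the allocated memory, so the caller is responsible for freeing it before restarting.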

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:03 -04:00
Kent Overstreet
3ebfc8fe95 bcachefs: Use unlikely() in bch2_err_matches()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:03 -04:00
Kent Overstreet
4c4a8f20d1 bcachefs: Fix error handling in promote path
The promote path had a BUG_ON() for unknown error type, which we're now
seeing: change it to a WARN_ON() - because we're curious what this is -
and otherwise handle it in the normal error path.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:03 -04:00
Kent Overstreet
5718fda0b5 bcachefs: fs-io: Eliminate GFP_NOFS usage
GFP_NOFS doesn't ever make sense. If we're allocating memory it should
be GFP_NOWAIT if btree locks are held, GFP_KERNEL otherwise.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:03 -04:00
Kent Overstreet
78367aaa5a bcachefs: bch2_trans_kmalloc no longer allocates memory with btree locks held
When allocating memory, gfp flags should generally be

 - GFP_NOWAIT|__GFP_NOWARN if btree locks are held
 - GFP_NOFS if in the IO path or otherwise holding resources needed for
   IO submission
 - GFP_KERNEL otherwise

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:03 -04:00
Kent Overstreet
b5fd75669a bcachefs: drop_locks_do()
Add a new helper for the common pattern of:
 - trans_unlock()
 - do something
 - trans_relock()
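
A minimal userspace sketch of the unlock / do something / relock pattern follows. The dl_ names are hypothetical, and the helper is written as a function taking a callback rather than a macro; it assumes that an error from the work step is returned as-is, without relocking.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a btree transaction's lock state. */
struct dl_trans {
	bool locked;
	int  relock_err;	/* nonzero: simulate a failed relock */
};

static void dl_trans_unlock(struct dl_trans *t)
{
	t->locked = false;
}

static int dl_trans_relock(struct dl_trans *t)
{
	if (t->relock_err)
		return t->relock_err;
	t->locked = true;
	return 0;
}

/*
 * Drop locks, run fn(arg), then relock.
 * Returns fn's error if it is nonzero, otherwise the relock result.
 */
static int dl_drop_locks_do(struct dl_trans *t, int (*fn)(void *), void *arg)
{
	int ret;

	dl_trans_unlock(t);
	ret = fn(arg);
	if (ret)
		return ret;
	return dl_trans_relock(t);
}

/* Example work items: one succeeds and bumps a counter, one fails. */
static int dl_blocking_work(void *arg)
{
	int *counter = arg;

	(*counter)++;
	return 0;
}

static int dl_failing_work(void *arg)
{
	(void)arg;
	return -5;
}
```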

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:03 -04:00
Kent Overstreet
19c304bebd bcachefs: GFP_NOIO -> GFP_NOFS
GFP_NOIO dates from the bcache days, when we operated under the block
layer. Now, GFP_NOFS is more appropriate, so switch all GFP_NOIO uses to
GFP_NOFS.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:03 -04:00
Kent Overstreet
e1d29c5fa1 bcachefs: Ensure bch2_btree_node_get() calls relock() after unlock()
Fix a bug where bch2_btree_node_get() might call bch2_trans_unlock() (in
fill) without calling bch2_trans_relock(); this is a bug when it's done
in the core btree code.

Also, tweak bch2_btree_node_mem_alloc() to drop btree locks before doing
a blocking memory allocation.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:03 -04:00
Kent Overstreet
70d41c9e27 bcachefs: Avoid __GFP_NOFAIL
We've been using __GFP_NOFAIL for allocating struct bch_folio, our
private per-folio state.

However, that struct is variable size - it holds state for each sector
in the folio, and folios can be quite large now, which means it's
possible for bch_folio to be larger than PAGE_SIZE now.

__GFP_NOFAIL allocations are undesirable in normal circumstances, but
particularly so at >= PAGE_SIZE, and warnings are emitted for that.

So, this patch adds proper error paths and eliminates most uses of
__GFP_NOFAIL. Also, do some more cleanup of gfp flags w.r.t. btree node
locks: we can use GFP_KERNEL, but only if we're not holding btree locks,
and if we are holding btree locks we should be using GFP_NOWAIT.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:03 -04:00
Kent Overstreet
ad520141b1 bcachefs: Fix corruption with writeable snapshots
When partially overwriting an extent in an older snapshot, the existing
extent has to be split.

If the existing extent was overwritten in a different (sibling)
snapshot, we have to ensure that the split won't be visible in the
sibling snapshot.

data_update.c already has code for this,
bch2_insert_snapshot_writeouts() - we just need to move it into
btree_update_leaf.c and change bch2_trans_update_extent() to use it as
well.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:03 -04:00
Kent Overstreet
e47a390aa5 bcachefs: Convert -ENOENT to private error codes
As with previous conversions, replace -ENOENT uses with more informative
private error codes.
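
One way private error codes can stay compatible with checks against a standard errno is to give each private code a parent class and walk up the chain when matching. The sketch below is an illustration only: the pe_ names, the numeric base and the table layout are all assumptions, not the bcachefs definitions.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical private codes, numbered above the range used by the
 * standard errno values so the two never collide.
 */
enum pe_code {
	PE_START = 2048,
	PE_ENOENT_inode,
	PE_ENOENT_subvolume,
	PE_MAX
};

/* Parent class of each private code, indexed by (code - PE_START). */
static const int pe_parent[] = {
	[PE_ENOENT_inode - PE_START]	 = ENOENT,
	[PE_ENOENT_subvolume - PE_START] = ENOENT,
};

/*
 * True if err is err_class itself or a private code descended from it.
 * Accepts the kernel-style negative form (-ENOENT) as well.
 */
static bool pe_err_matches(int err, int err_class)
{
	if (err < 0)
		err = -err;

	while (err != err_class && err > PE_START && err < PE_MAX)
		err = pe_parent[err - PE_START];

	return err == err_class;
}
```

A caller that only cares whether something was missing can keep testing against ENOENT, while logs and tracepoints get the more specific code.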

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:03 -04:00
Kent Overstreet
f154c3eb42 bcachefs: trans_for_each_path_safe()
bch2_btree_trans_to_text() is used on btree_trans objects that are owned
by different threads - when printing out deadlock cycles - so we need a
safe version of trans_for_each_path(), else we race with seeing a
btree_path that was just allocated and not fully initialized:

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:02 -04:00
Kent Overstreet
e7ffda565a bcachefs: Fix a quota read bug
bch2_fs_quota_read() could see an inode that's been deleted
(KEY_TYPE_inode_generation) - bch2_fs_quota_read_inode() needs to check
for that instead of erroring.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:02 -04:00
Kent Overstreet
c26463ce99 bcachefs: Fix move_extent_fail counter
fail counters need to be events, not numbers of sectors - or the
calculations the tests use for determining if we've had too many
slowpath events don't work.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:02 -04:00
Kent Overstreet
fc0ee376bb bcachefs: Don't reuse reflink btree keyspace
We've been seeing difficult-to-debug "missing indirect extent" bugs
that fsck doesn't seem to find.

One possibility is that there was a missing indirect extent, but then a
new indirect extent was created at the location of the previous indirect
extent.

This patch eliminates that possibility by always creating new indirect
extents right after the last one, at the end of the reflink btree.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:02 -04:00
Kent Overstreet
db32bb9a5f mean and variance: Add a missing include
abs() is in math.h

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:02 -04:00
Kent Overstreet
65bc410907 mean and variance: More tests
Add some more tests that test conventional and weighted mean
simultaneously, and with a table of values that represents events that
we'll be using this to look for so we can verify-by-eyeball that the
output looks sane.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:02 -04:00
Kent Overstreet
aab5e0972a six locks: Disable percpu read lock mode in userspace
When running in userspace, we currently don't have a real percpu
implementation available - at least in bcachefs-tools, which is where
this code is currently used in userspace.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:02 -04:00
Kent Overstreet
2d9200cfe0 six locks: Use atomic_try_cmpxchg_acquire()
This switches to a newer cmpxchg variant which updates @old for us on
failure, simplifying the cmpxchg loops a bit and supposedly generating
better code.
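
C11's atomic_compare_exchange_weak_explicit() has the same "update the expected value on failure" behaviour, so it can illustrate the loop shape in userspace. This is a sketch with hypothetical tc_ names and a made-up one-bit lock word, not the six lock code.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define TC_LOCKED 1u	/* hypothetical "held" bit in the state word */

/*
 * Try to set TC_LOCKED with acquire ordering. On a failed exchange,
 * 'old' is refreshed with the current value, so the loop body never
 * has to reload the state itself.
 */
static bool tc_trylock(atomic_uint *state)
{
	unsigned old = atomic_load_explicit(state, memory_order_relaxed);

	do {
		if (old & TC_LOCKED)
			return false;
	} while (!atomic_compare_exchange_weak_explicit(state, &old,
							old | TC_LOCKED,
							memory_order_acquire,
							memory_order_relaxed));
	return true;
}

static void tc_unlock(atomic_uint *state)
{
	atomic_fetch_and_explicit(state, ~TC_LOCKED, memory_order_release);
}
```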

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:02 -04:00
Kent Overstreet
c4687a4a75 six locks: Fix an uninitialized var
In the conversion to atomic_t, six_lock_slowpath() ended up calling
six_lock_wakeup() in the failure path with a state variable that was
never initialized - whoops.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:02 -04:00
Kent Overstreet
96e53e909d six locks: Delete redundant comment
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:02 -04:00
Kent Overstreet
2ab62310fd six locks: Tiny bit more tidying
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:02 -04:00
Kent Overstreet
32913f49f5 six locks: Seq now only incremented on unlock
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:02 -04:00
Kent Overstreet
2804d0f15b six locks: Split out seq, use atomic_t instead of atomic64_t
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:02 -04:00
Kent Overstreet
a4e9e1f0cb six locks: Single instance of six_lock_vals
Since we're not generating different versions of the lock functions for
each lock type, the constant propagation we were trying to do before is
no longer useful - this is now a small code size decrease.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:02 -04:00
Kent Overstreet
357c126152 six_locks: Kill test_bit()/set_bit() usage
This deletes the crazy cast-atomic-to-unsigned-long, and replaces them
with atomic_and() and atomic_or().
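
The userspace analogue of this change is to manipulate flag bits with atomic OR/AND on the atomic type directly instead of casting it to a bitmap. A sketch using C11 atomics; the ab_ names and flag values are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bits in a lock state word. */
#define AB_WAITING_READ		(1u << 0)
#define AB_WAITING_WRITE	(1u << 1)

/* Set flag bits atomically, leaving all other bits untouched. */
static void ab_set_flag(atomic_uint *state, unsigned mask)
{
	atomic_fetch_or(state, mask);
}

/* Clear flag bits atomically, leaving all other bits untouched. */
static void ab_clear_flag(atomic_uint *state, unsigned mask)
{
	atomic_fetch_and(state, ~mask);
}

static bool ab_test_flag(atomic_uint *state, unsigned mask)
{
	return (atomic_load(state) & mask) != 0;
}
```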

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:02 -04:00
Kent Overstreet
b60c8e9e7b six locks: lock->state.seq no longer used for write lock held
lock->state.seq is shortly being moved out of lock->state, to kill the
dependency on atomic64; in preparation for that, we change the write
locking bit to write locked.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:02 -04:00
Kent Overstreet
dc88b65f3e six locks: Simplify six_relock()
The next patch is going to move lock->seq out of lock->state. This
replaces six_relock() with a much simpler implementation based on
trylock.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:02 -04:00
Kent Overstreet
37f612bea5 six locks: Improve spurious wakeup handling in pcpu reader mode
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:02 -04:00
Kent Overstreet
91d16f16d0 six locks: Documentation, renaming
 - Expanded and revamped overview documentation in six.h, giving an
   overview of all features
 - docbook-comments for all external interfaces
 - Rename some functions for simplicity, i.e.
   six_lock_ip_type() -> six_lock_ip()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:02 -04:00
Kent Overstreet
1fb4fe6317 six locks: Kill six_lock_state union
As suggested by Linus, this drops the six_lock_state union in favor of
raw bitmasks.

On the one hand, bitfields give more type-level structure to the code.
However, a significant amount of the code was working with
six_lock_state as a u64/atomic64_t, and the conversions from the
bitfields to the u64 were deemed a bit too out-there.

More significantly, because bitfield order is poorly defined (#ifdef
__LITTLE_ENDIAN_BITFIELD can be used, but is gross), incrementing the
sequence number would overflow into the rest of the bitfield if the
compiler didn't put the sequence number at the high end of the word.
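
The overflow hazard is easy to demonstrate with explicit masks. In the sketch below (hypothetical sq_ names and a made-up 32/32 split, not the six lock layout), a sequence number kept in the high bits wraps harmlessly, while one kept in the low bits carries into the neighbouring field.

```c
#include <stdint.h>

#define SQ_SEQ_SHIFT	32
#define SQ_SEQ_ONE	((uint64_t)1 << SQ_SEQ_SHIFT)
#define SQ_FLAGS_MASK	(SQ_SEQ_ONE - 1)

/* Layout A: seq in the high 32 bits, flags in the low 32 bits. */
static uint64_t sq_seq_inc(uint64_t state)
{
	/* Any carry out of seq falls off the top of the word. */
	return state + SQ_SEQ_ONE;
}

static uint32_t sq_seq(uint64_t state)
{
	return (uint32_t)(state >> SQ_SEQ_SHIFT);
}

static uint32_t sq_flags(uint64_t state)
{
	return (uint32_t)(state & SQ_FLAGS_MASK);
}

/* Layout B: seq in the low 32 bits, flags in the high 32 bits. */
static uint64_t sq_low_seq_inc(uint64_t state)
{
	/* A carry out of seq lands in the flags field above it. */
	return state + 1;
}
```

With raw masks the layout is fixed by the code itself, whereas the placement of a bitfield within the word depends on the compiler and target.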

The new code is a bit saner when we're on an architecture without real
atomic64_t support - all accesses to lock->state now go through
atomic64_*() operations.

On architectures with real atomic64_t support, we additionally use
atomic bit ops for setting/clearing individual bits.

Text size: 7467 bytes -> 4649 bytes - compilers still suck at
bitfields.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:02 -04:00
Kent Overstreet
c4bd3491b1 six locks: Simplify dispatch
Originally, we used inlining/flattening to cause the compiler to
generate different versions of lock/trylock/relock/unlock for each lock
type - read, intent, and write. This made the individual functions
smaller and let the compiler eliminate table lookups: however, as the
code has gotten more complicated these optimizations have gotten less
worthwhile, and all the tricky inlining and dispatching made the code
less readable.
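
For illustration, table-driven dispatch can look roughly like the following: one trylock/unlock pair that takes the lock type as an argument and looks up per-type constants. This is a single-threaded, non-atomic sketch with hypothetical td_ names and bit layout; unlike real six locks it does not enforce that an intent lock is held before taking a write lock.

```c
#include <stdbool.h>

enum td_type { TD_READ, TD_INTENT, TD_WRITE };

/* Hypothetical state layout: bits 0-7 reader count, bit 8 intent, bit 9 write. */
#define TD_READERS_MASK	0xffu
#define TD_INTENT_BIT	(1u << 8)
#define TD_WRITE_BIT	(1u << 9)

struct td_vals {
	unsigned lock_val;	/* added to the state when the lock is taken */
	unsigned lock_fail;	/* trylock fails if any of these bits are set */
};

/* One table replaces a separately generated function per lock type. */
static const struct td_vals td_lock_vals[] = {
	[TD_READ]   = { .lock_val = 1,		   .lock_fail = TD_WRITE_BIT },
	[TD_INTENT] = { .lock_val = TD_INTENT_BIT, .lock_fail = TD_INTENT_BIT },
	[TD_WRITE]  = { .lock_val = TD_WRITE_BIT,  .lock_fail = TD_READERS_MASK },
};

static bool td_trylock(unsigned *state, enum td_type type)
{
	const struct td_vals *v = &td_lock_vals[type];

	if (*state & v->lock_fail)
		return false;
	*state += v->lock_val;
	return true;
}

static void td_unlock(unsigned *state, enum td_type type)
{
	*state -= td_lock_vals[type].lock_val;
}
```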

Text size: 11015 bytes -> 7467 bytes, and benchmarks show no loss of
performance.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:02 -04:00
Kent Overstreet
d2c86b77de six locks: Centralize setting of waiting bit
Originally, the waiting bit was always set by trylock() on failure:
however, it's now set by __six_lock_type_slowpath(), with wait_lock held
- which is the more correct place to do it.

That made setting the waiting bit in trylock redundant, so this patch
deletes that.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:02 -04:00
Kent Overstreet
0157f9c5a7 six locks: Remove hacks for percpu mode lost wakeup
The lost wakeup bug hasn't been observed in a while, and we're trying to
provoke it and determine if it still exists.

This patch removes some defenses that were added to attempt to track it
down; if it still exists, this should make it easier to see it.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:01 -04:00
Kent Overstreet
0d2234a79e six locks: Kill six_lock_pcpu_(alloc|free)
six_lock_pcpu_alloc() is an unsafe interface: it's not safe to allocate
or free the percpu reader count on an existing lock that's in use, the
only safe time to allocate percpu readers is when the lock is first
being initialized.

This patch adds a flags parameter to six_lock_init(), and instead of
six_lock_pcpu_free() we now expose six_lock_exit(), which does the same
thing but is less likely to be misused.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:01 -04:00
Kent Overstreet
01bf56a977 six locks: six_lock_readers_add()
This moves a helper out of the bcachefs code that shouldn't have been
there, since it touches six lock internals.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:01 -04:00
Kent Overstreet
f375d6ca58 bcachefs: Don't call local_clock() twice in trans_begin()
local_clock() is not as cheap as we'd like it to be, alas

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:01 -04:00
Kent Overstreet
962210b281 bcachefs: Fix a buffer overrun in bch2_fs_usage_read()
We were copying the size of a struct bch_fs_usage_online to a struct
bch_fs_usage, which is 8 bytes smaller.

This adds some new helpers so we can do this correctly, and get rid of
some magic +1s too.
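
The class of bug is a copy sized by the larger of two related structs. A userspace sketch with hypothetical bo_ structs (a fixed header followed by a variable number of u64 counters) shows how per-type size helpers keep the copy length tied to the destination type; the field names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical inner struct; nr_counters u64 counters follow it in memory. */
struct bo_usage {
	uint64_t reserved;
	uint64_t nr_inodes;
};

/* Hypothetical wrapper: one extra u64 in front of the inner struct. */
struct bo_usage_online {
	uint64_t	online_reserved;
	struct bo_usage	u;
};

/* Bytes occupied by a bo_usage plus its trailing counters. */
static size_t bo_usage_bytes(unsigned nr_counters)
{
	return sizeof(struct bo_usage) + nr_counters * sizeof(uint64_t);
}

/* Bytes occupied by the wrapper plus the same trailing counters. */
static size_t bo_usage_online_bytes(unsigned nr_counters)
{
	return sizeof(struct bo_usage_online) + nr_counters * sizeof(uint64_t);
}

/* Copy exactly one bo_usage (header + counters), never the wrapper's size. */
static void bo_usage_copy(void *dst, const void *src, unsigned nr_counters)
{
	memcpy(dst, src, bo_usage_bytes(nr_counters));
}
```

Here the two size helpers differ by exactly sizeof(uint64_t), so passing the wrapper's size to memcpy() would write 8 bytes past the end of a destination sized for the inner struct.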

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:01 -04:00
Kent Overstreet
0b438c5bfa bcachefs: Clear btree_node_just_written() when node reused or evicted
This fixes the following bug:

Journal reclaim attempts to flush a node, but races with the node being
evicted from the btree node cache; when we lock the node, the data
buffers have already been freed.

We don't evict a node that's dirty, so calling btree_node_write() is
fine - it's a noop - except that the btree_node_just_written bit causes
bch2_btree_post_write_cleanup() to run (resorting the node), which then
causes a null ptr deref.

00078 Unable to handle kernel NULL pointer dereference at virtual address 000000000000009e
00078 Mem abort info:
00078   ESR = 0x0000000096000005
00078   EC = 0x25: DABT (current EL), IL = 32 bits
00078   SET = 0, FnV = 0
00078   EA = 0, S1PTW = 0
00078   FSC = 0x05: level 1 translation fault
00078 Data abort info:
00078   ISV = 0, ISS = 0x00000005
00078   CM = 0, WnR = 0
00078 user pgtable: 4k pages, 39-bit VAs, pgdp=000000007ed64000
00078 [000000000000009e] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
00078 Internal error: Oops: 0000000096000005 [#1] SMP
00078 Modules linked in:
00078 CPU: 75 PID: 1170 Comm: stress-ng-utime Not tainted 6.3.0-ktest-g5ef5b466e77e #2078
00078 Hardware name: linux,dummy-virt (DT)
00078 pstate: 80001005 (Nzcv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--)
00078 pc : btree_node_sort+0xc4/0x568
00078 lr : bch2_btree_post_write_cleanup+0x6c/0x1c0
00078 sp : ffffff803e30b350
00078 x29: ffffff803e30b350 x28: 0000000000000001 x27: ffffff80076e52a8
00078 x26: 0000000000000002 x25: 0000000000000000 x24: ffffffc00912e000
00078 x23: ffffff80076e52a8 x22: 0000000000000000 x21: ffffff80076e52bc
00078 x20: ffffff80076e5200 x19: 0000000000000000 x18: 0000000000000000
00078 x17: fffffffff8000000 x16: 0000000008000000 x15: 0000000008000000
00078 x14: 0000000000000002 x13: 0000000000000000 x12: 00000000000000a0
00078 x11: ffffff803e30b400 x10: ffffff803e30b408 x9 : 0000000000000001
00078 x8 : 0000000000000000 x7 : ffffff803e480000 x6 : 00000000000000a0
00078 x5 : 0000000000000088 x4 : 0000000000000000 x3 : 0000000000000010
00078 x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffffff80076e52a8
00078 Call trace:
00078  btree_node_sort+0xc4/0x568
00078  bch2_btree_post_write_cleanup+0x6c/0x1c0
00078  bch2_btree_node_write+0x108/0x148
00078  __btree_node_flush+0x104/0x160
00078  bch2_btree_node_flush0+0x1c/0x30
00078  journal_flush_pins.constprop.0+0x184/0x2d0
00078  __bch2_journal_reclaim+0x4d4/0x508
00078  bch2_journal_reclaim+0x1c/0x30
00078  __bch2_journal_preres_get+0x244/0x268
00078  bch2_trans_journal_preres_get_cold+0xa4/0x180
00078  __bch2_trans_commit+0x61c/0x1bb0
00078  bch2_setattr_nonsize+0x254/0x318
00078  bch2_setattr+0x5c/0x78
00078  notify_change+0x2bc/0x408
00078  vfs_utimes+0x11c/0x218
00078  do_utimes+0x84/0x140
00078  __arm64_sys_utimensat+0x68/0xa8
00078  invoke_syscall.constprop.0+0x54/0xf0
00078  do_el0_svc+0x48/0xd8
00078  el0_svc+0x14/0x48
00078  el0t_64_sync_handler+0xb0/0xb8
00078  el0t_64_sync+0x14c/0x150
00078 Code: 8b050265 910020c6 8b060266 910060ac (79402cad)
00078 ---[ end trace 0000000000000000 ]---

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:01 -04:00
Kent Overstreet
faa62a2036 bcachefs: alloc_v4_u64s() fix
With the recent bkey_ops.min_val_size addition, bkey values are
automatically extended to the size of the current version.

The check in bch2_alloc_v4_invalid() needs to be updated to take this
into account.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:01 -04:00
Kent Overstreet
a49bd8c007 bcachefs: Delete an incorrect bch2_trans_unlock()
This deletes a bch2_trans_unlock() call from __bch2_move_data(). It was
redundant; bch2_move_extent() has the correct unlock call, and it was
buggy because when move_extent calls bch2_extent_drop_ptrs() we don't
want the transaction to be unlocked yet - this fixes a btree_iter.c
assertion.

Fixes https://github.com/koverstreet/bcachefs/issues/511.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:01 -04:00
Kent Overstreet
d598a9b7e2 bcachefs: Use memcpy_u64s_small() for copying keys
Small performance optimization; an open coded loop is better than rep ;
movsq for small copies.
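
An open-coded copy of this kind is just a word-at-a-time loop. A minimal sketch, assuming non-overlapping, 8-byte-aligned buffers; the sm_ prefix marks it as an illustration rather than the kernel helper.

```c
#include <stdint.h>

/*
 * Copy u64s 64-bit words from src to dst with a plain loop.
 * For a handful of words this avoids the startup cost of a string
 * instruction; buffers must not overlap.
 */
static inline void sm_memcpy_u64s_small(void *dst, const void *src,
					unsigned u64s)
{
	uint64_t *d = dst;
	const uint64_t *s = src;

	while (u64s--)
		*d++ = *s++;
}
```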

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:01 -04:00
Kent Overstreet
73da30e8e0 bcachefs: Fix check_overlapping_extents()
An error check had a flipped conditional - whoops.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:01 -04:00
Kent Overstreet
4a2e5d7ba5 bcachefs: Replace a BUG_ON() with fatal error
A user hit this BUG_ON() - it's unclear how it happened, so replace it
with a fatal error that will cause us to go read only, and print out
more information.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:01 -04:00
Kent Overstreet
92e637cef4 bcachefs: Delete some dead code in bch2_replicas_gc_end()
bch2_replicas_gc_(start|end) is now only used for journal replicas
entries, which don't have bucket sector counts - so this code is
entirely dead and can be deleted.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:01 -04:00
Brian Foster
a7b29b8d9a bcachefs: mark journal replicas before journal write submission
The journal write submission path marks the associated replica
entries for journal data in journal_write_done(), which is just
after journal write bio submission. This creates a small window
where journal entries might have been written out, but the
associated replica is not marked such that recovery does not know
that the associated device contains journal data.

Move the replica marking a bit earlier in the write path such that
recovery is guaranteed to recognize that the device contains journal
data in the event of a crash.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:01 -04:00
Kent Overstreet
38e3d93fa1 bcachefs: Improved comment for bch2_replicas_gc2()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:01 -04:00
Kent Overstreet
cb1b479dc1 bcachefs: Fix quotas + snapshots
Now that we can reliably designate and find the master subvolume out of
a tree of snapshots, we can finally make quotas work with snapshots:

That is - quotas will now _ignore_ snapshot subvolumes, and only be in
effect for the master (non snapshot) subvolume.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:01 -04:00
Kent Overstreet
653693beea bcachefs: Add otime, parent to bch_subvolume
Add two new fields to bch_subvolume:
 - otime: creation time
 - parent: For snapshots, this is the id of the subvolume the snapshot
   was created from

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:01 -04:00
Kent Overstreet
1c59b483a3 bcachefs: BTREE_ID_snapshot_tree
This adds a new btree which gets us a persistent per-snapshot-tree
identifier.

 - BTREE_ID_snapshot_trees
 - KEY_TYPE_snapshot_tree
 - bch_snapshot now has a field that points to a snapshot_tree

This is going to be used to designate one snapshot ID/subvolume out of a
given tree of snapshots as the "main" subvolume, so that we can do quota
accounting in that subvolume and not the rest.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:01 -04:00