Commit fa247089de ("dm: requeue IO if mapping table not yet available")
added a detection of whether the mapping table is available in the IO
submission process. If the mapping table is unavailable, it returns
BLK_STS_RESOURCE and requeues the IO.
This can lead to the following deadlock problem:
dm create mount
ioctl(DM_DEV_CREATE_CMD)
ioctl(DM_TABLE_LOAD_CMD)
do_mount
vfs_get_tree
ext4_get_tree
get_tree_bdev
sget_fc
alloc_super
// got &s->s_umount
down_write_nested(&s->s_umount, ...);
ext4_fill_super
ext4_load_super
ext4_read_bh
submit_bio
// submit and wait io end
ioctl(DM_DEV_SUSPEND_CMD)
dev_suspend
do_resume
dm_suspend
__dm_suspend
lock_fs
freeze_bdev
get_active_super
grab_super
// wait for &s->s_umount
down_write(&s->s_umount);
dm_swap_table
__bind
// set md->map(can't get here)
IO will be continuously requeued while holding the lock since mapping
table is NULL. At the same time, mapping table won't be set since the
lock is not available.
Like request-based DM, bio-based DM also has the same problem.
It's not proper to just abort IO if the mapping table not available.
So clear DM_SKIP_LOCKFS_FLAG when the mapping table is NULL, this
allows the DM table to be loaded and the IO submitted upon resume.
Fixes: fa247089de ("dm: requeue IO if mapping table not yet available")
Cc: stable@vger.kernel.org
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
dm-flakey returns error on reads if no other argument is specified.
This commit simplifies associated logic while formalizing an
"error_reads" argument and an ERROR_READS flag.
If no argument is specified, set ERROR_READS flag so that it behaves
just like before this commit.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Don't return a trailing space in the output of STATUSTYPE_TABLE.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
This command will crash with NULL pointer dereference:
dmsetup create flakey --table \
"0 `blockdev --getsize /dev/ram0` flakey /dev/ram0 0 0 1 2 corrupt_bio_byte 512"
Fix the crash by checking if arg_name is non-NULL before comparing it.
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
syzkaller found the following problematic rwsem locking (with write
lock already held):
down_read+0x9d/0x450 kernel/locking/rwsem.c:1509
dm_get_inactive_table+0x2b/0xc0 drivers/md/dm-ioctl.c:773
__dev_status+0x4fd/0x7c0 drivers/md/dm-ioctl.c:844
table_clear+0x197/0x280 drivers/md/dm-ioctl.c:1537
In table_clear, it first acquires a write lock
https://elixir.bootlin.com/linux/v6.2/source/drivers/md/dm-ioctl.c#L1520
down_write(&_hash_lock);
Then before the lock is released at L1539, there is a path shown above:
table_clear -> __dev_status -> dm_get_inactive_table -> down_read
https://elixir.bootlin.com/linux/v6.2/source/drivers/md/dm-ioctl.c#L773
down_read(&_hash_lock);
It tries to acquire the same read lock again, resulting in the deadlock
problem.
Fix this by moving table_clear()'s __dev_status() call to after its
up_write(&_hash_lock);
Cc: stable@vger.kernel.org
Reported-by: Zheng Zhang <zheng.zhang@email.ucr.edu>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
There are no dm_get_queue_limits() callers outside of DM core and
there shouldn't be.
Also, remove its BUG_ON(!atomic_read(&md->holders)) to micro-optimize
__process_abnormal_io().
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Introduce max_write_zeroes_granularity and
max_secure_erase_granularity flags in the dm_target struct.
If a target sets these then DM core will split IO of these operation
types accordingly (in terms of max_write_zeroes_sectors and
max_secure_erase_sectors respectively).
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Eliminate duplicate boilerplate code for simple modules that contain
a single DM target driver without any additional setup code.
Add a new module_dm() macro, which replaces the module_init() and
module_exit() with template functions that call dm_register_target()
and dm_unregister_target() respectively.
Signed-off-by: Yangtao Li <frank.li@vivo.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
clang with W=1 reports
drivers/md/dm-raid.c:2212:15: error: variable
'd' set but not used [-Werror,-Wunused-but-set-variable]
unsigned int d;
^
This variable is not used so remove it.
Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Simplifies each DM target's init method by making dm_register_target()
responsible for its error reporting (on behalf of targets).
Signed-off-by: Yangtao Li <frank.li@vivo.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Add io_err_io_hints() and set discard limits so that the zero target
advertises support for discards.
The error target will return -EIO for discards.
This is useful when the user combines dm-error with other
discard-supporting targets in the same table; without dm-error
support, discards would be disabled for the whole combined device.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Tested-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Add zero_io_hints() and set discard limits so that the zero target
advertises support for discards.
The zero target will ignore discards.
This is useful when the user combines dm-zero with other
discard-supporting targets in the same table; without dm-zero support,
discards would be disabled for the whole combined device.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Tested-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
In dm_calculate_queue_limits, add call to ->io_hints hook if the
target doesn't provide ->iterate_devices.
This is needed so the "error" and "zero" targets may support
discards. The 2 following commits will add their respective discard
support.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Tested-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
dm-verity signals integrity violations by returning I/O errors
to user space. To identify integrity violations by a controlling
instance, the kernel audit subsystem can be used to emit audit
events to user space. Analogous to dm-integrity, we also use the
dm-audit submodule allowing to emit audit events on verification
failures of metadata and data blocks as well as if max corrupted
errors are reached.
The construction and destruction of verity device mappings are
also relevant for auditing a system. Thus, those events are also
logged as audit events.
Tested by starting a container with the container manager (cmld) of
GyroidOS which uses a dm-verity protected rootfs image root.img mapped
to /dev/mapper/<uuid>-root. One block was manipulated in the
underlying image file and repeated reads of the verity device were
performed again until the max corrupted errors is reached, e.g.:
dd if=/dev/urandom of=root.img bs=512 count=1 seek=1000
for i in range {1..101}; do \
dd if=/dev/mapper/<uuid>-root of=/dev/null bs=4096 \
count=1 skip=1000 \
done
The resulting audit log looks as follows:
type=DM_CTRL msg=audit(1677618791.876:962):
module=verity op=ctr ppid=4876 pid=29102 auid=0 uid=0 gid=0
euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=44
comm="cmld" exe="/usr/sbin/cml/cmld" subj=unconfined
dev=254:3 error_msg='success' res=1
type=DM_EVENT msg=audit(1677619463.786:1074): module=verity
op=verify-data dev=7:0 sector=1000 res=0
...
type=DM_EVENT msg=audit(1677619596.727:1162): module=verity
op=verify-data dev=7:0 sector=1000 res=0
type=DM_EVENT msg=audit(1677619596.731:1163): module=verity
op=max-corrupted-errors dev=254:3 sector=? res=0
Signed-off-by: Michael Weiß <michael.weiss@aisec.fraunhofer.de>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
In verity_end_io(), if bi_status is not BLK_STS_OK, it can be return
directly. But if FEC configured, it is desired to correct the data page
through verity_verify_io. And the return value will be converted to
blk_status and passed to verity_finish_io().
BTW, when a bit is set in v->validated_blocks, verity_verify_io() skips
verification regardless of I/O error for the corresponding bio. In this
case, the I/O error could not be returned properly, and as a result,
there is a problem that abnormal data could be read for the
corresponding block.
To fix this problem, when an I/O error occurs, do not skip verification
even if the bit related is set in v->validated_blocks.
Fixes: 843f38d382 ("dm verity: add 'check_at_most_once' option to only validate hashes once")
Cc: stable@vger.kernel.org
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Yeongjin Gil <youngjin.gil@samsung.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Both bufio and bio-prison-v1 use the identical model for splitting
their respective locks and rbtrees. Improve dm_num_hash_locks() to
distribute across more rbtrees to improve overall performance -- but
the maximum number of locks/rbtrees is still 64.
Also factor out a common hash function named dm_hash_locks_index(),
the magic numbers used were determined to be best using this program:
https://gist.github.com/jthornber/e05c47daa7b500c56dc339269c5467fc
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Add num_locks member to dm_bio_prison struct and use it rather than
the NR_LOCKS magic value (64).
Next commit will size the dm_bio_prison's prison_regions according to
dm_num_hash_locks().
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Add num_locks member to dm_buffer_cache struct and use it rather than
the NR_LOCKS magic value (64).
Next commit will size the dm_buffer_cache's buffer_trees according to
dm_num_hash_locks().
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Simple helper to use when DM core code needs to appropriately size,
based on num_online_cpus(), its data structures that split locks.
dm_num_hash_locks() rounds up num_online_cpus() to next power of 2
but caps return at DM_HASH_LOCKS_MAX (64).
This heuristic may evolve as warranted, but as-is it will serve as a
more informed basis for sizing the sharded lock structs in dm-bufio's
dm_buffer_cache (buffer_trees) and dm-bio-prison-v1's dm_bio_prison
(prison_regions).
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Don't have bio_detain() BUG_ON if a dm_cell_key is beyond
BIO_PRISON_MAX_RANGE or spans a boundary.
Update dm-thin.c:build_key() to use dm_cell_key_has_valid_range() which
will do this checking without using BUG_ON. Also update
process_discard_bio() to check the discard bio that DM core passes in
(having first imposed max_discard_granularity based splitting).
dm_cell_key_has_valid_range() will merely WARN_ON_ONCE if it returns
false because if it does: it is programmer error that should be caught
with proper testing. So relax the BUG_ONs to be WARN_ON_ONCE.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Split the bio prison into multiple regions, with a separate rbtree and
associated lock for each region.
To get fast bio prison locking and not damage the performance of
discards too much the bio-prison now stipulates that discards should
not cross a BIO_PRISON_MAX_RANGE boundary.
Because the range of a key (block_end - block_begin) must not exceed
BIO_PRISON_MAX_RANGE: break_up_discard_bio() now ensures the data
range reflected in PHYSICAL key doesn't exceed BIO_PRISON_MAX_RANGE.
And splitting the thin target's discards (handled with VIRTUAL key) is
achieved by updating dm-thin.c to set limits->max_discard_sectors in
terms of BIO_PRISON_MAX_RANGE _and_ setting the thin and thin-pool
targets' max_discard_granularity to true.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
The block core (bio_split_discard) will already split discards based
on the 'discard_granularity' and 'max_discard_sectors' queue_limits.
But the DM thin target also needs to ensure that it doesn't receive a
discard that spans a 'max_discard_sectors' boundary.
Introduce a dm_target 'max_discard_granularity' flag that if set will
cause DM core to split discard bios relative to 'max_discard_sectors'.
This treats 'discard_granularity' as a "min_discard_granularity" and
'max_discard_sectors' as a "max_discard_granularity".
Requested-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reduce the time that a spinlock is held in cell_defer_no_holder().
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
The kernel supports multi page bio vector entries, so we can use them
in dm-bufio as an optimization.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Save one spinlock by using waitqueue_active. We hold the bufio lock at
this place, so no one can add entries to the waitqueue at this point.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Movement also consolidates holes in dm_bufio_client struct. But the
overall size of the struct isn't changed.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Sometimes it is beneficial to repeatedly get and drop locks as part of
an iteration. Introduce lock_history struct to help avoid redundant
drop and gets of the same lock.
Optimizes cache_iterate, cache_mark_many and cache_evict.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
When multiple threads perform IO to a thin device, the underlying
dm_bufio object can become a bottleneck; slowing down access to btree
nodes that store the thin metadata. Prior to this commit, each bufio
instance had a single mutex that was taken for every bufio operation.
This commit concentrates on improving the common case where: a user of
dm_bufio wishes to access, but not modify, a buffer which is already
within the dm_bufio cache.
Implementation::
The code has been refactored; pulling out an 'lru' abstraction and a
'buffer cache' abstraction (see 2 previous commits). This commit
updates higher level bufio code (that performs allocation of buffers,
IO and eviction/cache sizing) to leverage both abstractions. It also
deals with the delicate locking requirements of both abstractions to
provide finer grained locking. The result is significantly better
concurrent IO performance.
Before this commit, bufio has a global lru list it used to evict the
oldest, clean buffers from _all_ clients. With the new locking we
don’t want different ways to access the same buffer, so instead
do_global_cleanup() loops around the clients asking them to free
buffers older than a certain time.
This commit also converts many old BUG_ONs to WARN_ON_ONCE, see the
lru_evict and cache_evict code in particular. They will return
ER_DONT_EVICT if a given buffer somehow meets the invariants that
should _never_ happen. [Aside from revising this commit's header and
fixing coding style and whitespace nits: this switching to
WARN_ON_ONCE is Mike Snitzer's lone contribution to this commit]
Testing::
Some of the low level functions have been unit tested using dm-unit:
https://github.com/jthornber/dm-unit/blob/main/src/tests/bufio.rs
Higher level concurrency and IO is tested via a test only target
found here:
https://github.com/jthornber/linux/blob/2023-03-24-thin-concurrency-9/drivers/md/dm-bufio-test.c
The associated userland side of these tests is here:
https://github.com/jthornber/dmtest-python/blob/main/src/dmtest/bufio/bufio_tests.py
In addition the full dmtest suite of tests (dm-thin, dm-cache, etc)
has been run (~450 tests).
Performance::
Most bufio operations have unchanged performance. But if multiple
threads are attempting to get buffers concurrently, and these
buffers are already in the cache then there's a big speed up. Eg,
one test has 16 'hotspot' threads simulating btree lookups while
another thread dirties the whole device. In this case the hotspot
threads acquire the buffers about 25 times faster.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
The buffer cache is responsible for managing the holder count,
tracking clean/dirty state, and choosing buffers via predicates.
Higher level code is responsible for allocation of buffers, IO and
eviction/cache sizing.
The buffer cache has thread safe methods for acquiring a reference
to an existing buffer. All other methods in buffer cache are _not_
threadsafe, and only contain enough locking to guarantee the safe
methods.
Rather than a single mutex, sharded rw_semaphores are used to allow
concurrent threads to 'get' buffers. Each rw_semaphore protects its
own rbtree of buffer entries.
Code that uses this new dm_buffer_cache abstraction will be introduced
in a following commit.
This commit moves the dm_buffer struct in preparation for finer grained
dm_buffer changes, in the next commit, to be more easily seen. It also
introduces temporary dm_buffer struct members to allow compilation of
this intermediate commit (they will be elided in the next commit).
This commit will cause "defined but not used" compiler warnings that
will be resolved by the next commit.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
A CLOCK algorithm is used in this LRU abstraction. This avoids
relinking list nodes, which would require a write lock protecting it.
None of the LRU methods are threadsafe; locking must be done at a
higher level.
Code that uses this new LRU will be introduced in the next 2 commits.
As such, this commit will cause "defined but not used" compiler warnings
that will be resolved by the next 2 commits.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reasonable to relax to WARN_ON because these are easily avoided but do
offer some assurance future coding mistakes won't occur (if changes
tested properly).
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
All these instances are entirely avoidable given that they speak to
coding mistakes that result in inappropriate use. Proper testing during
development will catch any such coding bug so its best to relax all of
these from BUG_ON to WARN_ON_ONCE.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Was used by multi-snapshot DM target that never went upstream.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Commit 7dd76d1fee ("dm: improve bio splitting and associated IO
accounting") only called setup_split_accounting() from
__send_duplicate_bios() if a single bio were being issued. But the case
where duplicate bios are issued must call it too.
Otherwise the bio won't be split and resubmitted (via recursion through
block core back to DM) to submit the later portions of a bio (which may
map to an entirely different target).
For example, when discarding an entire DM striped device with the
following DM table:
vg-lvol0: 0 159744 striped 2 128 7:0 2048 7:1 2048
vg-lvol0: 159744 45056 striped 2 128 7:2 2048 7:3 2048
Before (broken, discards the first striped target's devices twice):
device-mapper: striped: target_stripe=0, bdev=7:0, start=2048 len=79872
device-mapper: striped: target_stripe=1, bdev=7:1, start=2048 len=79872
device-mapper: striped: target_stripe=0, bdev=7:0, start=2049 len=22528
device-mapper: striped: target_stripe=1, bdev=7:1, start=2048 len=22528
After (works as expected):
device-mapper: striped: target_stripe=0, bdev=7:0, start=2048 len=79872
device-mapper: striped: target_stripe=1, bdev=7:1, start=2048 len=79872
device-mapper: striped: target_stripe=0, bdev=7:2, start=2048 len=22528
device-mapper: striped: target_stripe=1, bdev=7:3, start=2048 len=22528
Fixes: 7dd76d1fee ("dm: improve bio splitting and associated IO accounting")
Cc: stable@vger.kernel.org
Reported-by: Orange Kao <orange@aiven.io>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
"Abnormal" bios include discards, write zeroes and secure erase. By no
longer passing the calculated 'len' pointer, commit 7dd06a2548 ("dm:
allow dm_accept_partial_bio() for dm_io without duplicate bios") took a
senseless approach to disallowing dm_accept_partial_bio() from working
for duplicate bios processed using __send_duplicate_bios().
It inadvertently and incorrectly stopped the use of 'len' when
initializing a target's io (in alloc_tio). As such the resulting tio
could address more area of a device than it should.
For example, when discarding an entire DM striped device with the
following DM table:
vg-lvol0: 0 159744 striped 2 128 7:0 2048 7:1 2048
vg-lvol0: 159744 45056 striped 2 128 7:2 2048 7:3 2048
Before this fix:
device-mapper: striped: target_stripe=0, bdev=7:0, start=2048 len=102400
blkdiscard: attempt to access beyond end of device
loop0: rw=2051, sector=2048, nr_sectors = 102400 limit=81920
device-mapper: striped: target_stripe=1, bdev=7:1, start=2048 len=102400
blkdiscard: attempt to access beyond end of device
loop1: rw=2051, sector=2048, nr_sectors = 102400 limit=81920
After this fix;
device-mapper: striped: target_stripe=0, bdev=7:0, start=2048 len=79872
device-mapper: striped: target_stripe=1, bdev=7:1, start=2048 len=79872
Fixes: 7dd06a2548 ("dm: allow dm_accept_partial_bio() for dm_io without duplicate bios")
Cc: stable@vger.kernel.org
Reported-by: Orange Kao <orange@aiven.io>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
target flag (initially added to allow swap to dm-crypt) to throttle
the amount of outstanding swap bios.
- Fix DM crypt soft lockup warnings by calling cond_resched() from the
cpu intensive loop in dmcrypt_write().
- Fix DM crypt to not access an uninitialized tasklet. This fix allows
for consistent handling of IO completion, by _not_ needlessly punting
to a workqueue when tasklets are not needed.
- Fix DM core's alloc_dev() initialization for DM stats to check for
and propagate alloc_percpu() failure.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmQd2p0ACgkQxSPxCi2d
A1qvXAf/WCMNXRbFhO35QqukBqS7sUOMfWl1hIEdABRu+3Ul1KHBWzXVYWuWgebw
kr79V3LZG63cLvhreCy64X/0tXLZa0c0AGWZI6rJ/QAozSCs9R8BqOrJnB5GT1o9
/lvmOL31MloMnIKArWseIQViNM97gEHmFpuj0saqitcvNTjjipzxq/wOyhmDQwnE
8rxJpKSHBJXs9X/VyM9FTWxtijTQw3c8wxJJo7eV6TTuLyrErm46tyI1cBQ4vDoa
ogMVWVrf51uTsqL6DqGenDc+kO7CH5lipIJij1bTtKgs3aBNlaiZQC1nPkMST9Ue
hpH61ixAg+bsWi4/xLFafCl6QAGMlA==
=71ya
-----END PGP SIGNATURE-----
Merge tag 'for-6.3/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
- Fix DM thin to work as a swap device by using 'limit_swap_bios' DM
target flag (initially added to allow swap to dm-crypt) to throttle
the amount of outstanding swap bios.
- Fix DM crypt soft lockup warnings by calling cond_resched() from the
cpu intensive loop in dmcrypt_write().
- Fix DM crypt to not access an uninitialized tasklet. This fix allows
for consistent handling of IO completion, by _not_ needlessly punting
to a workqueue when tasklets are not needed.
- Fix DM core's alloc_dev() initialization for DM stats to check for
and propagate alloc_percpu() failure.
* tag 'for-6.3/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm stats: check for and propagate alloc_percpu failure
dm crypt: avoid accessing uninitialized tasklet
dm crypt: add cond_resched() to dmcrypt_write()
dm thin: fix deadlock when swapping to thin device
Check alloc_precpu()'s return value and return an error from
dm_stats_init() if it fails. Update alloc_dev() to fail if
dm_stats_init() does.
Otherwise, a NULL pointer dereference will occur in dm_stats_cleanup()
even if dm-stats isn't being actively used.
Fixes: fd2ed4d252 ("dm: add statistics support")
Cc: stable@vger.kernel.org
Signed-off-by: Jiasheng Jiang <jiasheng@iscas.ac.cn>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Pull MD fixes from Song:
"This set contains two fixes for old issues (by Neil) and one fix
for 6.3 (by Xiao)."
* 'md-fixes' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
md: select BLOCK_LEGACY_AUTOLOAD
md: avoid signed overflow in slot_store()
md: Free resources in __md_stop
When BLOCK_LEGACY_AUTOLOAD is not enable, mdadm is not able to
activate new arrays unless "CREATE names=yes" appears in
mdadm.conf
As this is a regression we need to always enable BLOCK_LEGACY_AUTOLOAD
for when MD is selected - at least until mdadm is updated and the
updates widely available.
Cc: stable@vger.kernel.org # v5.18+
Fixes: fbdee71bb5 ("block: deprecate autoloading based on dev_t")
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Song Liu <song@kernel.org>
While using iostat for raid, I observed very strange 'await'
occasionally, and turns out it's due to that 'ios' and 'sectors' is
counted in bdev_start_io_acct(), while 'nsecs' is counted in
bdev_end_io_acct(). I'm not sure why they are ccounted like that
but I think this behaviour is obviously wrong because user will get
wrong disk stats.
Fix the problem by counting 'ios' and 'sectors' when io is done, like
what rq-based device does.
Fixes: 394ffa503b ("blk: introduce generic io stat accounting help function")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230223091226.1135678-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
slot_store() uses kstrtouint() to get a slot number, but stores the
result in an "int" variable (by casting a pointer).
This can result in a negative slot number if the unsigned int value is
very large.
A negative number means that the slot is empty, but setting a negative
slot number this way will not remove the device from the array. I don't
think this is a serious problem, but it could cause confusion and it is
best to fix it.
Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Song Liu <song@kernel.org>
If md_run() fails after ->active_io is initialized, then percpu_ref_exit
is called in error path. However, later md_free_disk will call
percpu_ref_exit again which leads to a panic because of null pointer
dereference. It can also trigger this bug when resources are initialized
but are freed in error path, then will be freed again in md_free_disk.
BUG: kernel NULL pointer dereference, address: 0000000000000038
Oops: 0000 [#1] PREEMPT SMP
Workqueue: md_misc mddev_delayed_delete
RIP: 0010:free_percpu+0x110/0x630
Call Trace:
<TASK>
__percpu_ref_exit+0x44/0x70
percpu_ref_exit+0x16/0x90
md_free_disk+0x2f/0x80
disk_release+0x101/0x180
device_release+0x84/0x110
kobject_put+0x12a/0x380
kobject_put+0x160/0x380
mddev_delayed_delete+0x19/0x30
process_one_work+0x269/0x680
worker_thread+0x266/0x640
kthread+0x151/0x1b0
ret_from_fork+0x1f/0x30
For creating raid device, md raid calls do_md_run->md_run, dm raid calls
md_run. We alloc those memory in md_run. For stopping raid device, md raid
calls do_md_stop->__md_stop, dm raid calls md_stop->__md_stop. So we can
free those memory resources in __md_stop.
Fixes: 72adae23a7 ("md: Change active_io to percpu")
Reported-and-tested-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
When neither "no_read_workqueue" nor "no_write_workqueue" are enabled,
tasklet_trylock() in crypt_dec_pending() may still return false due to
an uninitialized state, and dm-crypt will unnecessarily do io completion
in io_queue workqueue instead of current context.
Fix this by adding an 'in_tasklet' flag to dm_crypt_io struct and
initialize it to false in crypt_io_init(). Set this flag to true in
kcryptd_queue_crypt() before calling tasklet_schedule(). If set
crypt_dec_pending() will punt io completion to a workqueue.
This also nicely avoids the tasklet_trylock/unlock hack when tasklets
aren't in use.
Fixes: 8e14f61015 ("dm crypt: do not call bio_endio() from the dm-crypt tasklet")
Cc: stable@vger.kernel.org
Reported-by: Hou Tao <houtao1@huawei.com>
Suggested-by: Ignat Korchagin <ignat@cloudflare.com>
Reviewed-by: Ignat Korchagin <ignat@cloudflare.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>