Commit Graph

7459 Commits

Author SHA1 Message Date
Li Lingfeng
38d11da522 dm: don't lock fs when the map is NULL in process of resume
Commit fa247089de ("dm: requeue IO if mapping table not yet available")
added a detection of whether the mapping table is available in the IO
submission process. If the mapping table is unavailable, it returns
BLK_STS_RESOURCE and requeues the IO.
This can lead to the following deadlock problem:

dm create                                      mount
ioctl(DM_DEV_CREATE_CMD)
ioctl(DM_TABLE_LOAD_CMD)
                               do_mount
                                vfs_get_tree
                                 ext4_get_tree
                                  get_tree_bdev
                                   sget_fc
                                    alloc_super
                                     // got &s->s_umount
                                     down_write_nested(&s->s_umount, ...);
                                   ext4_fill_super
                                    ext4_load_super
                                     ext4_read_bh
                                      submit_bio
                                      // submit and wait io end
ioctl(DM_DEV_SUSPEND_CMD)
dev_suspend
 do_resume
  dm_suspend
   __dm_suspend
    lock_fs
     freeze_bdev
      get_active_super
       grab_super
        // wait for &s->s_umount
        down_write(&s->s_umount);
  dm_swap_table
   __bind
    // set md->map(can't get here)

IO will be continuously requeued while the lock is held, since the
mapping table is NULL. At the same time, the mapping table can't be set
because the lock is unavailable.
Like request-based DM, bio-based DM also has the same problem.

It's not proper to just abort the IO if the mapping table is not
available. So skip lock_fs (i.e. clear the internal
DM_SUSPEND_LOCKFS_FLAG) when the mapping table is NULL; this allows the
DM table to be loaded and the IO submitted upon resume.
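
The shape of the fix in do_resume() (drivers/md/dm-ioctl.c), assuming
the usual live-table helpers; details are approximate:

 old_map = dm_get_live_table(md, &srcu_idx);
 if (!old_map)
         /* No live table yet: don't freeze the fs, that would
          * deadlock against a mounter holding s->s_umount while
          * waiting for its IO to complete. */
         suspend_flags &= ~DM_SUSPEND_LOCKFS_FLAG;
 dm_put_live_table(md, srcu_idx);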

Fixes: fa247089de ("dm: requeue IO if mapping table not yet available")
Cc: stable@vger.kernel.org
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-04-19 11:51:37 -04:00
Mikulas Patocka
aa7d7bc99f dm flakey: add an "error_reads" option
dm-flakey returns an error on reads if no other argument is specified.
This commit simplifies associated logic while formalizing an
"error_reads" argument and an ERROR_READS flag.

If no argument is specified, set ERROR_READS flag so that it behaves
just like before this commit.
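
For illustration (device and size are placeholders), these two tables
should now behave identically:

 # implicit: no feature args, reads error out as before
 dmsetup create flakey0 --table "0 409600 flakey /dev/sdb 0 0 1"
 # explicit: the new, formalized spelling
 dmsetup create flakey0 --table "0 409600 flakey /dev/sdb 0 0 1 1 error_reads"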

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-04-19 11:13:43 -04:00
Mikulas Patocka
e3675dc1e7 dm flakey: remove trailing space in the table line
Don't return a trailing space in the output of STATUSTYPE_TABLE.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-04-19 11:07:47 -04:00
Mikulas Patocka
98dba02d9a dm flakey: fix a crash with invalid table line
This command will crash with NULL pointer dereference:
 dmsetup create flakey --table \
  "0 `blockdev --getsize /dev/ram0` flakey /dev/ram0 0 0 1 2 corrupt_bio_byte 512"

Fix the crash by checking if arg_name is non-NULL before comparing it.

Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-04-19 11:06:36 -04:00
Mike Snitzer
3d32aaa7e6 dm ioctl: fix nested locking in table_clear() to remove deadlock concern
syzkaller found the following problematic rwsem locking (with write
lock already held):

 down_read+0x9d/0x450 kernel/locking/rwsem.c:1509
 dm_get_inactive_table+0x2b/0xc0 drivers/md/dm-ioctl.c:773
 __dev_status+0x4fd/0x7c0 drivers/md/dm-ioctl.c:844
 table_clear+0x197/0x280 drivers/md/dm-ioctl.c:1537

table_clear() first acquires _hash_lock for write
https://elixir.bootlin.com/linux/v6.2/source/drivers/md/dm-ioctl.c#L1520
down_write(&_hash_lock);

Then, before the lock is released at L1539, the path shown above runs:
table_clear -> __dev_status -> dm_get_inactive_table -> down_read
https://elixir.bootlin.com/linux/v6.2/source/drivers/md/dm-ioctl.c#L773
down_read(&_hash_lock);

It tries to acquire the same rwsem again, this time for read, while
still holding it for write -- resulting in the deadlock.

Fix this by moving table_clear()'s __dev_status() call to after its
up_write(&_hash_lock);
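
The reordering, roughly (names per the stack trace above):

 /* before: deadlocks, __dev_status() takes _hash_lock for read
  * while table_clear() still holds it for write */
 __dev_status(md, param);
 up_write(&_hash_lock);

 /* after */
 up_write(&_hash_lock);
 __dev_status(md, param);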

Cc: stable@vger.kernel.org
Reported-by: Zheng Zhang <zheng.zhang@email.ucr.edu>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-04-17 12:15:55 -04:00
Mike Snitzer
f7995089c5 dm: unexport dm_get_queue_limits()
There are no dm_get_queue_limits() callers outside of DM core and
there shouldn't be.

Also, remove its BUG_ON(!atomic_read(&md->holders)) to micro-optimize
__process_abnormal_io().

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-04-14 12:32:20 -04:00
Mike Snitzer
13f6facf3f dm: allow targets to require splitting WRITE_ZEROES and SECURE_ERASE
Introduce max_write_zeroes_granularity and
max_secure_erase_granularity flags in the dm_target struct.

If a target sets these, then DM core will split IO of these operation
types accordingly (in terms of max_write_zeroes_sectors and
max_secure_erase_sectors respectively).
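
A sketch of how a target opts in, assuming the flags are simple boolean
members of struct dm_target:

 /* in the target's ctr: ask DM core to split these ops */
 ti->max_write_zeroes_granularity = true;  /* split on max_write_zeroes_sectors */
 ti->max_secure_erase_granularity = true;  /* split on max_secure_erase_sectors */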

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-04-14 12:07:27 -04:00
Yangtao Li
3664ff82da dm: add helper macro for simple DM target module init and exit
Eliminate duplicate boilerplate code for simple modules that contain
a single DM target driver without any additional setup code.

Add a new module_dm() macro, which replaces the module_init() and
module_exit() with template functions that call dm_register_target()
and dm_unregister_target() respectively.
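
The macro expands to roughly the following (see
include/linux/device-mapper.h for the authoritative definition):

 #define module_dm(name) \
 static int __init dm_##name##_init(void) \
 { \
         return dm_register_target(&(name##_target)); \
 } \
 module_init(dm_##name##_init) \
 static void __exit dm_##name##_exit(void) \
 { \
         dm_unregister_target(&(name##_target)); \
 } \
 module_exit(dm_##name##_exit)

So a target whose struct is named, e.g., delay_target can replace its
init/exit boilerplate with a single module_dm(delay);.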

Signed-off-by: Yangtao Li <frank.li@vivo.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-04-11 12:09:08 -04:00
Tom Rix
306fbc2e04 dm raid: remove unused d variable
clang with W=1 reports
drivers/md/dm-raid.c:2212:15: error: variable
  'd' set but not used [-Werror,-Wunused-but-set-variable]
        unsigned int d;
                     ^
This variable is not used so remove it.

Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-04-11 12:01:01 -04:00
Yu Zhe
26cb62a285 dm: remove unnecessary (void*) conversions
Assignments from pointers of void * type do not require an explicit
cast in C.
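
For example (illustrative; bio->bi_private is a void * member):

 /* before */
 struct dm_io *io = (struct dm_io *)bio->bi_private;

 /* after -- void * converts implicitly in C */
 struct dm_io *io = bio->bi_private;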

Signed-off-by: Yu Zhe <yuzhe@nfschina.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-04-11 12:01:01 -04:00
Yangtao Li
990f61e43c dm mirror: add DMERR message if alloc_workqueue fails
Signed-off-by: Yangtao Li <frank.li@vivo.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-04-11 12:01:01 -04:00
Yangtao Li
b362c733ed dm: push error reporting down to dm_register_target()
Simplifies each DM target's init method by making dm_register_target()
responsible for its error reporting (on behalf of targets).

Signed-off-by: Yangtao Li <frank.li@vivo.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-04-11 12:01:01 -04:00
Mike Snitzer
6b79a428c0 dm integrity: call kmem_cache_destroy() in dm_integrity_init() error path
Otherwise the journal_io_cache will leak if dm_register_target() fails.
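
The shape of the fix (journal_io_cache from the description; the
surrounding code is approximate):

 r = dm_register_target(&integrity_target);
 if (r) {
         kmem_cache_destroy(journal_io_cache);  /* previously leaked */
         return r;
 }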

Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-04-04 13:39:25 -04:00
Mike Snitzer
6827af4a9a dm clone: call kmem_cache_destroy() in dm_clone_init() error path
Otherwise the _hydration_cache will leak if dm_register_target() fails.

Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-04-04 13:30:17 -04:00
Mikulas Patocka
b6bcb84446 dm error: add discard support
Add io_err_io_hints() and set discard limits so that the error target
advertises support for discards.

The error target will return -EIO for discards.

This is useful when the user combines dm-error with other
discard-supporting targets in the same table; without dm-error
support, discards would be disabled for the whole combined device.
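
A sketch of the new hook, using the standard ->io_hints signature (the
exact limit values are assumptions); dm-zero's zero_io_hints() below is
analogous:

 static void io_err_io_hints(struct dm_target *ti, struct queue_limits *limits)
 {
         /* advertise discard support; the map function still fails the IO */
         limits->max_discard_sectors = UINT_MAX;
         limits->discard_granularity = 512;
 }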

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Tested-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-04-04 13:30:17 -04:00
Mikulas Patocka
00065f925e dm zero: add discard support
Add zero_io_hints() and set discard limits so that the zero target
advertises support for discards.

The zero target will ignore discards.

This is useful when the user combines dm-zero with other
discard-supporting targets in the same table; without dm-zero support,
discards would be disabled for the whole combined device.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Tested-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-04-04 13:30:17 -04:00
Mikulas Patocka
85c938e891 dm table: allow targets without devices to set ->io_hints
In dm_calculate_queue_limits(), add a call to the ->io_hints hook if
the target doesn't provide ->iterate_devices.

This is needed so the "error" and "zero" targets may support discards.
The two following commits will add their respective discard support.
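
The change amounts to something like this in
dm_calculate_queue_limits() (sketch; ti_limits is the existing local):

 if (!ti->type->iterate_devices && ti->type->io_hints)
         ti->type->io_hints(ti, &ti_limits);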

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Tested-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-04-04 13:30:17 -04:00
Michael Weiß
074c44664f dm verity: emit audit events on verification failure and more
dm-verity signals integrity violations by returning I/O errors to user
space. To allow a controlling instance to identify integrity
violations, the kernel audit subsystem can be used to emit audit events
to user space. Analogous to dm-integrity, we use the dm-audit submodule
to emit audit events on verification failures of metadata and data
blocks, as well as when the maximum number of corrupted errors is
reached.

The construction and destruction of verity device mappings are
also relevant for auditing a system. Thus, those events are also
logged as audit events.

Tested by starting a container with the container manager (cmld) of
GyroidOS, which uses a dm-verity protected rootfs image root.img mapped
to /dev/mapper/<uuid>-root. One block was manipulated in the underlying
image file, and repeated reads of the verity device were performed
until the maximum number of corrupted errors was reached, e.g.:

  dd if=/dev/urandom of=root.img bs=512 count=1 seek=1000
  for i in {1..101}; do
    dd if=/dev/mapper/<uuid>-root of=/dev/null bs=4096 \
       count=1 skip=1000
  done

The resulting audit log looks as follows:

  type=DM_CTRL msg=audit(1677618791.876:962):
    module=verity op=ctr ppid=4876 pid=29102 auid=0 uid=0 gid=0
    euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=44
    comm="cmld" exe="/usr/sbin/cml/cmld" subj=unconfined
    dev=254:3 error_msg='success' res=1

  type=DM_EVENT msg=audit(1677619463.786:1074): module=verity
    op=verify-data dev=7:0 sector=1000 res=0
  ...
  type=DM_EVENT msg=audit(1677619596.727:1162): module=verity
    op=verify-data dev=7:0 sector=1000 res=0

  type=DM_EVENT msg=audit(1677619596.731:1163): module=verity
    op=max-corrupted-errors dev=254:3 sector=? res=0

Signed-off-by: Michael Weiß <michael.weiss@aisec.fraunhofer.de>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-04-04 13:30:05 -04:00
Yeongjin Gil
e8c5d45f82 dm verity: fix error handling for check_at_most_once on FEC
In verity_end_io(), if bi_status is not BLK_STS_OK, it can be returned
directly. But if FEC is configured, the data page should be corrected
through verity_verify_io(), with the return value converted to a
blk_status_t and passed to verity_finish_io().

However, when a bit is set in v->validated_blocks, verity_verify_io()
skips verification regardless of an I/O error on the corresponding bio.
In this case the I/O error is not returned properly and, as a result,
abnormal data can be read from the corresponding block.

To fix this, when an I/O error occurs, do not skip verification even if
the corresponding bit is set in v->validated_blocks.
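
A sketch of the adjusted skip check in verity_verify_io() (names
approximate; the iterator advancement is omitted):

 if (v->validated_blocks && bio->bi_status == BLK_STS_OK &&
     likely(test_bit(cur_block, v->validated_blocks))) {
         /* only skip re-verification when the underlying read succeeded */
         continue;
 }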

Fixes: 843f38d382 ("dm verity: add 'check_at_most_once' option to only validate hashes once")
Cc: stable@vger.kernel.org
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Yeongjin Gil <youngjin.gil@samsung.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-04-04 11:03:39 -04:00
Joe Thornber
363b7fd76c dm: improve hash_locks sizing and hash function
Both bufio and bio-prison-v1 use the identical model for splitting
their respective locks and rbtrees. Improve dm_num_hash_locks() to
distribute across more rbtrees to improve overall performance -- but
the maximum number of locks/rbtrees is still 64.

Also factor out a common hash function named dm_hash_locks_index();
the magic numbers used were determined to be best using this program:
 https://gist.github.com/jthornber/e05c47daa7b500c56dc339269c5467fc

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30 15:57:51 -04:00
Mike Snitzer
b6279f82eb dm bio prison v1: intelligently size dm_bio_prison's prison_regions
Size the dm_bio_prison's number of prison_region structs using
dm_num_hash_locks().

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30 15:57:51 -04:00
Mike Snitzer
c6273411d1 dm bio prison v1: prepare to intelligently size dm_bio_prison's prison_regions
Add num_locks member to dm_bio_prison struct and use it rather than
the NR_LOCKS magic value (64).

Next commit will size the dm_bio_prison's prison_regions according to
dm_num_hash_locks().

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30 15:57:51 -04:00
Mike Snitzer
1e84c4b732 dm bufio: intelligently size dm_buffer_cache's buffer_trees
Size the dm_buffer_cache's number of buffer_tree structs using
dm_num_hash_locks().

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30 15:57:51 -04:00
Mike Snitzer
36c18b8639 dm bufio: prepare to intelligently size dm_buffer_cache's buffer_trees
Add num_locks member to dm_buffer_cache struct and use it rather than
the NR_LOCKS magic value (64).

Next commit will size the dm_buffer_cache's buffer_trees according to
dm_num_hash_locks().

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30 15:57:51 -04:00
Mike Snitzer
0bac3f2f28 dm: add dm_num_hash_locks()
Simple helper to use when DM core code needs to appropriately size,
based on num_online_cpus(), its data structures that split locks.

dm_num_hash_locks() rounds up num_online_cpus() to next power of 2
but caps return at DM_HASH_LOCKS_MAX (64).
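
A sketch matching the described heuristic (roundup_pow_of_two() and
min_t() are standard kernel helpers):

 static inline unsigned int dm_num_hash_locks(void)
 {
         unsigned int num_locks = roundup_pow_of_two(num_online_cpus());

         return min_t(unsigned int, num_locks, DM_HASH_LOCKS_MAX); /* 64 */
 }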

This heuristic may evolve as warranted, but as-is it will serve as a
more informed basis for sizing the sharded lock structs in dm-bufio's
dm_buffer_cache (buffer_trees) and dm-bio-prison-v1's dm_bio_prison
(prison_regions).

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30 15:57:51 -04:00
Mike Snitzer
3f8d3f5432 dm bio prison v1: add dm_cell_key_has_valid_range
Don't have bio_detain() BUG_ON if a dm_cell_key is beyond
BIO_PRISON_MAX_RANGE or spans a boundary.

Update dm-thin.c:build_key() to use dm_cell_key_has_valid_range() which
will do this checking without using BUG_ON. Also update
process_discard_bio() to check the discard bio that DM core passes in
(having first imposed max_discard_granularity based splitting).

dm_cell_key_has_valid_range() merely WARN_ON_ONCEs before returning
false, because an invalid range is a programmer error that should be
caught with proper testing. So relax the BUG_ONs to be WARN_ON_ONCE.
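
A sketch consistent with the description (BIO_PRISON_MAX_RANGE_SHIFT is
assumed from the related region-splitting work; exact form may differ):

 bool dm_cell_key_has_valid_range(struct dm_cell_key *key)
 {
         if (WARN_ON_ONCE(key->block_end - key->block_begin > BIO_PRISON_MAX_RANGE))
                 return false;
         if (WARN_ON_ONCE((key->block_begin >> BIO_PRISON_MAX_RANGE_SHIFT) !=
                          ((key->block_end - 1) >> BIO_PRISON_MAX_RANGE_SHIFT)))
                 return false;
         return true;
 }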

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30 15:57:51 -04:00
Joe Thornber
e2dd8aca2d dm bio prison v1: improve concurrent IO performance
Split the bio prison into multiple regions, with a separate rbtree and
associated lock for each region.

To get fast bio prison locking without damaging the performance of
discards too much, the bio-prison now stipulates that discards should
not cross a BIO_PRISON_MAX_RANGE boundary.

Because the range of a key (block_end - block_begin) must not exceed
BIO_PRISON_MAX_RANGE: break_up_discard_bio() now ensures the data
range reflected in PHYSICAL key doesn't exceed BIO_PRISON_MAX_RANGE.
And splitting the thin target's discards (handled with VIRTUAL key) is
achieved by updating dm-thin.c to set limits->max_discard_sectors in
terms of BIO_PRISON_MAX_RANGE _and_ setting the thin and thin-pool
targets' max_discard_granularity to true.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30 15:57:51 -04:00
Mike Snitzer
06961c487a dm: split discards further if target sets max_discard_granularity
The block core (bio_split_discard) will already split discards based
on the 'discard_granularity' and 'max_discard_sectors' queue_limits.
But the DM thin target also needs to ensure that it doesn't receive a
discard that spans a 'max_discard_sectors' boundary.

Introduce a dm_target 'max_discard_granularity' flag that if set will
cause DM core to split discard bios relative to 'max_discard_sectors'.
This treats 'discard_granularity' as a "min_discard_granularity" and
'max_discard_sectors' as a "max_discard_granularity".

Requested-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30 15:57:50 -04:00
Joe Thornber
bb46c56165 dm thin: speed up cell_defer_no_holder()
Reduce the time that a spinlock is held in cell_defer_no_holder().

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30 15:57:50 -04:00
Mikulas Patocka
56c5de4406 dm bufio: use multi-page bio vector
The kernel supports multi-page bio vector entries, so we can use them
in dm-bufio as an optimization.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30 15:57:50 -04:00
Mikulas Patocka
f5f9354120 dm bufio: use waitqueue_active in __free_buffer_wake
Save one spinlock acquisition by using waitqueue_active(). We hold the
bufio lock at this point, so no one can add entries to the waitqueue.
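
The pattern, roughly (wait-queue member name as in dm-bufio; sketch):

 /* bufio lock held: no new waiters can appear, so this test is stable */
 if (waitqueue_active(&c->free_buffer_wait))
         wake_up(&c->free_buffer_wait);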

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30 15:57:50 -04:00
Mike Snitzer
530f683ddc dm bufio: move dm_bufio_client members to avoid spanning cachelines
The movement also consolidates holes in the dm_bufio_client struct,
though the overall size of the struct is unchanged.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30 15:57:50 -04:00
Joe Thornber
791188065b dm bufio: add lock_history optimization for cache iterators
Sometimes it is beneficial to repeatedly get and drop locks as part of
an iteration. Introduce a lock_history struct to help avoid redundant
drops and gets of the same lock.

Optimizes cache_iterate, cache_mark_many and cache_evict.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30 15:57:50 -04:00
Joe Thornber
450e8dee51 dm bufio: improve concurrent IO performance
When multiple threads perform IO to a thin device, the underlying
dm_bufio object can become a bottleneck; slowing down access to btree
nodes that store the thin metadata. Prior to this commit, each bufio
instance had a single mutex that was taken for every bufio operation.

This commit concentrates on improving the common case where: a user of
dm_bufio wishes to access, but not modify, a buffer which is already
within the dm_bufio cache.

Implementation::

  The code has been refactored; pulling out an 'lru' abstraction and a
  'buffer cache' abstraction (see 2 previous commits). This commit
  updates higher level bufio code (that performs allocation of buffers,
  IO and eviction/cache sizing) to leverage both abstractions. It also
  deals with the delicate locking requirements of both abstractions to
  provide finer grained locking. The result is significantly better
  concurrent IO performance.

  Before this commit, bufio had a global lru list it used to evict the
  oldest, clean buffers from _all_ clients. With the new locking we
  don't want different ways to access the same buffer, so instead
  do_global_cleanup() loops around the clients asking them to free
  buffers older than a certain time.

  This commit also converts many old BUG_ONs to WARN_ON_ONCE, see the
  lru_evict and cache_evict code in particular.  They will return
  ER_DONT_EVICT if a given buffer somehow meets the invariants that
  should _never_ happen. [Aside from revising this commit's header and
  fixing coding style and whitespace nits: this switching to
  WARN_ON_ONCE is Mike Snitzer's lone contribution to this commit]

Testing::

  Some of the low level functions have been unit tested using dm-unit:
    https://github.com/jthornber/dm-unit/blob/main/src/tests/bufio.rs

  Higher level concurrency and IO is tested via a test only target
  found here:
    https://github.com/jthornber/linux/blob/2023-03-24-thin-concurrency-9/drivers/md/dm-bufio-test.c

  The associated userland side of these tests is here:
    https://github.com/jthornber/dmtest-python/blob/main/src/dmtest/bufio/bufio_tests.py

  In addition the full dmtest suite of tests (dm-thin, dm-cache, etc)
  has been run (~450 tests).

Performance::

  Most bufio operations have unchanged performance. But if multiple
  threads are attempting to get buffers concurrently, and these
  buffers are already in the cache, then there's a big speedup. E.g.,
  one test has 16 'hotspot' threads simulating btree lookups while
  another thread dirties the whole device. In this case the hotspot
  threads acquire the buffers about 25 times faster.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30 15:57:50 -04:00
Joe Thornber
2cd7a6d41f dm bufio: add dm_buffer_cache abstraction
The buffer cache is responsible for managing the holder count,
tracking clean/dirty state, and choosing buffers via predicates.
Higher level code is responsible for allocation of buffers, IO and
eviction/cache sizing.

The buffer cache has thread safe methods for acquiring a reference
to an existing buffer. All other methods in buffer cache are _not_
threadsafe, and only contain enough locking to guarantee the safe
methods.

Rather than a single mutex, sharded rw_semaphores are used to allow
concurrent threads to 'get' buffers. Each rw_semaphore protects its
own rbtree of buffer entries.

Code that uses this new dm_buffer_cache abstraction will be introduced
in a following commit.

This commit moves the dm_buffer struct in preparation for finer grained
dm_buffer changes, in the next commit, to be more easily seen. It also
introduces temporary dm_buffer struct members to allow compilation of
this intermediate commit (they will be elided in the next commit).

This commit will cause "defined but not used" compiler warnings that
will be resolved by the next commit.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30 15:57:50 -04:00
Joe Thornber
be845babda dm bufio: add LRU abstraction
A CLOCK algorithm is used in this LRU abstraction. This avoids
relinking list nodes, which would otherwise require a write lock to
protect the list.
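
For reference, the core of a CLOCK scheme looks something like this
(purely illustrative; not the dm-bufio structures):

 struct lru_entry {
         struct list_head list;
         atomic_t referenced;    /* set on hit; cleared by the eviction sweep */
 };

 /* a hit only sets a flag -- no relinking, so no write lock needed */
 static inline void lru_reference(struct lru_entry *le)
 {
         atomic_set(&le->referenced, 1);
 }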

None of the LRU methods are threadsafe; locking must be done at a
higher level.

Code that uses this new LRU will be introduced in the next 2 commits.

As such, this commit will cause "defined but not used" compiler warnings
that will be resolved by the next 2 commits.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30 15:57:50 -04:00
Mike Snitzer
b75a80f4f5 dm bufio: don't bug for clear developer oversight
It is reasonable to relax these to WARN_ON because the conditions are
easily avoided, yet the warnings still offer some assurance that future
coding mistakes will be noticed (if changes are tested properly).

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30 15:57:50 -04:00
Mike Snitzer
0511228752 dm bufio: never crash if dm_bufio_in_request()
All these instances are entirely avoidable given that they speak to
coding mistakes that result in inappropriate use. Proper testing during
development will catch any such coding bug, so it's best to relax all
of these from BUG_ON to WARN_ON_ONCE.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30 15:57:50 -04:00
Mike Snitzer
555977dd68 dm bufio: use WARN_ON in dm_bufio_client_destroy and dm_bufio_exit
Using BUG_ON when tearing down is excessive. Relax these to WARN_ONs.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30 15:57:50 -04:00
Joe Thornber
96a2ff2a63 dm bufio: remove unused dm_bufio_release_move interface
It was used by the multi-snapshot DM target that never went upstream.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30 15:57:50 -04:00
Mike Snitzer
666eed4676 dm: fix __send_duplicate_bios() to always allow for splitting IO
Commit 7dd76d1fee ("dm: improve bio splitting and associated IO
accounting") only called setup_split_accounting() from
__send_duplicate_bios() if a single bio was being issued. But the case
where duplicate bios are issued must call it too.

Otherwise the bio won't be split and resubmitted (via recursion through
block core back to DM) to submit the later portions of a bio (which may
map to an entirely different target).

For example, when discarding an entire DM striped device with the
following DM table:
 vg-lvol0: 0 159744 striped 2 128 7:0 2048 7:1 2048
 vg-lvol0: 159744 45056 striped 2 128 7:2 2048 7:3 2048

Before (broken, discards the first striped target's devices twice):
 device-mapper: striped: target_stripe=0, bdev=7:0, start=2048 len=79872
 device-mapper: striped: target_stripe=1, bdev=7:1, start=2048 len=79872
 device-mapper: striped: target_stripe=0, bdev=7:0, start=2048 len=22528
 device-mapper: striped: target_stripe=1, bdev=7:1, start=2048 len=22528

After (works as expected):
 device-mapper: striped: target_stripe=0, bdev=7:0, start=2048 len=79872
 device-mapper: striped: target_stripe=1, bdev=7:1, start=2048 len=79872
 device-mapper: striped: target_stripe=0, bdev=7:2, start=2048 len=22528
 device-mapper: striped: target_stripe=1, bdev=7:3, start=2048 len=22528

Fixes: 7dd76d1fee ("dm: improve bio splitting and associated IO accounting")
Cc: stable@vger.kernel.org
Reported-by: Orange Kao <orange@aiven.io>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30 15:54:32 -04:00
Mike Snitzer
f7b58a69fa dm: fix improper splitting for abnormal bios
"Abnormal" bios include discards, write zeroes and secure erase. By no
longer passing the calculated 'len' pointer, commit 7dd06a2548 ("dm:
allow dm_accept_partial_bio() for dm_io without duplicate bios") took a
senseless approach to disallowing dm_accept_partial_bio() from working
for duplicate bios processed using __send_duplicate_bios().

It inadvertently and incorrectly stopped the use of 'len' when
initializing a target's io (in alloc_tio). As such, the resulting tio
could address a larger area of the device than it should.

For example, when discarding an entire DM striped device with the
following DM table:
 vg-lvol0: 0 159744 striped 2 128 7:0 2048 7:1 2048
 vg-lvol0: 159744 45056 striped 2 128 7:2 2048 7:3 2048

Before this fix:

 device-mapper: striped: target_stripe=0, bdev=7:0, start=2048 len=102400
 blkdiscard: attempt to access beyond end of device
 loop0: rw=2051, sector=2048, nr_sectors = 102400 limit=81920

 device-mapper: striped: target_stripe=1, bdev=7:1, start=2048 len=102400
 blkdiscard: attempt to access beyond end of device
 loop1: rw=2051, sector=2048, nr_sectors = 102400 limit=81920

After this fix:

 device-mapper: striped: target_stripe=0, bdev=7:0, start=2048 len=79872
 device-mapper: striped: target_stripe=1, bdev=7:1, start=2048 len=79872

Fixes: 7dd06a2548 ("dm: allow dm_accept_partial_bio() for dm_io without duplicate bios")
Cc: stable@vger.kernel.org
Reported-by: Orange Kao <orange@aiven.io>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30 15:54:32 -04:00
Linus Torvalds
5ad4fe9613 - Fix DM thin to work as a swap device by using 'limit_swap_bios' DM
target flag (initially added to allow swap to dm-crypt) to throttle
   the amount of outstanding swap bios.
 
 - Fix DM crypt soft lockup warnings by calling cond_resched() from the
   cpu intensive loop in dmcrypt_write().
 
 - Fix DM crypt to not access an uninitialized tasklet. This fix allows
   for consistent handling of IO completion, by _not_ needlessly punting
   to a workqueue when tasklets are not needed.
 
 - Fix DM core's alloc_dev() initialization for DM stats to check for
   and propagate alloc_percpu() failure.

Merge tag 'for-6.3/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:

 - Fix DM thin to work as a swap device by using 'limit_swap_bios' DM
   target flag (initially added to allow swap to dm-crypt) to throttle
   the amount of outstanding swap bios.

 - Fix DM crypt soft lockup warnings by calling cond_resched() from the
   cpu intensive loop in dmcrypt_write().

 - Fix DM crypt to not access an uninitialized tasklet. This fix allows
   for consistent handling of IO completion, by _not_ needlessly punting
   to a workqueue when tasklets are not needed.

 - Fix DM core's alloc_dev() initialization for DM stats to check for
   and propagate alloc_percpu() failure.

* tag 'for-6.3/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm stats: check for and propagate alloc_percpu failure
  dm crypt: avoid accessing uninitialized tasklet
  dm crypt: add cond_resched() to dmcrypt_write()
  dm thin: fix deadlock when swapping to thin device
2023-03-24 14:20:48 -07:00
Jiasheng Jiang
d3aa3e060c dm stats: check for and propagate alloc_percpu failure
Check alloc_percpu()'s return value and return an error from
dm_stats_init() if it fails. Update alloc_dev() to fail if
dm_stats_init() does.

Otherwise, a NULL pointer dereference will occur in dm_stats_cleanup()
even if dm-stats isn't being actively used.
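
A sketch of the added check in dm_stats_init(), which now returns an
errno (member and type names follow dm-stats.c but are approximate):

 stats->last = alloc_percpu(struct dm_stats_last_position);
 if (!stats->last)
         return -ENOMEM;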

Fixes: fd2ed4d252 ("dm: add statistics support")
Cc: stable@vger.kernel.org
Signed-off-by: Jiasheng Jiang <jiasheng@iscas.ac.cn>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-16 13:37:06 -04:00
Jens Axboe
23e5b9307e Merge branch 'md-fixes' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into block-6.3
Pull MD fixes from Song:

"This set contains two fixes for old issues (by Neil) and one fix
 for 6.3 (by Xiao)."

* 'md-fixes' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
  md: select BLOCK_LEGACY_AUTOLOAD
  md: avoid signed overflow in slot_store()
  md: Free resources in __md_stop
2023-03-15 12:18:07 -06:00
NeilBrown
6c0f589883 md: select BLOCK_LEGACY_AUTOLOAD
When BLOCK_LEGACY_AUTOLOAD is not enabled, mdadm is not able to
activate new arrays unless "CREATE names=yes" appears in mdadm.conf.

As this is a regression, we need to always enable BLOCK_LEGACY_AUTOLOAD
when MD is selected - at least until mdadm is updated and the updated
version is widely available.
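
The Kconfig change amounts to something like (config symbol assumed to
be BLK_DEV_MD):

 config BLK_DEV_MD
         tristate "RAID support"
         select BLOCK_LEGACY_AUTOLOAD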

Cc: stable@vger.kernel.org # v5.18+
Fixes: fbdee71bb5 ("block: deprecate autoloading based on dev_t")
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Song Liu <song@kernel.org>
2023-03-15 11:12:14 -07:00
Yu Kuai
5f27571382 block: count 'ios' and 'sectors' when io is done for bio-based device
While using iostat for raid, I occasionally observed a very strange
'await', and it turns out that 'ios' and 'sectors' are counted in
bdev_start_io_acct(), while 'nsecs' is counted in bdev_end_io_acct().
I'm not sure why they are counted like that, but this behaviour is
obviously wrong because users will get wrong disk stats.

Fix the problem by counting 'ios' and 'sectors' when io is done, like
what rq-based device does.
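
A sketch of the accounting at completion time, using the generic
part_stat helpers (details approximate):

 const int sgrp = op_stat_group(op);

 part_stat_lock();
 part_stat_inc(bdev, ios[sgrp]);               /* moved from bdev_start_io_acct() */
 part_stat_add(bdev, sectors[sgrp], sectors);  /* ditto */
 part_stat_add(bdev, nsecs[sgrp], jiffies_to_nsecs(jiffies - start_time));
 part_stat_unlock();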

Fixes: 394ffa503b ("blk: introduce generic io stat accounting help function")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230223091226.1135678-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-03-15 09:25:04 -06:00
NeilBrown
3bc5729227 md: avoid signed overflow in slot_store()
slot_store() uses kstrtouint() to get a slot number, but stores the
result in an "int" variable (by casting a pointer).
This can result in a negative slot number if the unsigned int value is
very large.

A negative number means that the slot is empty, but setting a negative
slot number this way will not remove the device from the array.  I don't
think this is a serious problem, but it could cause confusion and it is
best to fix it.
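
The problematic pattern, and one hedged way to reject the overflow (the
exact upstream check may differ):

 int slot;
 /* kstrtouint() writes an unsigned value through the cast pointer */
 err = kstrtouint(buf, 10, (unsigned int *)&slot);
 if (err < 0)
         return err;
 /* e.g. "4000000000" parses fine but leaves 'slot' negative */
 if (slot < 0)
         return -ENOSPC;  /* reject the overflowed value */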

Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Song Liu <song@kernel.org>
2023-03-13 12:50:54 -07:00
Xiao Ni
3e45352259 md: Free resources in __md_stop
If md_run() fails after ->active_io is initialized, percpu_ref_exit()
is called in the error path. However, md_free_disk() will later call
percpu_ref_exit() again, which leads to a panic due to a NULL pointer
dereference. The bug can also trigger when resources are initialized
but freed in the error path, and then freed again in md_free_disk().

BUG: kernel NULL pointer dereference, address: 0000000000000038
Oops: 0000 [#1] PREEMPT SMP
Workqueue: md_misc mddev_delayed_delete
RIP: 0010:free_percpu+0x110/0x630
Call Trace:
 <TASK>
 __percpu_ref_exit+0x44/0x70
 percpu_ref_exit+0x16/0x90
 md_free_disk+0x2f/0x80
 disk_release+0x101/0x180
 device_release+0x84/0x110
 kobject_put+0x12a/0x380
 kobject_put+0x160/0x380
 mddev_delayed_delete+0x19/0x30
 process_one_work+0x269/0x680
 worker_thread+0x266/0x640
 kthread+0x151/0x1b0
 ret_from_fork+0x1f/0x30

When creating a raid device, md raid calls do_md_run() -> md_run() and
dm raid calls md_run(); the memory is allocated in md_run(). When
stopping a raid device, md raid calls do_md_stop() -> __md_stop() and
dm raid calls md_stop() -> __md_stop(). So we can free those memory
resources in __md_stop().

Fixes: 72adae23a7 ("md: Change active_io to percpu")
Reported-and-tested-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
2023-03-13 10:56:54 -07:00
Mike Snitzer
d9a02e016a dm crypt: avoid accessing uninitialized tasklet
When neither "no_read_workqueue" nor "no_write_workqueue" are enabled,
tasklet_trylock() in crypt_dec_pending() may still return false due to
an uninitialized state, and dm-crypt will unnecessarily do io completion
in io_queue workqueue instead of current context.

Fix this by adding an 'in_tasklet' flag to the dm_crypt_io struct and
initializing it to false in crypt_io_init(). Set the flag to true in
kcryptd_queue_crypt() before calling tasklet_schedule(). If set,
crypt_dec_pending() will punt io completion to a workqueue.

This also nicely avoids the tasklet_trylock/unlock hack when tasklets
aren't in use.
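
The flag's lifecycle, roughly (field names per the description; sketch):

 /* crypt_io_init(): */
 io->in_tasklet = false;

 /* kcryptd_queue_crypt(), just before scheduling the tasklet: */
 io->in_tasklet = true;
 tasklet_schedule(&io->tasklet);

 /* crypt_dec_pending(): */
 if (io->in_tasklet) {
         /* tasklet context: defer bio_endio() to the workqueue */
         queue_work(cc->io_queue, &io->work);
         return;
 }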

Fixes: 8e14f61015 ("dm crypt: do not call bio_endio() from the dm-crypt tasklet")
Cc: stable@vger.kernel.org
Reported-by: Hou Tao <houtao1@huawei.com>
Suggested-by: Ignat Korchagin <ignat@cloudflare.com>
Reviewed-by: Ignat Korchagin <ignat@cloudflare.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-09 10:04:38 -05:00